Timing mis-match when threading #189

Open
spdomin opened this issue Sep 5, 2017 · 12 comments

spdomin commented Sep 5, 2017

A while ago, I changed the timers in Nalu to use wall time, see:

spdomin/Nalu#41

However, when running a KNL/threaded/SIMD heat conduction case, I see a discrepancy between the STKPERF and STK::diag timers.

Moreover, there is unaccounted-for time spent somewhere in the simulation.

We need to resolve this for accurate timing benchmarking.

spdomin commented Sep 6, 2017

@sayerhs and @sthomas61, could you please post your timing overview for a select number of threaded cases that you have already run? Many thanks.

sayerhs commented Sep 6, 2017

cvfemHC run on NREL Peregrine

mpiexec -np 1 -map-by numa:PE=$OMP_NUM_THREADS

NP = 1; NT =  1; time = 3.15047
NP = 1; NT =  2; time = 1.90648
NP = 1; NT =  4; time = 1.32733
NP = 1; NT =  8; time = 1.00529
NP = 1; NT = 12; time = 0.973433

NP = 4; NT =  1; time = 0.895717
NP = 4; NT =  2; time = 0.669878
NP = 4; NT =  4; time = 0.530774
NP = 4; NT =  6; time = 0.521625
NP = 4; NT = 12; time = 0.412054

Milestone runs on Cori KNL

R3 mesh

OMP_NUM_THREADS=1 and SIMD enabled

Num. nodes     = 5
Ranks/node     = 64
Num. MPI Ranks = 320
Num. threads   = 1

Timing for Eq: MomentumEQS
             init --   	avg: 9.32076 	min: 9.00506 	max: 9.43626
         assemble --   	avg: 352.032 	min: 346.158 	max: 369.544
    load_complete --   	avg: 89.9271 	min: 72.5486 	max: 96.038
            solve --   	avg: 172.093 	min: 171.915 	max: 172.405
    precond setup --   	avg: 1.15501 	min: 1.04093 	max: 1.34501
             misc --   	avg: 67.6275 	min: 64.2702 	max: 71.2329
linear iterations --  	avg: 6.31 	min: 3 	max: 9
Timing for Eq: ContinuityEQS
             init --   	avg: 4.00154 	min: 3.8935 	max: 4.31504
         assemble --   	avg: 73.2754 	min: 72.0517 	max: 77.2341
    load_complete --   	avg: 11.6058 	min: 7.53058 	max: 12.8734
            solve --   	avg: 267.844 	min: 267.732 	max: 267.957
    precond setup --   	avg: 3.41102 	min: 3.40925 	max: 3.41267
             misc --   	avg: 93.3757 	min: 92.7854 	max: 93.9005
linear iterations --  	avg: 24.03 	min: 16 	max: 66
Timing for Eq: myTke
             init --   	avg: 3.90441 	min: 3.89822 	max: 3.9104
         assemble --   	avg: 101.531 	min: 100.053 	max: 104.843
    load_complete --   	avg: 11.1142 	min: 7.65569 	max: 12.6452
            solve --   	avg: 30.5247 	min: 30.3851 	max: 30.6953
    precond setup --   	avg: 0.306147 	min: 0.249461 	max: 0.35951
             misc --   	avg: 18.3381 	min: 17.9239 	max: 18.8291
linear iterations --  	avg: 6.19 	min: 6 	max: 7
Timing for Eq: myZ
             init --   	avg: 3.89162 	min: 3.88822 	max: 3.90603
         assemble --   	avg: 87.2831 	min: 85.8497 	max: 90.4471
    load_complete --   	avg: 10.6673 	min: 7.73012 	max: 12.1868
            solve --   	avg: 30.9104 	min: 30.758 	max: 31.0728
    precond setup --   	avg: 0.309746 	min: 0.253536 	max: 0.360611
             misc --   	avg: 18.1931 	min: 17.7951 	max: 18.6612
linear iterations --  	avg: 6.36 	min: 6 	max: 7
Timing for IO:
   io create mesh --    avg: 0.127286   min: 0.0903769  max: 0.149527
 io output fields --    avg: 1.13416    min: 0.990393   max: 1.30918
 io populate mesh --    avg: 4.37431    min: 4.37315    max: 4.3752
 io populate fd   --    avg: 0.137118   min: 0.121681   max: 0.16746
Timing for connectivity/finalize lysys:
         eqs init --    avg: 21.1016    min: 21.0948    max: 21.1157
Timing for property evaluation:
            props --    avg: 0.22788    min: 0.213932   max: 0.264211

Timing for Simulation: nprocs= 320
           main() --    avg: 1514.38    min: 1514.02    max: 1514.78
Memory Overview:
nalu memory: total (over all cores) current/high-water mark=       170.201 G      175.599 G
nalu memory:   min (over all cores) current/high-water mark=       489.422 M      500.238 M
nalu memory:   max (over all cores) current/high-water mark=       648.008 M      688.051 M
Min High-water memory usage 500.2 MB
Avg High-water memory usage 561.9 MB
Max High-water memory usage 688.1 MB

Min Available memory per processor 355.0 MB
Avg Available memory per processor 355.0 MB
Max Available memory per processor 355.0 MB

Min No-output time 1513.2880 sec
Avg No-output time 1513.6503 sec
Max No-output time 1514.0120 sec

STKPERF: Total Time: 1514.2400

STKPERF: Current memory: 619728896 (591 M)
STKPERF: Memory high water: 654135296 (623.8 M)
                                                      CPU Time              CPU Time              CPU Time              Wall Time              Wall Time             Wall Time
                 Timer                   Count   Sum (% of System)      Min (% of System)     Max (% of System)     Sum (% of System)      Min (% of System)     Max (% of System)
---------------------------------------- ----- ---------------------- --------------------- --------------------- ---------------------- --------------------- ---------------------
Nalu                                       320 134:36:58.140 (100.0%)    25:14.072 (0.312%)    25:14.852 (0.313%) 134:47:17.993 (100.0%)    25:16.327 (0.312%)    25:16.403 (0.313%)
  Output                                 32640      3:57.500 (0.049%)        0.636 (<0.01%)        0.832 (<0.01%)      6:06.836 (0.076%)        1.001 (<0.01%)        1.310 (<0.01%)

OMP_NUM_THREADS=4 and SIMD enabled

Timing for Eq: MomentumEQS
             init --   	avg: 9.99589 	min: 9.41529 	max: 10.3381
         assemble --   	avg: 209.155 	min: 203.251 	max: 225.619
    load_complete --   	avg: 71.7617 	min: 55.4674 	max: 77.1481
            solve --   	avg: 279.557 	min: 275.576 	max: 288.471
    precond setup --   	avg: 96.4698 	min: 87.7161 	max: 100.556
             misc --   	avg: 67.6853 	min: 64.5332 	max: 71.3672
linear iterations --  	avg: 6.39 	min: 3 	max: 9
Timing for Eq: ContinuityEQS
             init --   	avg: 4.44848 	min: 4.10739 	max: 5.03387
         assemble --   	avg: 52.7169 	min: 51.1537 	max: 55.7822
    load_complete --   	avg: 9.23072 	min: 6.06482 	max: 10.8432
            solve --   	avg: 291.393 	min: 291.232 	max: 291.495
    precond setup --   	avg: 3.33139 	min: 3.32898 	max: 3.33274
             misc --   	avg: 93.9669 	min: 93.259 	max: 94.5437
linear iterations --  	avg: 24.15 	min: 17 	max: 66
Timing for Eq: myTke
             init --   	avg: 4.15155 	min: 4.14291 	max: 4.1666
         assemble --   	avg: 78.1329 	min: 75.8163 	max: 81.3306
    load_complete --   	avg: 8.37531 	min: 4.96972 	max: 10.7142
            solve --   	avg: 40.7079 	min: 40.4596 	max: 41.2425
    precond setup --   	avg: 2.12437 	min: 1.68726 	max: 2.47353
             misc --   	avg: 19.307 	min: 18.5086 	max: 20.2199
linear iterations --  	avg: 6.19 	min: 6 	max: 7
Timing for Eq: myZ
             init --   	avg: 4.20123 	min: 4.18577 	max: 4.21865
         assemble --   	avg: 65.7189 	min: 63.3855 	max: 69.1568
    load_complete --   	avg: 9.62619 	min: 6.23328 	max: 11.8788
            solve --   	avg: 42.1167 	min: 41.7305 	max: 42.6231
    precond setup --   	avg: 2.26186 	min: 1.84747 	max: 2.61075
             misc --   	avg: 19.1199 	min: 18.388 	max: 19.9591
linear iterations --  	avg: 6.38 	min: 6 	max: 7
Timing for IO:
   io create mesh --    avg: 0.0985608  min: 0.0680749  max: 0.114286
 io output fields --    avg: 1.13279    min: 1.01908    max: 1.31905
 io populate mesh --    avg: 4.54962    min: 4.54801    max: 4.55121
 io populate fd   --    avg: 0.142953   min: 0.127554   max: 0.173288
Timing for connectivity/finalize lysys:
         eqs init --    avg: 22.7972    min: 22.7808    max: 22.8164
Timing for property evaluation:
            props --    avg: 0.251198   min: 0.212605   max: 0.382443

Timing for Simulation: nprocs= 320
           main() --    avg: 4973.9     min: 4676.17    max: 5125.88
Memory Overview:
nalu memory: total (over all cores) current/high-water mark=       232.196 G      264.317 G
nalu memory:   min (over all cores) current/high-water mark=       690.801 M      784.711 M
nalu memory:   max (over all cores) current/high-water mark=       870.621 M       978.18 M
Min High-water memory usage 784.7 MB
Avg High-water memory usage 845.8 MB
Max High-water memory usage 978.2 MB

Min Available memory per processor 355.0 MB
Avg Available memory per processor 355.0 MB
Max Available memory per processor 355.0 MB

Min No-output time 4675.3880 sec
Avg No-output time 4973.1384 sec
Max No-output time 5125.1640 sec

STKPERF: Total Time: 5013.1600

STKPERF: Current memory: 831344640 (792.8 M)
STKPERF: Memory high water: 954957824 (910.7 M)
                                                      CPU Time              CPU Time              CPU Time              Wall Time              Wall Time             Wall Time
                 Timer                   Count   Sum (% of System)      Min (% of System)     Max (% of System)     Sum (% of System)      Min (% of System)     Max (% of System)
---------------------------------------- ----- ---------------------- --------------------- --------------------- ---------------------- --------------------- ---------------------
Nalu                                       320 442:07:46.232 (100.0%)  1:17:56.216 (0.294%)  1:25:25.944 (0.322%) 131:28:22.387 (100.0%)    24:38.911 (0.312%)    24:39.154 (0.313%)
  Output                                 32640      4:10.332 (0.016%)        0.660 (<0.01%)        0.892 (<0.01%)      6:09.377 (0.078%)        1.057 (<0.01%)        1.349 (<0.01%)

sayerhs commented Sep 6, 2017

For reference, the attached file is the submit script I am using to submit jobs on Cori via sbatch. It was generated using the Cori script generator as a starting point, along with information from the example batch scripts on the NERSC Cori website.


#!/bin/bash

#SBATCH --nodes=5
#SBATCH --time=1:00:00
#SBATCH --job-name=r3_simd_knl
#SBATCH --partition=regular
#SBATCH --constraint=knl,quad,cache
#SBATCH --account=m2853
#SBATCH --output=out.%x_%j
#SBATCH -L SCRATCH,project
#SBATCH --core-spec=4

module load cmake/3.8.2 zlib/1.2.8 cray-parallel-netcdf/1.6.1 cray-netcdf-hdf5parallel/4.3.3.1 cray-hdf5-parallel/1.8.16 boost/1.61
# Swap out haswell modules for KNL ones
module unload craype-haswell
module load craype-mic-knl

ranks_per_node=64
mpi_ranks=$(expr $SLURM_JOB_NUM_NODES \* $ranks_per_node)
export OMP_NUM_THREADS=4  # Max hardware threads = 4

nalu_exec=/project/projectdirs/m2853/codes/nalu/nalu_builds/knl_simd/naluX

echo "Job name       = $SLURM_JOB_NAME"
echo "Num. nodes     = $SLURM_JOB_NUM_NODES"
echo "Num. MPI Ranks = $mpi_ranks"
echo "Num. threads   = $OMP_NUM_THREADS"

srun -n ${mpi_ranks} -c 4 --cpu_bind=cores $nalu_exec -i ksgsConsolidatedSIMD.i -o log.knl_simd_${mpi_ranks}_${ranks_per_node}_${OMP_NUM_THREADS}

spdomin commented Sep 6, 2017

Thanks. If you can add the full set of timers, e.g., total, STKPERF, etc., that would be useful.

I noticed the timers were wrong when I looked at my Cori times. I found at least one inconsistency in the usage of cpu_time:

ed37911

There may certainly be more lurking. However, before we tackle performance, I think we need to have confidence in the timers.
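
(For illustration only, this is not the code touched by ed37911: a minimal sketch of the kind of caliper inconsistency being described, where a wall clock and a CPU clock end up mixed in one elapsed-time computation. The function name is hypothetical.)

#include <chrono>
#include <ctime>

// Hypothetical example: the caliper starts on the wall clock but stops on the
// CPU clock.  The two clocks have different origins and, once threads are
// active, different rates, so the reported "elapsed" time is meaningless and
// grows with the thread count.
double elapsed_inconsistent()
{
  const double start = std::chrono::duration<double>(
      std::chrono::steady_clock::now().time_since_epoch()).count();  // wall seconds

  // ... threaded assemble/solve work ...

  const double stop = static_cast<double>(std::clock()) / CLOCKS_PER_SEC;  // CPU seconds
  return stop - start;  // mixes two unrelated clocks
}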

alanw0 commented Sep 6, 2017

I will try to reproduce the discrepancy on the local blade where I can run threaded and also use vtune for another view of where the time is being spent.

@sthomas61

McAlister run on Cori

Cori KNL:
McAlister Wing, MPI only, undersubscribed 66 cores per node,
ILUT smoother, HTS triangular solver:
25 nodes, 1650 cores, time: 960 secs
100 nodes, 6600 cores, time: 350 secs

alanw0 commented Sep 6, 2017

I think I have successfully reproduced this on my local blade with the cvfemHC nightly test. When I run with 8 MPI procs and 1 thread per MPI rank, the STKPERF time is very close to the "No-output time", which appears to be the sum of the timer line items for assembly, solve, etc.

But when I run 8 MPI procs with 2 threads per MPI rank, the 'No-output time' is much larger (nearly twice) than the STKPERF time, while the sum of the timer line items still appears to match the STKPERF time.

So it appears as though there is something wrong with the 'No-output time'. Perhaps it simply needs to be labeled better; perhaps it is really a "sum of CPU time". I notice that at the bottom of the log file, where it gives min/max/sum of both wall time and CPU time, the 'No-output time' is roughly the same as the CPU sum time, while the STKPERF time is roughly the same as the wall max time.
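
(A minimal standalone check of that interpretation, not Nalu code: for a CPU-bound OpenMP region, process CPU time grows roughly as threads × wall time, so a quantity that is really summed CPU time will read close to twice the wall-clock STKPERF time with 2 threads per rank. Compile with -fopenmp; the loop and names are illustrative.)

#include <chrono>
#include <cstdio>
#include <ctime>
#include <omp.h>

int main()
{
  const auto wall_start = std::chrono::steady_clock::now();
  const std::clock_t cpu_start = std::clock();   // process CPU time (all threads)

  double sink = 0.0;
  #pragma omp parallel reduction(+:sink)
  {
    // Every thread does the same CPU-bound work, so wall time stays roughly
    // constant while process CPU time scales with the thread count.
    for (long i = 0; i < 200000000L; ++i) sink += 1e-9 * static_cast<double>(i);
  }

  const double cpu_s  = static_cast<double>(std::clock() - cpu_start) / CLOCKS_PER_SEC;
  const double wall_s = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - wall_start).count();

  // Expect cpu/wall to approach the thread count (e.g. ~2 with OMP_NUM_THREADS=2).
  std::printf("wall: %.2f s  cpu: %.2f s  cpu/wall: %.2f  (threads=%d, sink=%g)\n",
              wall_s, cpu_s, cpu_s / wall_s, omp_get_max_threads(), sink);
  return 0;
}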

spdomin commented Sep 7, 2017

Sounds good. With my nalu.C change, feel free to close this if you feel that everything is looking in order for the threaded use case.

alanw0 commented Sep 7, 2017

It looks like there is some double-counting of time somewhere. For the hoHelium case, if I sum the times printed from dump_eq_time() on the equation systems, the sum is greater than the total time (something like 1330 secs vs 900 secs) for the 8-proc case with 2 threads per MPI proc. I'm running again with 1 thread per proc to see whether it is related to threading or not.

alanw0 commented Sep 7, 2017

There is some double-counting of time in the equation-system timers. For example, LowMachEquationSystem::solve_and_update() calls momentumEqSys_->compute_projected_nodal_gradient(), and time is accumulated both inside and outside that call.

In general, I don't know how important this is to fix immediately. For our milestone timings, we can measure overall runtime reliably using either the STKPERF time or the 'main()' time reported under 'Timing for Simulation'. But at some point I think we should clean up the way time is collected in the various equation-system derivations. There are quite a few places where those classes call methods on each other, start/stop timers outside those calls, and sometimes add the time to the other class's timer member. A cleaner approach would be for every class to access only its own timer members, but since the classes call each other there is still the potential for double-counting. Perhaps they need to access only a set of timers held at a higher level, such as the realm, or possibly just use child timers in the diag timer system. Summary: it's non-trivial to fix; we can discuss more.

spdomin commented Sep 7, 2017

I will look more at this tomorrow and see if I can find a path forward. However, it seems like I need to make sure that the assembly and solve timings are clean, since they are so critical to the SIMD/threaded work.

spdomin commented Sep 8, 2017

I agree that scrubbing the timers will take some time.

Most EqSys classes modify either timerAssemble_ (for field updates) or timerMisc_ (for wall-function variables, initialization, etc.). However, these seem to be well scoped.

As far as I can tell, the calipers in Nalu are okay; however, there is no notion of nesting, as there is in the standard Sierra timers. As such, we are double-counting here and there. This means that the sums do not close, although the individual times for, e.g., assemble, seem to be correct.

I would propose possibly transitioning to the Sierra timers, which have a notion of nesting built in from the beginning. We can discuss as needed offline.
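
(Illustrative only: a rough sketch of the nested stk::diag timer style being proposed. The header path and constructor signatures are from memory and may differ between STK versions, so treat this as the shape of the API rather than an exact drop-in.)

#include <stk_util/diag/Timer.hpp>

void run_equation_system(stk::diag::Timer &realmTimer)
{
  // Child timers register themselves under their parent, so time accumulated
  // in "assemble" is automatically a subset of "MomentumEQS", which is in turn
  // a subset of the realm timer -- the hierarchy prevents double-counting
  // across levels.
  static stk::diag::Timer eqTimer("MomentumEQS", realmTimer);
  static stk::diag::Timer assembleTimer("assemble", eqTimer);
  static stk::diag::Timer solveTimer("solve", eqTimer);

  {
    stk::diag::TimeBlock tb(assembleTimer);  // starts on construction,
    // ... assemble ...                      // stops when the scope exits
  }
  {
    stk::diag::TimeBlock tb(solveTimer);
    // ... solve ...
  }
  // The hierarchical CPU/wall table at the bottom of the Nalu log is produced
  // from this same timer tree.
}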
