Timing mismatch when threading #189
Comments
@sayerhs and @sthomas61, could you please post your timing overview for a select number of the threaded cases that you have already run? Many thanks.

For reference, the attached file is the submit script I am using to submit jobs on Cori.
Thanks. If you can add the full set of timers, e.g., total, STKPERF, etc., that would be useful. I noticed the timers were wrong when I looked at my Cori times. I found at least one inconsistency in the usage of cpu_time; there are almost certainly more lurking. However, before we tackle performance, I think we need to have confidence in the timers.
I will try to reproduce the discrepancy on the local blade, where I can run threaded and also use VTune for another view of where the time is being spent.
McAlister run on Cori KNL:
I think I have successfully reproduced this on my local blade with the cvfemHC nightly test. When I run with 8 MPI procs and 1 thread per MPI rank, the STKPERF time is very close to the 'No-output time', which appears to be the sum of the timer line items for assembly, solve, etc. But when I run 8 MPI procs with 2 threads per MPI rank, the 'No-output time' is much larger (nearly twice) than the STKPERF time, while the sum of the timer line items still matches the STKPERF time. So it appears as though there is something wrong with the 'No-output time'. Perhaps it simply needs to be labeled better; perhaps it is really a 'sum of cpu time'. I notice that at the bottom of the log file, where it gives min/max/sum of both wall time and cpu time, the 'No-output time' is roughly the same as the cpu sum, while the STKPERF time is roughly the same as the wall max.
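If the 'No-output time' really is summed CPU time, the factor-of-two mismatch is exactly what one would expect: for a CPU-bound region run on 2 threads, process CPU time accumulates on both threads while wall time does not. A minimal standalone sketch of that effect (not Nalu code; it assumes a Linux-like platform where std::clock() accumulates CPU time across all threads of the process):

```cpp
// Why a CPU-time-based timer reads roughly nthreads * wall time for a
// CPU-bound threaded region. On Linux, std::clock() accumulates CPU time
// across all threads; steady_clock measures elapsed wall time.
#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>
#include <vector>

void burn_cpu(double seconds) {
  const auto start = std::chrono::steady_clock::now();
  volatile double x = 0.0;
  while (std::chrono::duration<double>(
             std::chrono::steady_clock::now() - start).count() < seconds) {
    x += 1.0;  // keep the core busy
  }
}

int main() {
  const int nthreads = 2;  // analogous to 2 threads per MPI rank
  const std::clock_t c0 = std::clock();
  const auto w0 = std::chrono::steady_clock::now();

  std::vector<std::thread> workers;
  for (int i = 0; i < nthreads; ++i) workers.emplace_back(burn_cpu, 1.0);
  for (auto& t : workers) t.join();

  const double cpu = double(std::clock() - c0) / CLOCKS_PER_SEC;
  const double wall = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - w0).count();
  // Expect cpu ~ 2.0 s but wall ~ 1.0 s: a quantity derived from cpu_time
  // will look nearly twice the wall-clock (STKPERF-style) measurement.
  std::printf("cpu = %.2f s, wall = %.2f s\n", cpu, wall);
  return 0;
}
```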
Sounds good. With my nalu.C change, feel free to close this if you feel that everything is looking in order for the threaded use case.
It looks like there is some double-counting of time somewhere. For the hoHelium case, if I sum the times printed from dump_eq_time() on equation-system, the sum is greater than the total time (something like 1330 secs vs. 900 secs) for the 8-proc case with 2 threads per MPI proc. I'm running again with 1 thread per proc to see whether it's related to threading.
There is some double-counting of time in the equation-system timers. An example is LowMachEquationSystem::solve_and_update, which calls momentumEqSys_->compute_projected_nodal_gradient(), and time is accumulated both inside and outside that call (see the sketch below).

I don't know how important this is to fix immediately. For our milestone timings, we can measure overall runtime reliably using either the STKPERF time or the 'main()' time reported under 'Timing for Simulation'. But at some point I think we should clean up the way time is collected in the various equation-system derivations. There are quite a few places where those classes call methods on each other, start/stop timers outside those calls, and sometimes add the time to the other class's timer member. A cleaner way would be to have every class access only its own timer members, but since the classes call each other there is still the potential for double-counting. Perhaps they need to access only a set of timers held at a higher level, such as realm, or possibly just use child timers in the diag timer system. Summary: it's non-trivial to fix; we can discuss more.
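To make the pattern concrete, here is a hypothetical sketch of the double-counting described above (the class and method names come from the comment, but the timer members and timings are illustrative, not the actual Nalu implementation): the caller times all of solve_and_update(), while the inner call also charges its own timer, so the same interval shows up in two line items and the line items sum to more than the total.

```cpp
// Double-counting sketch: 100 ms of inner work is charged to timerPNG_
// inside the call and charged again to timerSolve_ by the caller's caliper.
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

struct MomentumEqSys {
  double timerPNG_ = 0.0;  // hypothetical projected-nodal-gradient timer
  void compute_projected_nodal_gradient() {
    const auto t0 = Clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(100));  // "work"
    timerPNG_ += std::chrono::duration<double>(Clock::now() - t0).count();
  }
};

struct LowMachEquationSystem {
  MomentumEqSys momentum_;
  double timerSolve_ = 0.0;  // hypothetical solve timer
  void solve_and_update() {
    const auto t0 = Clock::now();
    momentum_.compute_projected_nodal_gradient();  // already timed inside
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // own work
    // The 100 ms of PNG work is counted here a second time:
    timerSolve_ += std::chrono::duration<double>(Clock::now() - t0).count();
  }
};

int main() {
  LowMachEquationSystem lowMach;
  const auto t0 = Clock::now();
  lowMach.solve_and_update();
  const double total =
      std::chrono::duration<double>(Clock::now() - t0).count();
  const double sum = lowMach.timerSolve_ + lowMach.momentum_.timerPNG_;
  // Prints roughly: total = 0.15 s, sum of line items = 0.25 s
  std::printf("total = %.2f s, sum of line items = %.2f s\n", total, sum);
  return 0;
}
```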
I will look at this more tomorrow and see if I can find a path forward. However, it seems like I need to make sure that the assembly and solve timings are clean, since these are so critical to the SIMD/threaded work.
I agree that scrubbing the timers will take some time. Most EqSys classes modify either timerAssemble_ (for field-updates) or timerMisc_ (for wall function variables, initialization, etc.), and these seem to be scoped well. As far as I can tell, the calipers in Nalu are okay; however, there is no notion of nesting, as we have in the standard Sierra timers. As such, we are double-counting here and there. This means that the sums do not close, although the individual times for, e.g., assemble, seem to be correct. I would propose possibly transitioning to the Sierra timers, which have a built-in notion of nesting from the beginning. We can discuss as needed offline. A sketch of the nesting idea follows.
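For illustration only (this is not the Sierra/stk::diag API, just a single-threaded sketch of the concept): an RAII scope charges each elapsed interval to the innermost active timer, so a parent scope's time excludes its children and the per-scope times sum to the total with no double-counting.

```cpp
// Nested-timer sketch: a stack of active scopes; every interval is charged
// to the innermost scope only, so the reported times close to the total.
#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

class ScopedTimer {
 public:
  explicit ScopedTimer(std::string name) {
    charge_innermost();            // close out time owed to the parent scope
    stack().push_back(std::move(name));
  }
  ~ScopedTimer() {
    charge_innermost();            // charge remaining time to this scope
    stack().pop_back();
  }
  static void report() {
    for (const auto& kv : totals())
      std::printf("%-8s %.3f s\n", kv.first.c_str(), kv.second);
  }

 private:
  static void charge_innermost() {
    static Clock::time_point last = Clock::now();
    const auto now = Clock::now();
    if (!stack().empty())
      totals()[stack().back()] +=
          std::chrono::duration<double>(now - last).count();
    last = now;
  }
  static std::vector<std::string>& stack() {
    static std::vector<std::string> s;
    return s;
  }
  static std::map<std::string, double>& totals() {
    static std::map<std::string, double> t;
    return t;
  }
};

void busy(double s) {  // spin for s seconds of wall time
  const auto t0 = Clock::now();
  while (std::chrono::duration<double>(Clock::now() - t0).count() < s) {}
}

int main() {
  {
    ScopedTimer solve("solve");
    { ScopedTimer png("png"); busy(0.10); }  // charged to "png" only
    busy(0.05);                              // charged to "solve" only
  }
  ScopedTimer::report();  // png ~0.100 s, solve ~0.050 s; sum ~= total
  return 0;
}
```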
A while ago, I changed the timers in Nalu to use wall time, see:
spdomin/Nalu#41
However, when running a KNL/threaded/SIMD heat conduction case, I see a discrepancy between the STKPERF and stk::diag timers.
Moreover, there is unaccounted time spent somewhere in the simulation.
We need to resolve this for accurate timing benchmarking.