Steady increase in memory use when running createMdsMesh #397

Closed · Thomas-Ulrich opened this issue Oct 10, 2023 · 5 comments

@Thomas-Ulrich (Contributor) commented Oct 10, 2023

Hi,

I'm trying to generate large meshes with SimModeler (the one I just generated has 657,117,285 tetrahedral cells). The SimModeler parallel generation uses ~150 GB of memory across 10 ranks, but when I then convert the mesh using this function:
https://github.com/SeisSol/PUMGen/blob/master/src/input/SimModSuite.h#L183C19-L183C32
I can see (with htop) that the memory used steadily increases until it reaches 880 GB. Could it be a memory leak?
I'm using [email protected]

Thanks in advance,
Thomas.

@cwsmith (Contributor) commented Oct 13, 2023

Sounds like it. Thank you for reporting. We'll run valgrind on a smaller mesh and see what it reports.
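
For reference, a typical way to run the conversion under valgrind with MPI (binary name, input paths, and rank count here are illustrative, not the exact reproduction case):

```sh
# memcheck with full leak details, one log per rank (%p expands to the pid)
mpirun -np 4 valgrind --leak-check=full --log-file=memcheck.%p.log \
  ./pumgen input.smd output

# heap profiling: massif writes one massif.out.<pid> file per process
mpirun -np 4 valgrind --tool=massif ./pumgen input.smd output
```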

@cwsmith (Contributor) commented Oct 29, 2024

Hi @Thomas-Ulrich. The function linked above appears to use only the SimModSuite API, without calls to pumi/apf. Is that correct?
I see the call to createMdsMesh(...) in this year-old version: https://github.com/SeisSol/PUMGen/blob/a847d1cef4f7f9f41fe8c6bd7c11fb9f91edcfc4/src/input/SimModSuite.h#L183
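
For anyone following along, a minimal sketch of that conversion path, assuming the usual gmi_sim/apfSIM/apfMDS APIs (the function and handle names here are placeholders, not PUMGen's actual variables):

```cpp
#include <apfMDS.h>
#include <apfSIM.h>
#include <gmi_sim.h>
#include <SimPartitionedMesh.h>

apf::Mesh2* convertToMds(pGModel simModel, pParMesh simMesh) {
  gmi_model* model = gmi_import_sim(simModel);        // wrap the Simmetrix model
  apf::Mesh2* simApfMesh = apf::createMesh(simMesh);  // apf view of the Simmetrix mesh
  // deep-copies every entity into PUMI's MDS storage; both meshes stay
  // resident until the wrapper and the Simmetrix mesh are released
  apf::Mesh2* mdsMesh = apf::createMdsMesh(model, simApfMesh);
  apf::destroyMesh(simApfMesh);  // drop the apf wrapper; simMesh itself
                                 // still needs M_release by the caller
  return mdsMesh;
}
```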

@Thomas-Ulrich (Contributor, Author)

Yes, @davschneller created a new version of PUMGen that does not depend on pumi, which fixes this memory problem.

@cwsmith (Contributor) commented Oct 31, 2024

I ran a 16-process, 28M-tet mesh generation (using SimModSuite) and conversion to a pumi/mds mesh via the generate utility under the valgrind massif tool. Nothing there is immediately concerning; notes are below.

Valgrind memcheck is running now. I'll update this post when that completes.

memcheck reports that all processes are leaking the same amount of data:

==1542614== LEAK SUMMARY:
==1542614==    definitely lost: 235,866 bytes in 1,523 blocks
==1542614==    indirectly lost: 262,406 bytes in 2,876 blocks
==1542614==      possibly lost: 71,744 bytes in 163 blocks
==1542614==    still reachable: 23,371,512 bytes in 19,065 blocks
==1542614==         suppressed: 0 bytes in 0 blocks

At first glance there appear to be a handful of SimModSuite objects that are not being freed properly; likely a mistake on our end. I'm looking into this now.
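
For reference, this is the kind of teardown we need to verify, roughly the start-up sequence in reverse (handle names illustrative; the exact set of calls depends on which SimModSuite components are in use). Missing any of the release/delete calls shows up as "definitely lost" blocks in memcheck:

```cpp
M_release(simMesh);         // partitioned Simmetrix mesh
GM_release(simModel);       // Simmetrix geometric model
Progress_delete(progress);  // progress handle passed to the mesh generator
SimModel_stop();
SimPartitionedMesh_stop();
Sim_unregisterAllKeys();    // license keys
```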

Summary

For this mesh, with generate built from master @ 54b2a9a, there doesn't appear to be significant memory growth, and there are no major leaks.

@Thomas-Ulrich Are you interested in debugging this further or are you all set with the non-pumi version of your tool?


Massif output

Most of the traces look OK, but a handful of them have peak usage samples near the end of execution.
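
Each massif.out.<pid>.txt below is the ms_print rendering of the corresponding raw massif.out.<pid> file, e.g.:

```sh
ms_print massif.out.1525776 > massif.out.1525776.txt
```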

(cmd)smithc11@wilbur: /space/cwsmith/pumi229Release/createMeshMemUsage/28M_16p $ grep peak *.txt
massif.out.1525776.txt: Detailed snapshots: [8, 12, 14, 19, 23, 25, 29, 32, 35, 44, 51, 61, 71, 72, 73 (peak)]
massif.out.1525777.txt: Detailed snapshots: [3, 4, 5, 6, 20, 23, 25, 27, 29, 30, 31, 32 (peak), 39, 56, 66, 76, 86]
massif.out.1525778.txt: Detailed snapshots: [1, 12, 13, 18, 22, 23, 25, 42, 45, 54, 64, 65, 66 (peak)]
massif.out.1525779.txt: Detailed snapshots: [2, 17, 18, 21, 27, 28, 29, 31 (peak), 34, 37, 48, 55, 65, 75]
massif.out.1525780.txt: Detailed snapshots: [1, 17, 24, 25, 27 (peak), 30, 44, 56]
massif.out.1525781.txt: Detailed snapshots: [1, 6, 9, 12, 18, 21, 23, 29, 44, 48, 51, 52 (peak)]
massif.out.1525782.txt: Detailed snapshots: [3, 11, 12, 21, 24, 25, 27 (peak), 29, 36, 41, 49, 59, 69]
massif.out.1525783.txt: Detailed snapshots: [2, 9, 11, 13, 16, 18, 20, 24, 25, 26, 28 (peak), 40, 54, 64, 74, 84]
massif.out.1525784.txt: Detailed snapshots: [3, 9, 11, 13, 18, 24, 26 (peak), 45, 52, 62, 72]
massif.out.1525785.txt: Detailed snapshots: [4, 9, 11, 17, 19, 23, 24, 25 (peak), 30, 39, 48, 57, 67, 77]
massif.out.1525786.txt: Detailed snapshots: [1, 9, 11, 15, 18, 22, 25, 26, 28 (peak), 37, 47, 53, 63, 73, 83]
massif.out.1525787.txt: Detailed snapshots: [1, 6, 12, 13, 15, 22, 25, 26, 28, 35, 44, 52, 62, 72, 82, 89 (peak)]
massif.out.1525788.txt: Detailed snapshots: [6, 12, 24, 25, 26, 28 (peak), 53, 63, 73, 83]
massif.out.1525789.txt: Detailed snapshots: [2, 5, 7, 11, 13, 16, 22, 23 (peak), 26, 38, 42, 46]
massif.out.1525790.txt: Detailed snapshots: [1, 10, 22, 23, 24, 26 (peak), 39, 43, 57, 67]
massif.out.1525791.txt: Detailed snapshots: [11, 14, 15, 26, 27, 28 (peak), 34, 57, 67, 77]

For example, here is process 1525777, whose peak is not at the end; it occurs during Simmetrix mesh generation. The increase towards the end comes from calling rebuild as part of writing the pumi mesh files (.smb).
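
That write at the end is just the native MDS write, something like the following (filename illustrative); the rebuild frames appear under this call in the massif trace:

```cpp
mdsMesh->writeNative("mesh.smb");  // per-part .smb files; triggers rebuild
```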

(ins)smithc11@wilbur: /space/cwsmith/pumi229Release/createMeshMemUsage/28M_16p $ head -n 40 massif.out.1525777.txt
--------------------------------------------------------------------------------
Command:            /space/cwsmith/pumi229Release/buildPumiOptonSimonOmegaoff_/test/generate /space/cwsmith/pumi229Release/createMeshMemUsage/upright.smd 28M
Massif arguments:   (none)
ms_print arguments: massif.out.1525777
--------------------------------------------------------------------------------


    GB
3.615^                                        ##                              
     |                                        #                               
     |                                        #                               
     |                                      @@#                               
     |                                      @ #                               
     |                                      @ #                               
     |                                      @ #                               
     |                                      @ #                               
     |                                      @ #             :::              :
     |                               @      @ #          :: : :              :
     |                             ::@::@:@@@ #          :::: :              :
     |                         ::@@: @: @:@@@ #        :@:::: :         @::  :
     |                        @::@ : @: @:@@@ #  ::   ::@:::: :::::::@:@@:::::
     |                        @::@ : @: @:@@@ #  : :::::@:::: :::::::@:@@:::::
     |                        @::@ : @: @:@@@ # :: : :::@:::: :::::::@:@@:::::
     |                        @::@ : @: @:@@@ # :: : :::@:::: :::::::@:@@:::::
     |                        @::@ : @: @:@@@ # :: : :::@:::: :::::::@:@@:::::
     |                      ::@::@ : @: @:@@@ # :: : :::@:::: :::::::@:@@::::@
     |                  ::::::@::@ : @: @:@@@ # :: : :::@:::: :::::::@:@@::::@
     |      @@::::::::::: ::::@::@ : @: @:@@@ # :: : :::@:::: :::::::@:@@::::@
   0 +----------------------------------------------------------------------->Gi
     0                                                                   862.8

Process 1525787, by contrast, has a peak near the end (the second-to-last sample) that is dominated by the memory used to load balance the partitioned mesh via apf::ZoltanBalancer::balance(...). rebuild (as described above) also appears in the log, but only represents about 9% of the heap usage.

(ins)smithc11@wilbur: /space/cwsmith/pumi229Release/createMeshMemUsage/28M_16p $ head -n 40  massif.out.1525787.txt
--------------------------------------------------------------------------------
Command:            /space/cwsmith/pumi229Release/buildPumiOptonSimonOmegaoff_/test/generate /space/cwsmith/pumi229Release/createMeshMemUsage/upright.smd 28M
Massif arguments:   (none)
ms_print arguments: massif.out.1525787
--------------------------------------------------------------------------------


    GB
2.196^                                                                       #
     |                                       @:@                             #
     |                                       @:@                             #
     |                                       @:@                             #
     |                                      @@:@               :             #
     |                                      @@:@            :::::            #
     |                                      @@:@            : @::        :   #
     |                                      @@:@            : @::       ::@ :#
     |                                      @@:@           :: @::::@:@:@::@::#
     |                                      @@:@          ::: @::: @:@:@::@ :#
     |                                      @@:@        : ::: @::: @:@:@::@ :#
     |                                    ::@@:@   :    ::::: @::: @:@:@::@ :#
     |                             ::::::@::@@:@ :::   :::::: @::: @:@:@::@ :#
     |                          @:@: ::::@::@@:@ : ::@::::::: @::: @:@:@::@ :#
     |                          @:@: ::::@::@@:@:: ::@::::::: @::: @:@:@::@ :#
     |                          @:@: ::::@::@@:@:: ::@::::::: @::: @:@:@::@ :#
     |                          @:@: ::::@::@@:@:: ::@::::::: @::: @:@:@::@ :#
     |                         @@:@: ::::@::@@:@:: ::@::::::: @::: @:@:@::@ :#
     |                      : :@@:@: ::::@::@@:@:: ::@::::::: @::: @:@:@::@ :#
     |  ::::::::::::::::::@::::@@:@: ::::@::@@:@:: ::@::::::: @::: @:@:@::@ :#
   0 +----------------------------------------------------------------------->Gi
     0                                                                   993.1

Number of snapshots: 91
 Detailed snapshots: [1, 6, 12, 13, 15, 22, 25, 26, 28, 35, 44, 52, 62, 72, 82, 89 (peak)]
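
For context, the balancing step looks roughly like this with PUMI's apfZoltan interface (method, approach, tolerance, and tag name here are illustrative, not what generate does verbatim):

```cpp
#include <apf.h>
#include <apfMesh2.h>
#include <apfZoltan.h>

void balanceElements(apf::Mesh2* m) {
  // uniform unit weight per element
  apf::MeshTag* weights = m->createDoubleTag("balance_weight", 1);
  double w = 1.0;
  apf::MeshIterator* it = m->begin(m->getDimension());
  while (apf::MeshEntity* e = m->iterate(it))
    m->setDoubleTag(e, weights, &w);
  m->end(it);

  // graph-based repartitioning; Zoltan builds the dual graph in memory,
  // which is what dominates the late peak in this massif trace
  apf::Balancer* b = apf::makeZoltanBalancer(m, apf::GRAPH, apf::REPARTITION);
  b->balance(weights, 1.05);  // tolerate 5% imbalance
  delete b;

  apf::removeTagFromDimension(m, weights, m->getDimension());
  m->destroyTag(weights);
}
```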

@Thomas-Ulrich (Contributor, Author)

Thank you for looking into it. I don't think it needs to be debugged further.
