
Segmentation faults related to injector's position copy #625

Closed · Tissot11 opened this issue May 12, 2023 · 26 comments

@Tissot11

Hi,

I am again having segmentation faults similar to what I reported in #611 for 1D simulations. That issue was fixed for the 1D case, but now I use the same parameters in a 2D setup with periodic boundary conditions in the y-direction for both particles and EM fields. I also tried PML boundary conditions, but I again get segmentation faults. I paste below part of the .out file:

Stack trace (most recent call last):
#12 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#11 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x477c28, in _start
#10 Object "/lib64/libc.so.6", at 0x2b6629efeac4, in __libc_start_main
#9 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x9f0984, in main
#8 Object "/mpcdf/soft/SLE_12/packages/x86_64/intel_oneapi/2022.3/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so", at 0x2b6629b22564, in __kmpc_fork_call
#7 Object "/mpcdf/soft/SLE_12/packages/x86_64/intel_oneapi/2022.3/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so", at 0x2b6629b6773c, in __kmp_fork_call
#6 Object "/mpcdf/soft/SLE_12/packages/x86_64/intel_oneapi/2022.3/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so", at 0x2b6629b66472, in
#5 Object "/mpcdf/soft/SLE_12/packages/x86_64/intel_oneapi/2022.3/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so", at 0x2b6629bf6b12, in __kmp_invoke_microtask
#4 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x9ef2b0, in main
#3 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x8ac14b, in VectorPatch::dynamics(Params&, SmileiMPI*, SimWindow*, RadiationTables&, MultiphotonBreitW$
#2 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x8ac6dc, in VectorPatch::dynamicsWithoutTasks(Params&, SmileiMPI*, SimWindow*, RadiationTables&, Multi$
#1 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0xa2ccac, in Species::dynamics(double, unsigned int, ElectroMagn*, Params&, bool, PartWalls*, Patch*, S$
#0 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x9c7dd9, in Projector2D2Order::currentsAndDensityWrapper(ElectroMagn*, Particles&, SmileiMPI*, int, in$
Segmentation fault (Address not mapped to object [0x801afadf8])

Any suggestions on how to proceed further?

@Tissot11 Tissot11 added the bug label May 12, 2023
@mccoys
Contributor

mccoys commented May 12, 2023

Are you using the master branch or the develop branch? Did you update recently?

The bug seems different to me. Does it happen in 1 configuration only? Only with injectors?

@Tissot11
Author

I fetched it in March from the develop branch where you had pushed the fix. But I have also downloaded it from the usual git clone link, so it must be the master branch. This bug seems to happen only with particle injectors. Another 2D simulation (a different plasma physics problem) without injectors seems to be running fine.

@mccoys
Contributor

mccoys commented May 13, 2023

Do you have a minimal input file for reproducing the bug?

@Tissot11
Author

This is the 2D version of the same input file that is crashing.

d15_th75_mi25.py.txt

@Tissot11
Author

Just to point out that this namelist also works fine if I use an older version of Smilei (v4.6), downloaded last year, together with older versions of the Intel compiler and Intel MPI.

@mccoys
Contributor

mccoys commented May 17, 2023

I made a much faster input file to reproduce the bug

# ----------------------------------------------------------------------------------------
#                     SIMULATION PARAMETERS
# ----------------------------------------------------------------------------------------
import math, random, os
import numpy as np                    
l0 = 1.0   
Lx = 512.0* l0                    
Ly = 8.0*l0                                         
tsim = 1.0*10**2                   
loc = 256.0* l0                  
dx = l0/5.                       
dy = l0/2. 
mi = 25.0                         
#mAlfven = 63.                      
#mSonic = 70.                       
vUPC = 0.15                        
angleDegrees = 75.0                

nUP = 1.0
u1x = 0.15
u1y = 0.0
TeUP = 0.001175
TiUP = 0.001175
B1x = 0.009859
B1y = 0.03679
E1z = ( - u1x*B1y + u1y*B1x)             
ppcu = 16                                


nDown = 3.99
u2x =  0.0375
u2y = 0.0001886
TeDown = 0.2156
TiDown = 0.8627
B2x = 0.00985
B2y = 0.14700
E2z = ( -u2x*B2y + u2y*B2x)                  
ppcd = 16                               

xin = -6*dx                  
yin = -6*dy  
               
slope1 = 50.0                
slope2 = 100.0               

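# timestep set to 95% of the 2D CFL limit, dt_max = 1/sqrt(dx**-2 + dy**-2)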
dt = float(0.95/np.sqrt( dx**-2 + dy**-2))

Main(
    geometry = "2Dcartesian",
    
    interpolation_order = 2,
    
    timestep = dt,
    simulation_time = tsim,
    
    cell_length = [dx, dy],
    grid_length  = [Lx, Ly],
    number_of_patches = [ 16, 2 ],
    
    EM_boundary_conditions = [ ['silver-muller','silver-muller'], ["periodic","periodic"] ] ,

)


def upStreamDens(x,y):
    return nUP* 0.5* ( 1 + np.tanh( - ( x - loc ) / slope1 ) )
    

Species(
    name = 'eon1',
    position_initialization = 'random',
    momentum_initialization = 'maxwell-juettner',
    particles_per_cell = 0,
    mass = 1.0,
    charge = -1.0,
    number_density = upStreamDens,
    mean_velocity = [u1x,u1y,0.0],
    temperature = [TeUP],
    boundary_conditions = [
       ["remove", "remove"], ["periodic","periodic"] ],
)
       

Species(
    name = 'ion1',
    position_initialization = 'random',
    momentum_initialization = 'mj',
    particles_per_cell = 0,
    mass = mi, 
    charge = 1.0,
    number_density = upStreamDens,
    mean_velocity = [u1x,u1y,0.0],
    temperature = [TiUP],
    boundary_conditions = [
    	["remove", "remove"], ["periodic","periodic"] ],
)

ParticleInjector(
    name      = "Inj_eon1",
    species   = "eon1",
    box_side  = "xmin",
    position_initialization = "random",
    mean_velocity = [u1x,u1y,0.0],
    number_density = nUP,
    particles_per_cell = ppcu,
)
ParticleInjector(
    name      = "Inj_ion1",
    species   = "ion1",
    box_side  = "xmin",
    position_initialization = "Inj_eon1",
    mean_velocity = [u1x,u1y,0.0],
    number_density = nUP,
    particles_per_cell = ppcu,
)

@mccoys
Contributor

mccoys commented May 17, 2023

OK, so the issue lies in the position copy from the first injector to the second. Are you sure this worked in 1D? I have not checked, but it seems the error would be the same.
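For reference, the relevant line in the namelist above is position_initialization = "Inj_eon1" in the second injector, which reuses the positions drawn by the first one. A minimal sketch of what such a position copy amounts to, assuming a plain NumPy array of freshly injected particle positions; this is an illustration only, not Smilei's actual implementation:

import numpy as np

rng = np.random.default_rng(0)

def draw_injected_positions(n_new, x_band, y_band):
    # first injector ("Inj_eon1"): draw random positions inside the
    # injection band next to the xmin boundary
    x = rng.uniform(x_band[0], x_band[1], n_new)
    y = rng.uniform(y_band[0], y_band[1], n_new)
    return np.column_stack((x, y))

eon_new = draw_injected_positions(16, (0.0, 0.2), (0.0, 8.0))
# second injector ("Inj_ion1"): copy the electron positions instead of
# drawing new ones, so both species are injected at the same locations
ion_new = eon_new.copy()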

@Tissot11
Author

It did work in 1D last time (in March) and it also works now. But there was one strange thing that I didn't mention before: on one machine (using older versions of the Intel compiler, MPI and HDF5), I could run the 1D simulation even without your fix from March. On this machine the current 1D version (from GitHub) is still working, but I haven't checked on the other machines. The 2D version is crashing anyway.

@mccoys
Contributor

mccoys commented May 17, 2023

I just pushed a fix in develop. Basic tests are passing, but I did not check that the physical picture is correct. Could you please tell me if everything works as you expect?

@mccoys mccoys changed the title from "Segmentation faults" to "Segmentation faults related to injector's position copy" on May 17, 2023
@Tissot11
Author

I have checked it on two systems and simulations seem to be running fine. I'll analyse the results and let you know if there is any concern.

@Tissot11
Author

An update: simulations using the Intel compiler + Intel MPI are working fine. However, on the Juwels supercomputer, using GCC/11.3.0 OpenMPI/4.1.4 HDF5/1.12.2, and also using Intel/2022.1.0 ParaStationMPI/5.8.0-1-mt HDF5/1.12.2, I either see segmentation faults or the simulation gets stuck (not at a fixed simulation time) with the same namelist. Which compilers and MPI versions do you recommend?

@mccoys
Contributor

mccoys commented May 23, 2023

Are these the same segfaults as before? Are you sure you have the same Smilei version on both machines?

These compilers should be OK.

@Tissot11
Author

Tissot11 commented May 23, 2023

No, they are different. I attach one .err file from a crashed simulation. The other simulation just got stuck during the computation. I use the same Smilei version, fetched from the develop branch last week, on each machine. I'm not sure whether this has something to do with the modules installed on Juwels.

tjob_hybrid.err.7685072.txt

@mccoys
Contributor

mccoys commented May 23, 2023

These might be related to memory limitations, I guess. If the processors are different, you may need to adapt the box decomposition (an illustration follows below).
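For example (illustrative values only, not a tested configuration), the decomposition is set in the namelist through number_of_patches; each entry is typically a power of 2, and the total number of patches must be at least the number of MPI processes:

# Illustration only: with the grid above (2560 x 16 cells), changing the
# decomposition from [16, 2] to [32, 2] gives 64 patches of 80 x 8 cells,
# which allows up to 64 MPI processes.
Main(
    geometry = "2Dcartesian",
    interpolation_order = 2,
    timestep = dt,
    simulation_time = tsim,
    cell_length = [dx, dy],
    grid_length  = [Lx, Ly],
    number_of_patches = [ 32, 2 ],   # was [ 16, 2 ]
    EM_boundary_conditions = [ ['silver-muller','silver-muller'], ["periodic","periodic"] ],
)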

@Tissot11
Author

Tissot11 commented May 25, 2023

Thanks, I'll try it. I attach the stderr file for the crashed simulation with the Intel/2022.1.0 ParaStationMPI/5.8.0-1-mt HDF5/1.12.2 modules. This message is different from the one before. The same simulation with Intel MPI is running fine.

tjob_hybrid.err.7713303.txt

@mccoys
Contributor

mccoys commented May 25, 2023

We have never tested with ParaStation MPI, nor even with MPICH. I recommend you keep using Intel MPI or Open MPI.

@Tissot11
Author

Tissot11 commented May 25, 2023

I would always prefer to use the Intel compiler and Intel MPI, as I get better performance compared to other combinations. However, on Juwels they plan to drop support for Intel MPI soon and recommend ParaStation MPI or OpenMPI with GCC. I had trouble with both of these combinations. I have yet to try the Intel + OpenMPI combination. Is this a combination you have already tested Smilei with?

@mccoys
Contributor

mccoys commented May 25, 2023

Yes, we have used that combination in the past, but things are never simple and subtle compiler settings may change things.

For instance, make sure that your MPI library was compiled with support for MPI_THREAD_MULTIPLE (a quick check is sketched below).
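A quick way to check what the MPI library actually provides, assuming mpi4py is installed and built against the same MPI library used to compile Smilei (a sketch, independent of Smilei):

# Prints the thread support level provided by the MPI library.
# Run with e.g.:  srun -n 1 python check_mpi_threads.py
from mpi4py import MPI

names = {
    MPI.THREAD_SINGLE:     "MPI_THREAD_SINGLE",
    MPI.THREAD_FUNNELED:   "MPI_THREAD_FUNNELED",
    MPI.THREAD_SERIALIZED: "MPI_THREAD_SERIALIZED",
    MPI.THREAD_MULTIPLE:   "MPI_THREAD_MULTIPLE",
}
provided = MPI.Query_thread()
print("Provided thread level:", names.get(provided, provided))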

@Tissot11
Author

Tissot11 commented Jun 1, 2023

I'm having trouble on one machine where Smilei compiles fine but gets stuck in the Poisson solver at t=0. I don't see any other errors, and this is the same namelist as pasted before. This namelist never gets stuck in the Poisson solver at t=0 anywhere else. On this cluster, user support and documentation are sub-par. Any suggestions on what to try? I used the following modules for compiling and running:

module load compiler/intel/2022.0.2 numlib/mkl/2022.0.2 mpi/impi/2021.5.1 lib/hdf5/1.12

I had to pass --mpi=pmi2 to the srun command on this machine for the simulation to start.

@mccoys
Contributor

mccoys commented Jun 1, 2023

Have you tried to compile with make config=no_mpi_tm?

@Tissot11
Author

Tissot11 commented Jun 1, 2023

No. With this option, should I only use MPI processes and not OpenMP threads for running the simulations?

@mccoys
Contributor

mccoys commented Jun 1, 2023

You can use OpenMP as usual. This option simply disables a capability of your MPI library, but Smilei can still run with MPI + OpenMP.

@Tissot11
Author

Tissot11 commented Jun 1, 2023

OK, thanks. I'll try it now and let you know this evening.

@Tissot11
Author

Tissot11 commented Jun 1, 2023

It doesn't help. The simulation still gets stuck in the Poisson solver at t=0.

@mccoys
Contributor

mccoys commented Jun 1, 2023

This looks like a problem with MPI. Maybe try to run with 1 OpenMP thread only, just to check whether it is due to OpenMP instead.

You should try to run test applications for MPI on this machine. Check the Intel benchmark suite, for instance (a minimal sanity test is sketched below).

Other than the configuration above, Smilei does not have specific MPI requirements. I don't think we can be of much help here.
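As a minimal sanity check, a small point-to-point test can help separate MPI problems from Smilei problems. A sketch assuming mpi4py is available, to be launched with the same srun/mpirun settings used for Smilei (e.g. srun --mpi=pmi2 -n 2 python pingpong.py, where pingpong.py is a hypothetical file name for the script below):

# Rank 0 sends a small array to rank 1 and waits for the echo.
# If this hangs or crashes, the problem is in the MPI setup, not in Smilei.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.arange(8, dtype='d')

if rank == 0:
    comm.Send(buf, dest=1, tag=0)
    comm.Recv(buf, source=1, tag=1)
    print("ping-pong OK on", MPI.Get_processor_name())
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)
    comm.Send(buf, dest=0, tag=1)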

@Tissot11
Author

Tissot11 commented Jun 1, 2023

Thanks. I'll try these suggestions. Since the original issue was already resolved, you can now close this ticket.

@mccoys mccoys closed this as completed Jun 1, 2023