
Independent builds do not necessarily produce consistent results #25

Closed
spencerkclark opened this issue Sep 23, 2023 · 15 comments
Labels
question Further information is requested

Comments

@spencerkclark
Member

Is your question related to a problem? Please describe.

As part of a more involved development project, I am building and running SHiELD in a Docker image using GNU compilers. A test I am running depends on the model consistently producing bitwise-identical results for a given configuration. Puzzlingly, I am finding that the answers the model produces change depending on the build; specifically, they seem to flip randomly between two states.

This repository minimally illustrates my setup. It contains a Dockerfile which is used to build SHiELD using the COMPILE script in this repository, submodules for the relevant SHiELD source code, and some infrastructure for testing the model within the docker image. The README should contain all the information necessary for reproducing the issue locally (at least on a system that supports docker). The upshot is that the regression tests, which check for bit-for-bit reproducibility across builds, do not always pass.
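
To make the failure mode concrete, the regression check boils down to something like the following sketch (the image tags, test entry point, and restart paths here are placeholders, not the actual scripts in the repository):

```bash
# Build the image twice from scratch, run the same configuration in each,
# and compare checksums of the restart files the runs produce.
docker build --no-cache -t shield-minimal:a .
docker build --no-cache -t shield-minimal:b .

docker run --rm shield-minimal:a bash -c "./run_test.sh && md5sum RESTART/*.nc" > sums_a.txt
docker run --rm shield-minimal:b bash -c "./run_test.sh && md5sum RESTART/*.nc" > sums_b.txt

diff sums_a.txt sums_b.txt && echo "builds reproduce" || echo "builds differ"
```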

Describe what you have tried

I am a bit stumped at this point, so my idea here was to distill things to a minimal reproducible example and reach out to see if there is something obvious I am doing wrong. Is there an issue in my environment, or in how I am configuring the build, that is leading to this problem? I am happy to provide more information where needed. I appreciate your help!

@spencerkclark spencerkclark added the question Further information is requested label Sep 23, 2023
@laurenchilutti
Contributor

Which tests are not reproducing? Is it some of the tests in this (NOAA-GFDL/SHiELD_build) repository, or a test in the https://github.com/ai2cm/SHiELD-minimal/tree/main repository? The only reason I ask is that these tests in NOAA-GFDL/SHiELD_build/RTS/CI are known not to reproduce because of the add_noise nml variable in fv_core_nml:
d96_2k.solo.bubble
d96_2k.solo.bubble.n0
d96_2k.solo.bubble.nhK

@spencerkclark
Member Author

Thanks @laurenchilutti -- it is a custom test in the SHiELD-minimal repository that I set up. The namelist parameters are defined here; add_noise is not set, meaning that it takes on its default value of -1.
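
For reference, this is easy to confirm in the run directory -- a minimal sketch, assuming the model reads the standard input.nml namelist file:

```bash
# Confirm add_noise is not set anywhere in the namelist,
# so it falls back to its default value of -1.
grep -in "add_noise" input.nml || echo "add_noise not set (default -1 applies)"
```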

@lharris4
Contributor

lharris4 commented Sep 25, 2023 via email

@spencerkclark
Member Author

spencerkclark commented Sep 25, 2023

Thanks @lharris4, that's correct, the answers only have the potential to change when I recompile (and even then they seem to take on one of just two possible states). For a given executable, the model seems to produce consistent results (as evidenced by this test, which runs the executable 5 times and checks that it gets the same result each time).

It is surprising that a random seed would change at compile time. Is this relevant only when using specific schemes? I.e. for testing purposes should I try running in a different configuration? For example, we do not seem to have this problem in FV3GFS.
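
For concreteness, the run-to-run check amounts to something like the sketch below (the executable name, MPI launcher, and rank count are placeholders, not the actual SHiELD-minimal scripts):

```bash
# Run the same executable several times and compare checksums of the restarts.
ref=""
for i in 1 2 3 4 5; do
    rm -rf RESTART && mkdir RESTART
    mpirun -n 6 ./shield.x                  # placeholder launch command
    sum=$(cat RESTART/*.nc | md5sum)
    if [ -z "$ref" ]; then
        ref="$sum"
    elif [ "$sum" != "$ref" ]; then
        echo "run $i differs from run 1"
    fi
done
```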

@lharris4
Contributor

I believe the only scheme that would use the random seed is the cloud overlap scheme, although some versions of the convection scheme also use a random seed. I do know that the random seed was set up in a way to ensure run-to-run consistency/reproducibility.

You can try a 1-timestep test (run length the same as dt_atmos) and compare restart files to get an idea where precisely the reproducibility problem appears. As to why it would only change across recompiles and not re-runs, I am not sure.
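
One way to do that comparison, sketched here assuming the nccmp utility is available and with the two runs' output in placeholder directories run_a/ and run_b/:

```bash
# Compare the 1-timestep restart files field by field; -d compares data,
# -f continues past the first difference, -S prints difference statistics.
for f in run_a/RESTART/*.nc; do
    nccmp -d -f -S "$f" "run_b/RESTART/$(basename "$f")"
done
```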

@spencerkclark
Member Author

I tried the single-timestep test:

  • In the fv_core.res restart files, only T, u, and v differ. Maximum absolute differences at a grid point are 2.8e-13, 1.8e-15, and 7.1e-15, respectively.
  • In the fv_tracer.res restart files, only sphum, ice_wat, and cld_amt differ. Maximum absolute differences at a grid point are 6.9e-18, 2.7e-19, and 7.8e-14, respectively.
  • In the fv_srf_wnd.res restart files, no fields differ.
  • In the sfc_data restart files, only the tprcp field differs. Maximum absolute difference at a grid point is 6.8e-19.

Differences are extremely small and appear at only a limited number of grid points.

I am currently seeing if I can reproduce this behavior (inconsistent results between clean compiles) with Intel compilers on Gaea, using the exact same test case.

@spencerkclark
Member Author

Through four independent builds and runs with Intel compilers on Gaea, I am not able to reproduce this issue with this exact test case (i.e. I always get the same answer), which points to an issue in the interaction between my docker environment (which I think is fairly innocuous?), the GNU compilers, and SHiELD_build.

@spencerkclark
Member Author

> Through four independent builds and runs with Intel compilers on Gaea, I am not able to reproduce this issue with this exact test case (i.e. I always get the same answer)

The same is true if I use GNU compilers on Gaea.

@lharris4
Contributor

lharris4 commented Sep 25, 2023 via email

@spencerkclark
Member Author

spencerkclark commented Sep 26, 2023

It appears that if I upgrade the GNU compilers in my container from version 11.4 to version 12.3 (more consistent with Gaea C5, which uses 12.2), I am able to obtain reproducible builds (at least through five consecutive clean builds); see ai2cm/SHiELD-minimal#4. I will keep this open until I see this confirmed in a few more build cycles, but this seems like a promising way forward.
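
Roughly, the change amounts to installing and pointing the build at GCC 12 inside the Ubuntu 22.04 image -- a sketch only; the exact package and toolchain handling in ai2cm/SHiELD-minimal#4 may differ:

```bash
# Install the GCC 12 toolchain alongside the default 11.4 and select it.
apt-get update && apt-get install -y gcc-12 g++-12 gfortran-12
export CC=gcc-12 CXX=g++-12 FC=gfortran-12
```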

@bensonr
Contributor

bensonr commented Sep 27, 2023

Thanks for the update, Spencer. I know at one time MSD found what we believe was a compiler bug with gcc 11.1 when testing it as part of our FMS CI. We subsequently tested v11.3 successfully, but have no data for v11.4.

@spencerkclark
Member Author

Thanks @bensonr -- I have further traced this back to the fact that the version of MPICH available from the package manager in the Ubuntu LTS 22.04 image automatically uses link time optimization with its compiler wrappers. If I manually disable it by adding -fno-lto to the FFLAGS and CFLAGS, I get reproducible builds with version 11.4 GNU compilers; see ai2cm/SHiELD-minimal#6 for more details / context.

That could be old news to MSD folks, but I’m posting this here in case anyone else comes across this (the lesson seems to be: beware of link time optimization if reproducibility is important).
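
For anyone who wants to check their own image, the sketch below shows the kind of inspection and override involved (MPICH's wrappers accept -show to print the underlying compiler command line; exactly where the extra flags are threaded into SHiELD_build's COMPILE setup may differ from this):

```bash
# Inspect what the MPICH compiler wrappers inject; on the Ubuntu 22.04
# packages this included LTO-related flags (e.g. -flto=auto).
mpicc -show
mpifort -show

# Workaround: disable link time optimization explicitly in the compile flags.
export CFLAGS="${CFLAGS} -fno-lto"
export FFLAGS="${FFLAGS} -fno-lto"
```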

@lharris4
Contributor

lharris4 commented Sep 28, 2023 via email

@bensonr
Contributor

bensonr commented Sep 28, 2023

@spencerkclark - thanks for finding this and bringing it to our attention. I don't think we've seen this before, so I will make sure to alert the team about potential issues with GNU compiles.

@spencerkclark
Member Author

Sounds good @bensonr. I am closing this issue, as I am now confident in the cause and how to work around it (currently just falling back to using an older Ubuntu LTS image, i.e. 20.04, in which MPICH does not add options related to link time optimization by default). Overriding the flags is of course an option if using a newer Ubuntu LTS image.
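
In build-command terms, the two workarounds look roughly like the following (BASE_IMAGE and EXTRA_FLAGS are illustrative build arguments, not the actual SHiELD-minimal interface):

```bash
# Option 1: older base image whose MPICH wrappers do not inject LTO flags.
docker build --build-arg BASE_IMAGE=ubuntu:20.04 -t shield-minimal .

# Option 2: newer base image, explicitly disabling LTO in the build flags.
docker build --build-arg BASE_IMAGE=ubuntu:22.04 \
             --build-arg EXTRA_FLAGS="-fno-lto" -t shield-minimal .
```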
