Independent builds do not necessarily produce consistent results #25
Comments
Which tests are not reproducing? Is it some of the tests in this (NOAA-GFDL/SHiELD_build) repository? Or is it a test in the https://github.com/ai2cm/SHiELD-minimal/tree/main repository? The only reason I ask is that the tests in NOAA-GFDL/SHiELD_build/RTS/CI are known to not reproduce because of the …
Thanks @laurenchilutti -- it is a custom test in the SHiELD-minimal repository that I set up. The namelist parameters are defined in https://github.com/ai2cm/SHiELD-minimal/blob/main/tests/default.yml; add_noise is not set, meaning that it takes on its default value of -1.
Hi, Spencer. When you say that independent builds give inconsistent results, do you mean the answers change after you recompile, but not from run to run (without recompiling)? There are a couple of random seeds in the SHiELD physics that may change upon recompilation.
Thanks
Lucas
Thanks @lharris4, that's correct: the answers only have the potential to change when I recompile (and when they do change, they seem to flip between just two states). For a given executable, the model produces consistent results (as evidenced by this test, which runs the executable five times and checks that it gets the same result). It is surprising that a random seed would change at compile time. Is this relevant only when using specific schemes? I.e., for testing purposes, should I try running in a different configuration? For example, we do not seem to have this problem in FV3GFS.
I believe the only scheme that would use the random seed is the cloud overlap scheme, although some versions of the convection scheme also use a random seed. I do know that the random seed was set up in a way that ensures run-to-run consistency/reproducibility. You can try a 1-timestep test (run length equal to dt_atmos) and compare restart files to get an idea of where precisely the reproducibility problem appears. As to why it would only change across recompiles and not re-runs, I am not sure.
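A comparison along those lines can be scripted. The sketch below is illustrative only and not part of any SHiELD tooling; it assumes the restarts are netCDF files, and the paths are hypothetical.

```python
# Minimal sketch (not SHiELD tooling): compare two netCDF restart files and
# report any variable that is not bitwise identical.  Paths are hypothetical.
import numpy as np
import netCDF4


def differing_variables(path_a, path_b):
    """Return the names of variables whose values differ between two files."""
    with netCDF4.Dataset(path_a) as a, netCDF4.Dataset(path_b) as b:
        # Disable automatic masking so comparisons see the raw stored values.
        a.set_auto_mask(False)
        b.set_auto_mask(False)
        names = []
        for name, var_a in a.variables.items():
            if name not in b.variables:
                names.append(name)
                continue
            if not np.array_equal(var_a[...], b.variables[name][...]):
                names.append(name)
        return names


if __name__ == "__main__":
    # Example usage with hypothetical restart paths from two builds.
    diffs = differing_variables("run_a/RESTART/fv_core.res.tile1.nc",
                                "run_b/RESTART/fv_core.res.tile1.nc")
    print("differing variables:", diffs or "none")
```

Looping such a comparison over all restart files after a single dt_atmos step narrows down which component first introduces the difference.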
I tried the single-timestep test: differences are extremely small and appear at only a limited number of grid points. I am currently seeing if I can reproduce this behavior (inconsistent results between clean compiles) using Intel compilers on Gaea with the exact same test case.
Through four independent builds and runs with Intel compilers on Gaea, I am not able to reproduce this issue with this exact test case (i.e. I always get the same answer), which points to an issue in the interaction between my Docker environment (which I think is fairly innocuous?), the GNU compilers, and SHiELD_build.
The same is true if I use GNU compilers on Gaea.
Hi, Spencer. Thank you for checking closely on this.
Lucas
It appears that if I upgrade the GNU compilers in my container from version 11.4 to version 12.3 (more consistent with Gaea C5, which uses 12.2), I am able to obtain reproducible builds (at least through five consecutive clean builds); see ai2cm/SHiELD-minimal#4. I will keep this open until I see this confirmed in a few more build cycles, but this seems like a promising way forward.
Thanks for the update, Spencer. I know at one time MSD found what we believe was a compiler bug with gcc 11.1 when testing it as part of our FMS CI. We had subsequently tested v11.3 successfully, but have no data for v11.4.
Thanks @bensonr -- I have further traced this back to the fact that the version of MPICH available from the package manager in the Ubuntu LTS 22.04 image automatically uses link time optimization with its compiler wrappers. If I manually disable it by adding -fno-lto to the FFLAGS and CFLAGS, I get reproducible builds with the version 11.4 GNU compilers; see ai2cm/SHiELD-minimal#6 for more details / context. That could be old news to MSD folks, but I'm posting this here in case anyone else comes across this (the lesson seems to be: beware of link time optimization if reproducibility is important).
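For anyone who wants to check whether their own MPI wrappers inject LTO flags, one option is to ask the wrapper what it would run. The sketch below is illustrative only; it assumes MPICH-style mpicc/mpifort wrappers on the PATH, whose -show option prints the underlying compiler command without executing it.

```python
# Illustrative check (a sketch, not SHiELD tooling): MPICH compiler wrappers
# accept "-show", which prints the underlying compiler command without
# running it, so we can look for an injected -flto.
import shutil
import subprocess


def wrapper_injects_lto(wrapper):
    """Return True if the wrapper's underlying command line contains -flto."""
    if shutil.which(wrapper) is None:
        raise FileNotFoundError(f"{wrapper} not found on PATH")
    result = subprocess.run([wrapper, "-show"], capture_output=True,
                            text=True, check=True)
    return "-flto" in result.stdout


if __name__ == "__main__":
    for wrapper in ("mpicc", "mpifort"):
        if wrapper_injects_lto(wrapper):
            # Per the workaround above, appending -fno-lto to CFLAGS/FFLAGS
            # overrode the injected flag and restored reproducible builds.
            print(f"{wrapper} injects -flto; consider adding -fno-lto")
        else:
            print(f"{wrapper} does not appear to inject -flto")
```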
Hi, Spencer. Thanks for hunting that down. The joy of evolving user packages :-(
Lucas
@spencerkclark - thanks for finding this and bringing it to our attention. I don't think we've seen this before, so I will make sure to alert the team about potential issues with GNU compiles.
Sounds good @bensonr. I am closing this issue, as I am now confident in the cause and in how to work around it (currently just falling back to using an older Ubuntu LTS image, i.e. 20.04, in which MPICH does not add options related to link time optimization by default). Overriding the flags is of course an option if using a newer Ubuntu LTS image.
Is your question related to a problem? Please describe.
As part of a more involved development project, I am building and running SHiELD in a Docker image using GNU compilers. A test I am running depends on the model consistently producing bitwise identical results for a given configuration. Puzzlingly, I am finding that the answers the model produces change depending on the build; specifically, they seem to flip randomly between two states.
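For concreteness, the bit-for-bit check amounts to comparing the output files produced by independently built executables. The sketch below is a minimal, hypothetical illustration of such a check (directory names and file patterns are assumptions), not the actual test in the repository.

```python
# Minimal illustration (hypothetical paths, not the SHiELD-minimal test):
# hash every output file from two independently built runs and compare.
import hashlib
from pathlib import Path


def sha256sum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def runs_bitwise_identical(dir_a, dir_b, pattern="*.nc"):
    """True if every matching file in dir_a has an identical twin in dir_b."""
    return all(
        sha256sum(f) == sha256sum(Path(dir_b) / f.name)
        for f in sorted(Path(dir_a).glob(pattern))
    )


if __name__ == "__main__":
    print("bitwise identical:",
          runs_bitwise_identical("output_build_1", "output_build_2"))
```

Note that whole-file hashes can differ for metadata-only reasons (for example, timestamps written into netCDF attributes), in which case a field-by-field comparison of the variables, as sketched in the comments above, is more robust.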
This repository minimally illustrates my setup. It contains a Dockerfile, which is used to build SHiELD using the COMPILE script in this repository, submodules for the relevant SHiELD source code, and some infrastructure for testing the model within the Docker image. The README should contain all the information necessary for reproducing the issue locally (at least on a system that supports Docker). The upshot is that the regression tests, which check for bit-for-bit reproducibility across builds, do not always pass.
Describe what you have tried
I am a bit stumped at this point, so my idea here was to try to distill things to a minimal reproducible example and reach out to see whether there is something obvious I am doing wrong. Is there an issue in my environment or in how I am configuring the build that is leading to this problem? I am happy to provide more information where needed. I appreciate your help!