Update FATES hydro test mod to remove temporary test failure workaround #2882

Open
wants to merge 8 commits into
base: tmp-241219

glemieux
Collaborator

@glemieux glemieux commented Nov 15, 2024

Description of changes

This PR reverts two commits that created a workaround to hydro issue NGEET/fates#1254.

Specific notes

Contributors other than yourself, if any: @XiulinGao

CTSM Issues Fixed (include github issue #): Fixes #2878

Are answers expected to change (and if so in what way)? Yes, only for the FatesColdHydro tests

Any User Interface Changes (namelist or namelist defaults changes)?

Does this create a need to change or add documentation? Did you do so?

Testing performed, if any: regular and fates

@glemieux glemieux added the test: fates (Pass fates test suite before merging) and FATES (A change needed for FATES that doesn't require a FATES API update) labels Nov 15, 2024
@glemieux glemieux requested a review from rgknox November 18, 2024 17:18
This tag includes the fix to NGEET/fates#1254 and will allow the current
default fates parameter file to be used in fates hydro tests
@samsrabin samsrabin added the test: aux_clm (Pass aux_clm suite before merging) label Dec 5, 2024
@glemieux
Collaborator Author

@slevis-lmwg @ekluzek if we can get this (and its FATES tag update) into master before the next b4b-dev update, I might then be able to rebase #2904 onto the b4b-dev branch, since b4b-dev will have the latest fates tag by then.

Rpointer files for restart now have the simulation timestamp in the filename

Add the simulation timestamp to the rpointer files. Also update submodules with this change
in CMEPS and CDEPS as well as updated cime to handle it. See the notes below for an explanation
about this.

Add a "clm" level directory under usermods_dirs so that the component where user-mods reside
is declared and to make them function the same as test-mods.
@glemieux glemieux changed the base branch from master to tmp-241219 January 2, 2025 19:13
@glemieux
Collaborator Author

glemieux commented Jan 3, 2025

Regression testing on derecho is showing b4b results with the expected exception of the fates testmods. The results appear consistent with the science and bug fix updates made over the intervening fates tag updates.

There is one non-fates testmod that is failing build: SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop. It's failing during the lnd build step with the following:

...
nvfortran-Fatal-/glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/bin/tools/fort1 TERMINATED by signal 11
gmake: *** [/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.pr2882-aux_clm_nvh/Tools/Makefile:949: dynColumnStateUpdaterMod.o] Error 2
...
ERROR: Command gmake complib -j 16 COMP_NAME=clm COMPLIB=/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.pr2882-aux_clm_nvh/bld/nvhpc/mpich/debug/nothreads/lib/libclm.a -f /glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.pr2882-aux_clm_nvh/Tools/Makefile CIME_MODEL=cesm  SMP=FALSE CASEROOT="/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.pr2882-aux_clm_nvh" CASETOOLS="/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.pr2882-aux_clm_nvh/Tools" CIMEROOT="/glade/u/home/glemieux/ctsm/cime" SRCROOT="/glade/u/home/glemieux/ctsm" COMP_INTERFACE="nuopc" COMPILER="nvhpc" DEBUG="TRUE" EXEROOT="/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.pr2882-aux_clm_nvh/bld" RUNDIR="/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.pr2882-aux_clm_nvh/run" INCROOT="/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.pr2882-aux_clm_nvh/bld/lib/include" LIBROOT="/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.pr2882-aux_clm_nvh/bld/lib" MACH="derecho" MPILIB="mpich" NINST_VALUE="c1a1l1r1" OS="CNL" PIO_VERSION=2 SHAREDLIBROOT="/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm/sharedlibroot.pr2882-aux_clm_nvh" BUILD_THREADED="FALSE" USE_ESMF_LIB="TRUE" USE_MOAB="FALSE" COMP_ATM="datm" COMP_ICE="sice" COMP_GLC="sglc" COMP_LND="clm" COMP_OCN="socn" COMP_ROF="mosart" COMP_WAV="swav" USE_TRILINOS="FALSE" USE_ALBANY="FALSE" USE_PETSC="FALSE"  failed with rc=2

Folder location: /glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2882-aux_clm

@glemieux
Collaborator Author

glemieux commented Jan 3, 2025

Regression testing on izumi ran into some issues yesterday that I think are related to what was likely an unscheduled outage (I wasn't able to log in after submitting tests later in the day). I'm going to try to resubmit these today. On a positive note, it looks like all the mpi-serial tests ran successfully instead of hitting #2915.

@glemieux
Collaborator Author

glemieux commented Jan 3, 2025

It looks like my speculation was incorrect regarding the failed runs. Re-submission did not alleviate the issues. I also attempted to generate a new baseline for tmp-241219.n01.ctsm5.3.016. I'm seeing the same failures in both:

FAIL SMS_D_Ln1.f10_f10_mg37.I2000Clm50BgcCropQianRs.izumi_intel.clm-run_self_tests NLCOMP
FAIL SMS_D_Ln1.f10_f10_mg37.I2000Clm50BgcCropQianRs.izumi_intel.clm-run_self_tests BASELINE tmp-241219.n01.ctsm5.3.016_1st.attempt: ERROR CPRNC failed to open files
FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm50Bgc.izumi_nag.clm-ciso RUN time=195
FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm60Bgc.izumi_nag.clm-ciso--clm-matrixcnOn RUN time=257
FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm60Bgc.izumi_nag.clm-ciso RUN time=234
FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-crop RUN time=38
FAIL SMS_D_Ld1_P48x1.f10_f10_mg37.I2000Clm45BgcCrop.izumi_nag.clm-oldhyd RUN time=37
FAIL SMS_D_Ld5.f10_f10_mg37.I2000Clm50BgcCrop.izumi_nag.clm-irrig_alternate RUN time=38
FAIL SMS_D_Ld65.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-FireLi2024GSWP RUN time=46
FAIL SMS_D_P48x1_Ld5.f10_f10_mg37.I2000Clm50BgcCrop.izumi_nag.clm-irrig_spunup RUN time=47
FAIL SMS_Ld5_D_P48x1.f10_f10_mg37.IHistClm50Bgc.izumi_nag.clm-monthly RUN time=126
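
To see the pattern in the failure list above at a glance, the lines can be tallied by machine/compiler and failing phase. This is a minimal sketch rather than a CIME tool: `parse_fail_line` is a hypothetical helper, and it assumes each test name has the dot-separated layout TESTTYPE.GRID.COMPSET.MACHINE_COMPILER.TESTMODS, which matches the names listed here.

```python
from collections import Counter

# Failure lines copied verbatim from the list above.
FAIL_LINES = """\
FAIL SMS_D_Ln1.f10_f10_mg37.I2000Clm50BgcCropQianRs.izumi_intel.clm-run_self_tests NLCOMP
FAIL SMS_D_Ln1.f10_f10_mg37.I2000Clm50BgcCropQianRs.izumi_intel.clm-run_self_tests BASELINE tmp-241219.n01.ctsm5.3.016_1st.attempt: ERROR CPRNC failed to open files
FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm50Bgc.izumi_nag.clm-ciso RUN time=195
FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm60Bgc.izumi_nag.clm-ciso--clm-matrixcnOn RUN time=257
FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm60Bgc.izumi_nag.clm-ciso RUN time=234
FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-crop RUN time=38
FAIL SMS_D_Ld1_P48x1.f10_f10_mg37.I2000Clm45BgcCrop.izumi_nag.clm-oldhyd RUN time=37
FAIL SMS_D_Ld5.f10_f10_mg37.I2000Clm50BgcCrop.izumi_nag.clm-irrig_alternate RUN time=38
FAIL SMS_D_Ld65.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-FireLi2024GSWP RUN time=46
FAIL SMS_D_P48x1_Ld5.f10_f10_mg37.I2000Clm50BgcCrop.izumi_nag.clm-irrig_spunup RUN time=47
FAIL SMS_Ld5_D_P48x1.f10_f10_mg37.IHistClm50Bgc.izumi_nag.clm-monthly RUN time=126
""".splitlines()


def parse_fail_line(line):
    """Return (machine_compiler, phase) for one 'FAIL <test> <phase> ...' line."""
    _status, test_name, phase = line.split()[:3]
    # Assumed layout: TESTTYPE.GRID.COMPSET.MACHINE_COMPILER.TESTMODS
    machine_compiler = test_name.split(".")[3]
    return machine_compiler, phase


counts = Counter(parse_fail_line(line) for line in FAIL_LINES)
for (machine_compiler, phase), n in sorted(counts.items()):
    print(f"{machine_compiler:12s} {phase:9s} {n}")
```

On this list, all nine RUN failures are izumi_nag tests; the two izumi_intel entries are the NLCOMP and BASELINE failures of the same run_self_tests case.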

All the cesm.logs are showing similar failure messages:

[i030.cgd.ucar.edu:mpi_rank_7][error_sighandler] Caught error: Segmentation fault (signal 11)
[i030.cgd.ucar.edu:mpi_rank_24][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3366609 RUNNING AT i030.cgd.ucar.edu
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
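
For reference, the `EXIT CODE: 139` in the log above agrees with the reported signal 11 under the common shell convention that a process killed by signal N exits with status 128 + N. The sketch below decodes such a code; `describe_exit_code` is a hypothetical helper, and it assumes the MPI launcher follows that convention, as the log suggests.

```python
import signal


def describe_exit_code(code):
    """Map a shell-style exit code to a signal name when it exceeds 128."""
    if code > 128:
        signum = code - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"unknown signal {signum}"
    return f"exit status {code}"


print(describe_exit_code(139))
```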

@ekluzek
Collaborator

ekluzek commented Jan 3, 2025

@glemieux I'm concerned about your seeing the seg faults. But, is this just due to Izumi instability? Try the same tests with the baseline version (ctsm5.3.016) and if they fail the same way -- then let's move forward.

@ekluzek
Collaborator

ekluzek commented Jan 3, 2025

Updated to say, try these tests for vanilla ctsm5.3.016 and if they fail the same way -- move forward. If they don't we probably need to figure it out, starting with the first tmp tag that @slevis-lmwg made.

@glemieux
Collaborator Author

glemieux commented Jan 4, 2025

> Updated to say, try these tests for vanilla ctsm5.3.016 and if they fail the same way -- move forward. If they don't we probably need to figure it out, starting with the first tmp tag that @slevis-lmwg made.

Roger that. Testing with ctsm5.3.016 on izumi is underway.

@glemieux
Collaborator Author

glemieux commented Jan 4, 2025

Regression testing ctsm5.3.016 only has expected test failures. So it looks like this is something specific to the tmp branch.

Results: /home/glemieux/scratch/ctsm-tests/tests_0103-164806iz

@slevis-lmwg
Contributor

@glemieux @ekluzek
This is good news regarding izumi and bad news about the recent tag that I made, which I now have to debug. I do not expect to start debugging before Monday, so let's discuss at the Stand-up.

@ekluzek
Collaborator

ekluzek commented Jan 6, 2025

From the standup this morning: the issue for @slevis-lmwg to work on is #2924. Once we know a little more about this, we will likely let @glemieux merge this as is, since it's orthogonal and so he doesn't have to redo testing. Then @slevis-lmwg will tag a fix after that.

@ekluzek ekluzek added this to the cesm3_0_beta06 milestone Jan 6, 2025
Collaborator

@ekluzek ekluzek left a comment

@glemieux I just looked this over. It's obviously really simple, but to confirm what's going on. The FatesHydro test used to have a special FatesHydro parameter file that it had to generate -- and now it just uses the default one? You also added a PVT test to expected fails.

@glemieux
Collaborator Author

glemieux commented Jan 6, 2025

> @glemieux I just looked this over. It's obviously really simple, but to confirm what's going on. The FatesHydro test used to have a special FatesHydro parameter file that it had to generate -- and now it just uses the default one? You also added a PVT test to expected fails.

Yep, that's correct. A while back, via #2700, I implemented the special FATES hydro parameter file as a temporary workaround for a FATES-side bug that had been exposed during testing. The fix came in via fates tag sci.1.80.1_api.37.0.0, which we're getting via the fates .gitmodules update here as well.

@ekluzek
Collaborator

ekluzek commented Jan 7, 2025

@glemieux go ahead and move your baselines in place as the next tmp tag (branch_tags/tmp-241219.n01.ctsm5.3.016), create the ChangeLog update for it, and let's finish this off. I'm not sure whose turn it is to do the next FATES tag, so I'll volunteer to do it.

@slevis-lmwg
Contributor

slevis-lmwg commented Jan 7, 2025

Minor correction to the new tag name:
branch_tags/tmp-241219.n02.ctsm5.3.016
I see how this does not follow convention exactly, but this is what Erik and I came up with :-)
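
The tmp tag names in this thread can be split into their parts with a small regular expression. This is only a sketch: the pattern is inferred from the two names that appear here (`tmp-241219.n01.ctsm5.3.016` and `tmp-241219.n02.ctsm5.3.016`), not from any documented CTSM convention, and `parse_tmp_tag` is a hypothetical helper.

```python
import re

# Assumed shape: <tmp branch>.n<two-digit sequence>.<base ctsm tag>
TAG_PATTERN = re.compile(
    r"^(?P<branch>tmp-\d{6})\.n(?P<seq>\d{2})\.(?P<base>ctsm\d+\.\d+\.\d+)$"
)


def parse_tmp_tag(tag):
    """Return (branch, sequence number, base tag) for a tmp tag name."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tmp tag name: {tag!r}")
    return match.group("branch"), int(match.group("seq")), match.group("base")


print(parse_tmp_tag("tmp-241219.n02.ctsm5.3.016"))
```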

Labels
FATES - A change needed for FATES that doesn't require a FATES API update.
test: aux_clm - Pass aux_clm suite before merging
test: fates - Pass fates test suite before merging
Projects
Status: In progress - master/b4b-dev
Development

Successfully merging this pull request may close these issues.

Remove fates_allom_smode shell_command update in FatesColdHydro testmod
4 participants