Error with MOM5 with current main #352

Closed
mathomp4 opened this issue Nov 10, 2021 · 23 comments
Labels
bug Something isn't working


@mathomp4
Member

The current GEOSgcm main (as of today) seems to have an issue running MOM5, but as near as I can tell nothing has fundamentally changed with MOM5! There were some changes to GEOSgcm_App from Ricardo, but I tested those in a MOM5 run yesterday and they seemed to work. I also ran xxdiff between a working run from last night and the current test, and pretty much all the differences are whitespace!

To wit, as an example, in this file (NOTE: the SLURM file name will change every night):

/discover/nobackup/mathomp4/SystemTests/runs/AGCM_MOM5/c90_MOM5_GOCART/CURRENT/run/1day/slurm-47377862.out

the error is:

 Integer*4 Resource Parameter: CICE_NDYN_DT:1
                                  Memuse(MB) at SEAICEMAPL_GenericInitialize=  3.498E+02  3.498E+02  3.318E+02  3.375E+02  0.000E+00
                                                           Mem/Swap Used (MB) at SEAICEMAPL_GenericInitialize=  1.548E+04  0.000E+00
                                                CommitLimit/Committed_AS (MB) at SEAICEMAPL_GenericInitialize=  1.801E+05  4.323E+04
NOTE from PE     0: diag_manager_mod::diag_manager_init: prepend_date only supported when diag_manager_init is called with time_init present.
MOMInitialize                                  877
MOMInitialize                                  877

Looking in GEOSgcm_GridComp:

! Check local sizes of two horizontal dimensions
!-----------------------------------------------

    call mom4_get_dimensions(isc, iec, jsc, jec, nk_out=LM)
    call MAPL_GridGet(GRID, localCellCountPerDim=counts, RC=status)
    VERIFY_(STATUS)

    IM=iec-isc+1
    JM=jec-jsc+1
    
    ASSERT_(counts(1)==IM)
    ASSERT_(counts(2)==JM)

the line in question (877) is:

    ASSERT_(counts(1)==IM)

This error did not happen in MAPL develop tests last night, so I can't blame MAPL.

I will probably be consulting @yvikhlya and @sanAkel about this.

@mathomp4 mathomp4 self-assigned this Nov 10, 2021
@mathomp4 mathomp4 added the bug Something isn't working label Nov 10, 2021
@mathomp4
Member Author

Well, I ran MOM5 at NAS on this "day of no NCCS" and it's reproducible.

To try and debug, first, I added some prints in that chunk of code above:

    write(*,*) "counts(1): ", counts(1)
    write(*,*) "counts(2): ", counts(2)
    write(*,*) "iec: ", iec, "isc: ", isc
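    ! Corrected to print jec/jsc. The runs below printed iec/isc on this line,
    ! which is why their "jec/jsc" output repeats the i values.
    write(*,*) "jec: ", jec, "jsc: ", jsc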
    write(*,*) "IM :", IM
    write(*,*) "JM :", JM

Here counts is from MAPL and iec, jec, etc. come from MOM5 (via FMS I guess?). In the good run (v10.19.4):

 counts(1):           10
 counts(2):           20
 iec:          290 isc:          281
 jec:          290 jsc:          281
 IM :          10
 JM :          20

and on a bad run (main):

 counts(1):           10
 counts(2):           20
 iec:            0 isc:            0
 jec:            0 jsc:            0
 IM :           1
 JM :           1

So for some reason, MOM/FMS is returning...no grid? Or something.

And again...we did not touch MOM5. Or FMS.

And Ricardo's App changes are pretty boring. I did a test where I made an experiment in main but used the GEOSgcm.x from v10.19.4 and that is happy, so I can't see there being an issue in the gcm_setup phase.

@yvikhlya

Looks like a problem with decomposition. isc, jsc, iec, jec are the start and end indices of the compute domain. IM, JM are the sizes of the domain.
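To make the failed assertion concrete, here is a small sketch of the extent arithmetic from the Fortran above (IM = iec - isc + 1), using the printed values from the good and bad runs. The function name `extent` is only for illustration.

```shell
# extent START END: number of cells in an inclusive index range,
# mirroring IM = iec - isc + 1 in the plug code quoted above.
extent() {
    echo $(( $2 - $1 + 1 ))
}

extent 281 290   # good run: isc=281, iec=290
extent 0 0       # bad run:  isc=0,   iec=0
```

With isc = iec = 0 the extent collapses to 1, which can never match counts(1) = 10 from MAPL, so ASSERT_(counts(1)==IM) fails.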

@mathomp4
Member Author

Looks like a problem with decomposition. isc, jsc, iec, jec are the start and end indices of the compute domain. IM, JM are the sizes of the domain.

@yvikhlya I agree but...nothing changed. My hope was that input.nml or something that helps control MOM got screwed up, but nope. Exactly the same!

My current fear is that this is one of those "We changed the memory state and things are running differently" things. As in, we can add a print statement in MOM5 somewhere and all will work again!

@sanAkel
Collaborator

sanAkel commented Nov 10, 2021

@mathomp4
I can try to look into it with you on Monday next week, in a debug build.

@sanAkel
Collaborator

sanAkel commented Nov 16, 2021

We will get back to this issue at a later date. @mathomp4 closing it for now, please feel free to reopen if needed. Thanks!

@yvikhlya

yvikhlya commented Jul 8, 2022

Seems like ocean_model_init did not run properly and grid dimensions which are returned by mom4_get_dimensions are junk. Now, why is that?

@sanAkel
Collaborator

sanAkel commented Jul 8, 2022

Seems like ocean_model_init did not run properly and grid dimensions which are returned by mom4_get_dimensions are junk. Now, why is that?

It may be easier or faster to know why using 1-deg resolution that @mathomp4 says has the same problem.

@yvikhlya

yvikhlya commented Jul 8, 2022

I already have a 0.25-degree setup and an interactive session, so I don't see a need to switch to 1 degree.

@yvikhlya

yvikhlya commented Jul 8, 2022

The last successful run with MOM5 I did was with v10.14.1 about 2 years ago. Something got broken since then.

@yvikhlya

yvikhlya commented Jul 8, 2022

@mathomp4 Do you have any suggestions on how to debug this? I can't think of anything better than putting printouts inside of ocean_model_init.

@mathomp4
Member Author

mathomp4 commented Jul 8, 2022

@yvikhlya Not really. When this happened I was just confused. It just sort of "happened" one night and the only changes I could see were whitespace changes! It was like all of a sudden the system decided to do this.

I suppose one possible thought is to try a run with GNU? Maybe it will show a different error? I am not sure.

@yvikhlya

yvikhlya commented Jul 8, 2022

@mathomp4 Unrelated issue, but I can't push to GitHub today. I have my SSH RSA key uploaded to GitHub and I was always able to push without a password, but today it asks me for a password and then says that I need an access token. How do you use GitHub these days?

@mathomp4
Member Author

mathomp4 commented Jul 8, 2022

@mathomp4 Unrelated issue, but I can't push to GitHub today. I have my SSH RSA key uploaded to GitHub and I was always able to push without a password, but today it asks me for a password and then says that I need an access token. How do you use GitHub these days?

If you are seeing "access token", you might have cloned the https URL instead of the SSH. You can run git remote -v to see what you have in that repo to confirm.

If that happened, you can switch your remote url with:

git remote set-url origin git@github.com:GEOS-ESM/GEOSgcm.git

where you change that to whichever repo you are in.
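For other repos, the SSH form can be derived mechanically from the HTTPS one. A minimal sketch, assuming a plain https://github.com/ URL; the helper name `to_ssh_url` is made up for illustration and is not part of git:

```shell
# to_ssh_url URL: rewrite a GitHub HTTPS clone URL into its SSH form.
# Any URL that does not start with https://github.com/ is printed unchanged.
to_ssh_url() {
    case $1 in
        https://github.com/*)
            printf 'git@github.com:%s\n' "${1#https://github.com/}" ;;
        *)
            printf '%s\n' "$1" ;;
    esac
}

# Possible use inside a clone (not run here):
#   git remote set-url origin "$(to_ssh_url "$(git remote get-url origin)")"
```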

Now, if you are like me and never want an HTTPS url from github ever again, you can run:

git config --global url."git@github.com:".insteadOf "https://github.com/"

and from now on, git will always clone with SSH from github even if you accidentally pass it an HTTPS one!

@yvikhlya

yvikhlya commented Jul 8, 2022

@mathomp4 Thanks! That was it.

@sanAkel
Collaborator

sanAkel commented Jul 8, 2022

@mathomp4 Do you have any suggestions on how to debug this? I can't think of anything better than putting printouts inside of ocean_model_init.

Well, 2 suggestions:

  1. Use the debugger; the debug build worked for me a few months ago.
  2. Again please use the 1-deg version. That will be easier to work with.

@yvikhlya

yvikhlya commented Jul 14, 2022

@mathomp4 There is something wrong here. A printout from MOM5 run:

NOTE from PE     0: callTree: ---> ocean_model_init(), ocean_model_MOM.F90

ocean_model_MOM.F90 is part of MOM6, not MOM5. MOM5 should pick up ocean_model_init() from ocean_model.F90. There is a name collision here.

P.S. Just verified that it runs MOM_GEOS5PlugMod.F90 (MOM5), but ocean_model_init from MOM6, not from MOM5.
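This kind of check can be scripted against the run log. A rough sketch, assuming the callTree line format shown above; `calltree_source` is a hypothetical helper name, and it only reports what the log says, it does not inspect the libraries themselves:

```shell
# calltree_source: read log lines on stdin and print "routine source_file"
# for every callTree entry of the form
#   NOTE from PE     0: callTree: ---> routine(), source_file.F90
calltree_source() {
    sed -n 's/.*callTree: ---> \([A-Za-z0-9_]*\)(), *\([^ ]*\) *$/\1 \2/p'
}

# Example with the line quoted in this thread:
echo 'NOTE from PE     0: callTree: ---> ocean_model_init(), ocean_model_MOM.F90' | calltree_source
```

Per the comment above, seeing ocean_model_MOM.F90 in a MOM5 experiment means the MOM6 routine was the one that ran.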

@sanAkel
Collaborator

sanAkel commented Jul 15, 2022

@mathomp4 There is something wrong here. A printout from MOM5 run:


NOTE from PE     0: callTree: ---> ocean_model_init(), ocean_model_MOM.F90

ocean_model_MOM.F90 is part of MOM6, not MOM5. MOM5 should pick up ocean_model_init() from ocean_model.F90. There is a name collision here.

P.S. Just verified that it runs MOM_GEOS5PlugMod.F90 (MOM5), but ocean_model_init from MOM6, not from MOM5.

Hmm! Maybe that shared object lib/ DSO stuff hitting us again?

@mathomp4
Member Author

We might need to add back LD_PRELOAD?

@yvikhlya

@mathomp4 Could you remind me how to use LD_PRELOAD in csh? It works in bash for me:

$ LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so ldd GEOSgcm.x | grep libmom
        /home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so (0x00002b81fdeef000)
        libmom6.so => /home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom6.so (0x00002b82129ee000)

But gives error in csh:

> LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so ldd GEOSgcm.x | grep libmom
LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so: Command not found.
> set LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so ldd GEOSgcm.x | grep libmom
set: Variable name must contain alphanumeric characters.

@mathomp4
Member Author

@yvikhlya You have to use env:

env LD_PRELOAD=${GEOSDIR}/lib/libmom5.so ...
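The reason this works in csh is that env is an external program, so the VAR=value prefix is handled by env itself and not by the shell's own syntax. A small sketch of the pattern, using a harmless made-up variable (DEMO_LIB) in place of LD_PRELOAD; the wrapper name `run_with` is also just for illustration:

```shell
# run_with NAME VALUE CMD [ARGS...]: run CMD with NAME=VALUE set only for
# that one process, via env. The same "env NAME=VALUE cmd" spelling is
# accepted by bash and csh alike.
run_with() {
    name=$1
    value=$2
    shift 2
    env "$name=$value" "$@"
}

# DEMO_LIB and its path are placeholders standing in for LD_PRELOAD and the library.
run_with DEMO_LIB /tmp/libdemo.so sh -c 'echo "child sees: $DEMO_LIB"'
```

The variable is visible to the child process only; the calling shell's environment is left untouched.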

@yvikhlya

yvikhlya commented Jul 15, 2022

LD_PRELOAD works! MOM5 initialized correctly. If this is a solution we are going to use, we need to update gcm_run.j and submit a PR (I can do it).

The model crashed in the land component, though, with this error:

<CATCH_INTERNAL_RST is NOT consistent with VEGDYN Data>

This is a whole separate issue, something is wrong with restarts which we generated with @sanAkel last week. I am investigating this issue.

@mathomp4
Member Author

LD_PRELOAD works! MOM5 initialized correctly. If this is a solution we are going to use, we need to update gcm_run.j and submit a PR (I can do it).

Nice! I suppose a simple "If MOM5, add LD_PRELOAD" can work.
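A sketch of what that conditional might look like, written in POSIX sh for illustration. gcm_run.j itself is a csh script, so the real change would need translating, and the names used here (`preload_for_ocean`, OCEAN_NAME, the library path) are assumptions, not what the script actually uses:

```shell
# preload_for_ocean OCEAN_NAME MOM5_LIB
# Prints MOM5_LIB when OCEAN_NAME is MOM5 and the file exists;
# prints nothing otherwise.
preload_for_ocean() {
    ocean=$1
    lib=$2
    if [ "$ocean" = "MOM5" ] && [ -f "$lib" ]; then
        printf '%s\n' "$lib"
    fi
}

# Possible use in a run script (hypothetical variable names, not run here):
#   preload=$(preload_for_ocean "$OCEAN_NAME" "$GEOSDIR/lib/libmom.so")
#   if [ -n "$preload" ]; then
#       env LD_PRELOAD="$preload" ./GEOSgcm.x
#   else
#       ./GEOSgcm.x
#   fi
```

Checking that the file exists avoids handing the loader a bad path if the library name differs between builds (both libmom.so and libmom5.so appear in this thread).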

The model crashed in the land component, though, with this error:

<CATCH_INTERNAL_RST is NOT consistent with VEGDYN Data>

Ouch. Yeah. That's when I start asking around!

@sanAkel
Collaborator

sanAkel commented Aug 31, 2022

I can confirm that works for both:

  • MOM5
  • MOM6
