Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libaccessom2 intercommunicator profiling support #17

Merged
merged 1 commit into from
Jul 1, 2018

Conversation

marshallward
Copy link
Contributor

This patch modifies CICE to support changes to libaccessom2 which allow
the user to disable intercommunicators, instead using MPI_COMM_WORLD to
exchange messages between models. This work is primarily to support
limitations in the Score-P profiler.

Specific changed detailed below.

  • accessom2%config_sync uses the new API, and requires an libaccessom2
    update.

  • il_commatm has been removed from cpl_interface, since coupler%atm_root
    provides this information.

  • Most of the work in prism_init has been removed are replaced with
    equivalent operations in coupler%init_begin.

  • Explicit MPI send/recv calls now use ranks provided by coupler, rather
    than explicit sends to 0 over OASIS-provided intercommunicators.

  • Similarly, data sends now happen on the zero rank of CICE, rather than
    checking if the atmosphere rank of the intercommunicator is zero.
    Functionally, there should be no difference, but beware the
    unexpected.

  • Buildscripts for score-p have been added.

Further changes to cpl_interface are possible, since most of the MPI
information is provided by the coupler object.

This patch modifies CICE to support changes to libaccessom2 which allow
the user to disable intercommunicators, instead using MPI_COMM_WORLD to
exchange messages between models.  This work is primarily to support
limitations in the Score-P profiler.

Specific changed detailed below.

- accessom2%config_sync uses the new API, and requires an libaccessom2
  update.

- il_commatm has been removed from cpl_interface, since coupler%atm_root
  provides this information.

- Most of the work in prism_init has been removed are replaced with
  equivalent operations in coupler%init_begin.

- Explicit MPI send/recv calls now use ranks provided by coupler, rather
  than explicit sends to 0 over OASIS-provided intercommunicators.

- Similarly, data sends now happen on the zero rank of CICE, rather than
  checking if the atmosphere rank of the intercommunicator is zero.
  Functionally, there should be no difference, but beware the
  unexpected.

- Buildscripts for score-p have been added.

Further changes to cpl_interface are possible, since most of the MPI
information is provided by the coupler object.
@nichannah nichannah merged commit 637de60 into COSIMA:master Jul 1, 2018
@marshallward
Copy link
Contributor Author

Thanks Nic
Just a warning that this won't build unless the libaccessom2 are also merged.

@nichannah
Copy link
Contributor

Yep. Did you see any problems with it? I'm getting the following:

forrtl: error (73): floating divide by zero
Image PC Routine Line Source
cice_auscom_1440x 0000000000924461 Unknown Unknown Unknown
cice_auscom_1440x 000000000092259B Unknown Unknown Unknown
cice_auscom_1440x 00000000008D03B4 Unknown Unknown Unknown
cice_auscom_1440x 00000000008D01C6 Unknown Unknown Unknown
cice_auscom_1440x 000000000084E7F9 Unknown Unknown Unknown
cice_auscom_1440x 00000000008591A9 Unknown Unknown Unknown
libpthread-2.12.s 00002B4C56A8D7E0 Unknown Unknown Unknown
cice_auscom_1440x 000000000044C657 ice_atmo_mp_atmo_ 352 ice_atmo.f90
cice_auscom_1440x 00000000005E1946 ice_step_mod_mp_s 395 ice_step_mod.f90
cice_auscom_1440x 000000000040FBE2 cice_runmod_mp_ci 329 CICE_RunMod.f90
cice_auscom_1440x 000000000040C847 MAIN__ 60 CICE.f90
cice_auscom_1440x 000000000040C7DE Unknown Unknown Unknown
libc-2.12.so 00002B4C56EBDD1D __libc_start_main Unknown Unknown
cice_auscom_1440x 000000000040C6E9 Unknown Unknown Unknown

@marshallward
Copy link
Contributor Author

It's all working for me, but I would not be surprised if there are problems.

You do need to update all the models here, and even the module files need to be updated, it could be using old object files or module headers.

I'll look at the trace though.

@marshallward
Copy link
Contributor Author

This looks like a very similar bug that we saw ages ago, where one of these variable (I think thva?) was not being initialized. Equation below:

 349         ! compute stability & evaluate all stability functions                  
 350             hol(ij) = vonkar * gravit * zlvl(i,j) &                             
 351                    * (tstar(ij)/thva(ij) &                                      
 352                     + qstar(ij)/(c1/zvir+Qa(i,j))) &                            
 353                    / ustar(ij)**2                                               

Could this be an old problem that re-emerged?

@marshallward
Copy link
Contributor Author

Going over the old Slack logs, I encountered a similar problem where thva was being set to zero because it wasnt getting data from MATM. Have you also updated the YATM code here? I had to change all of the models, including YATM and MOM, not just CICE and libaccessom2. It may be that the old YATM is not correctly updating Tair (or whatever) inside of CICE.

If you are using my modified YATM and MOM source though, then none of this is relevant.

@nichannah
Copy link
Contributor

My hunch is that it can happen when there's something up with a coupling field or a grid mismatch.

@marshallward
Copy link
Contributor Author

Which experiment are you running? I can test it out in my environment.

@nichannah
Copy link
Contributor

nichannah commented Jul 1, 2018

I was running 025_jra55_ryf, now moving onto the 1deg.

I have a feeling we'll get rid of the intercommunicators now that we have this way of doing things. Do see any reason to hold on to the intercommunicators?

@nichannah
Copy link
Contributor

With 1deg_jra55_ryf I get a different error:

forrtl: error (72): floating overflow
Image PC Routine Line Source
cice_auscom_360x3 000000000093CDB1 Unknown Unknown Unknown
cice_auscom_360x3 000000000093AEEB Unknown Unknown Unknown
cice_auscom_360x3 00000000008E8D04 Unknown Unknown Unknown
cice_auscom_360x3 00000000008E8B16 Unknown Unknown Unknown
cice_auscom_360x3 0000000000867149 Unknown Unknown Unknown
cice_auscom_360x3 0000000000871ABC Unknown Unknown Unknown
libpthread-2.12.s 00002B99D4A8D7E0 Unknown Unknown Unknown
cice_auscom_360x3 00000000005464C4 ice_grid_mp_tlatl 1340 ice_grid.f90
cice_auscom_360x3 000000000054CAEB ice_grid_mp_init_ 390 ice_grid.f90
cice_auscom_360x3 000000000040EECC cice_initmod_mp_c 147 CICE_InitMod.f90
cice_auscom_360x3 000000000040C83D MAIN__ 54 CICE.f90
cice_auscom_360x3 000000000040C7DE Unknown Unknown Unknown
libc-2.12.so 00002B99D4EBDD1D __libc_start_main Unknown Unknown
cice_auscom_360x3 000000000040C6E9 Unknown Unknown Unknown

@marshallward
Copy link
Contributor Author

marshallward commented Jul 1, 2018 via email

@nichannah
Copy link
Contributor

I've merged all pull requests.

@marshallward
Copy link
Contributor Author

I probably should have said in the last post, but I changed the config_sync function's arguments. It could be that everything compiles OK bit that garbage (or zeros) are being passed out read in the function input stack.

@marshallward
Copy link
Contributor Author

It looks like the MOM changes are still outstanding:

mom-ocean/MOM5#226

Not sure if this is why it's failing in ice_grid though. (Not in a great position to check this trace at the moment.)

Apologies for causing chaos here.

@nichannah
Copy link
Contributor

I haven't pushed the mom changes back yet but I do have them locally.

It's all fine, just keen to get your changes in.

@marshallward
Copy link
Contributor Author

marshallward commented Jul 1, 2018

So I did get the 1deg_jra_ryf run to work using master CICE and libaccessom2 (my changes + yours since my PR), and my local MOM branch (which may be missing any changes by you since then).

I did notice one particular problem where I needed to include an empty &coupler_nml / group inside of accessom2.nml. I tested the flag as true and false, but did not test if the namelist was entirely absent.

Since you are not seeing this error, I wonder if we are running the same code. Maybe your MOM source is different from mine? The only other thing I can think of is that you may need to delete+rebuild your object/mod files for all 3 models.

Edit: @nicjhan mentioned to me that he did see the coupler_nml error and had fixed it himself.

@marshallward
Copy link
Contributor Author

marshallward commented Jul 1, 2018

Following up here, I didn't intend for everyone to add a whole new namelist group to every single accessom2.nml; this was a major oversight on my part. I forgot that a namelist must include every group the group that it's reading (though not every variable of that group). I'd be up for changing this, maybe adding to the accessom2_nml group somehow.

Also, I haven't commented on whether we should just disable all interconnects and use MPI_COMM_WORLD, because I figure it's better to first ensure that everything is working!

@russfiedler
Copy link
Contributor

@nicjhan Why are all those stack traces pointing back at .f90 files rather than the original .F90 source? Is there a change to how the compilation is done or has cice always had a separate preprocessing step?

@Acosta-Goncalves
Copy link

Hi,

I'm running into the same problem (COSIMA/access-om2#173) Nic reported above:

forrtl: error (73): floating divide by zero
Image PC Routine Line Source
cice_auscom_1440x 000000000044C657 ice_atmo_mp_atmo_ 352 ice_atmo.f90

And, like Marshall said, it seems to be linked to thva. Did you ever figure out how to solve the issue?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants