-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libaccessom2 intercommunicator profiling support #17
Conversation
This patch modifies CICE to support changes to libaccessom2 which allow the user to disable intercommunicators, instead using MPI_COMM_WORLD to exchange messages between models. This work is primarily to support limitations in the Score-P profiler. Specific changed detailed below. - accessom2%config_sync uses the new API, and requires an libaccessom2 update. - il_commatm has been removed from cpl_interface, since coupler%atm_root provides this information. - Most of the work in prism_init has been removed are replaced with equivalent operations in coupler%init_begin. - Explicit MPI send/recv calls now use ranks provided by coupler, rather than explicit sends to 0 over OASIS-provided intercommunicators. - Similarly, data sends now happen on the zero rank of CICE, rather than checking if the atmosphere rank of the intercommunicator is zero. Functionally, there should be no difference, but beware the unexpected. - Buildscripts for score-p have been added. Further changes to cpl_interface are possible, since most of the MPI information is provided by the coupler object.
Thanks Nic |
Yep. Did you see any problems with it? I'm getting the following: forrtl: error (73): floating divide by zero |
It's all working for me, but I would not be surprised if there are problems. You do need to update all the models here, and even the module files need to be updated, it could be using old object files or module headers. I'll look at the trace though. |
This looks like a very similar bug that we saw ages ago, where one of these variable (I think
Could this be an old problem that re-emerged? |
Going over the old Slack logs, I encountered a similar problem where If you are using my modified YATM and MOM source though, then none of this is relevant. |
My hunch is that it can happen when there's something up with a coupling field or a grid mismatch. |
Which experiment are you running? I can test it out in my environment. |
I was running 025_jra55_ryf, now moving onto the 1deg. I have a feeling we'll get rid of the intercommunicators now that we have this way of doing things. Do see any reason to hold on to the intercommunicators? |
With 1deg_jra55_ryf I get a different error: forrtl: error (72): floating overflow |
Have you tried with my changes to YATM? Or is this the unmodified version?
…On Sun, 1 Jul 2018, 16:26 Nic Hannah, ***@***.***> wrote:
With 1deg_jra55_ryf I get a different error:
forrtl: error (72): floating overflow
Image PC Routine Line Source
cice_auscom_360x3 000000000093CDB1 Unknown Unknown Unknown
cice_auscom_360x3 000000000093AEEB Unknown Unknown Unknown
cice_auscom_360x3 00000000008E8D04 Unknown Unknown Unknown
cice_auscom_360x3 00000000008E8B16 Unknown Unknown Unknown
cice_auscom_360x3 0000000000867149 Unknown Unknown Unknown
cice_auscom_360x3 0000000000871ABC Unknown Unknown Unknown
libpthread-2.12.s 00002B99D4A8D7E0 Unknown Unknown Unknown
cice_auscom_360x3 00000000005464C4 ice_grid_mp_tlatl 1340 ice_grid.f90
cice_auscom_360x3 000000000054CAEB ice_grid_mp_init_ 390 ice_grid.f90
cice_auscom_360x3 000000000040EECC cice_initmod_mp_c 147 CICE_InitMod.f90
cice_auscom_360x3 000000000040C83D MAIN__ 54 CICE.f90
cice_auscom_360x3 000000000040C7DE Unknown Unknown Unknown
libc-2.12.so 00002B99D4EBDD1D __libc_start_main Unknown Unknown
cice_auscom_360x3 000000000040C6E9 Unknown Unknown Unknown
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#17 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAcN61ufYShNxpCR4FM7lZfqQG5KwvSDks5uCGuIgaJpZM4Uqqx->
.
|
I've merged all pull requests. |
I probably should have said in the last post, but I changed the config_sync function's arguments. It could be that everything compiles OK bit that garbage (or zeros) are being passed out read in the function input stack. |
It looks like the MOM changes are still outstanding: Not sure if this is why it's failing in ice_grid though. (Not in a great position to check this trace at the moment.) Apologies for causing chaos here. |
I haven't pushed the mom changes back yet but I do have them locally. It's all fine, just keen to get your changes in. |
So I did get the 1deg_jra_ryf run to work using master CICE and libaccessom2 (my changes + yours since my PR), and my local MOM branch (which may be missing any changes by you since then). I did notice one particular problem where I needed to include an empty Since you are not seeing this error, I wonder if we are running the same code. Maybe your MOM source is different from mine? The only other thing I can think of is that you may need to delete+rebuild your object/mod files for all 3 models. Edit: @nicjhan mentioned to me that he did see the |
Following up here, I didn't intend for everyone to add a whole new namelist group to every single Also, I haven't commented on whether we should just disable all interconnects and use |
@nicjhan Why are all those stack traces pointing back at .f90 files rather than the original .F90 source? Is there a change to how the compilation is done or has cice always had a separate preprocessing step? |
Hi, I'm running into the same problem (COSIMA/access-om2#173) Nic reported above:
And, like Marshall said, it seems to be linked to Thanks! |
This patch modifies CICE to support changes to libaccessom2 which allow
the user to disable intercommunicators, instead using MPI_COMM_WORLD to
exchange messages between models. This work is primarily to support
limitations in the Score-P profiler.
Specific changed detailed below.
accessom2%config_sync uses the new API, and requires an libaccessom2
update.
il_commatm has been removed from cpl_interface, since coupler%atm_root
provides this information.
Most of the work in prism_init has been removed are replaced with
equivalent operations in coupler%init_begin.
Explicit MPI send/recv calls now use ranks provided by coupler, rather
than explicit sends to 0 over OASIS-provided intercommunicators.
Similarly, data sends now happen on the zero rank of CICE, rather than
checking if the atmosphere rank of the intercommunicator is zero.
Functionally, there should be no difference, but beware the
unexpected.
Buildscripts for score-p have been added.
Further changes to cpl_interface are possible, since most of the MPI
information is provided by the coupler object.