not bitwise reproducible #40

The current MOM6-CICE6 config (and presumably others) is not reproducible - compare these ...
For strict bit-for-bit reproducibility [edit: setting this in ...]
These runs have identical initial conditions (cold start), identical inputs, parameters and executables:
but the resulting restarts for cice, coupler and mom6 differ (whereas the datm and drof restarts are identical):
details on which individual variables differ are here.

I'm not sure what to try next.
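The comparison tooling isn't shown in the thread; as a rough sketch, one way to list which variables differ between two restart files might look like this (paths are hypothetical; assumes xarray and NumPy are available):

```python
# Sketch: report which variables differ bitwise between two restart files
# from otherwise-identical runs. File paths below are hypothetical.
import numpy as np
import xarray as xr

run_a = xr.open_dataset("run_a/RESTART/MOM.res.nc")
run_b = xr.open_dataset("run_b/RESTART/MOM.res.nc")

for name in run_a.data_vars:
    a = run_a[name].values
    b = run_b[name].values
    if not np.issubdtype(a.dtype, np.number):
        continue  # skip non-numeric variables (e.g. timestamps)
    if a.shape != b.shape:
        print(f"{name}: shapes differ {a.shape} vs {b.shape}")
    elif not np.array_equal(a, b, equal_nan=True):
        max_diff = np.nanmax(np.abs(a.astype("f8") - b.astype("f8")))
        print(f"{name}: differs, max |a - b| = {max_diff:.3e}")
```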
Couldn't run for 1 timestep with these settings in ... due to a segmentation fault.

I was able to do a 2-timestep run. The same 3 restarts still differ:
@aekiss I haven't checked my AMIP runs but I'd be very surprised if they were reproducible! Can you share your nuopc.runseq and nuopc.runconfig? I think the ...
Thanks @kieranricardo, I hadn't realised that about the ...
Although MOM6 and CICE6 have the same grid dimensions (lon x lat = 320x384), regridding is needed because MOM6 is C-grid and we are using B-grid CICE6 (at present). The JRA55 data stream has different grid dimensions (640x320) so is also regridded. When we switch to C-grid CICE6 there will be no need to regrid to couple with MOM6, so any reproducibility issues there should disappear.
The mediator log ...

this seems odd - I expected some regridding from C to B grid
@aekiss annoyingly CMEPS only supports one grid/mesh per component. For the UM cap we only export fields on the density/pressure points, and the mapping from the velocity points to the density points happens inside the cap. Obviously this is a little suboptimal, with some fields going ...
Thanks for clarifying. That is consistent with what I understood, that the MOM6-CICE6 coupling takes place on the A grid. I had the impression that work is underway to support direct MOM6-CICE6 coupling on the C grid, but that would involve supporting more than one grid per component.
Info on bitwise reproducibility in MOM6: https://github.com/NOAA-GFDL/MOM6-examples/wiki/Developers-guide#debugging
The CMEPS driver ...
In CICE6 we use ... CICE6 supports reproducible sums, depending on the setting of ...
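As an aside (not from the thread): the reason a "reproducible sums" option exists is that floating-point addition is not associative, so a global sum accumulated in a different order (across a different MPI decomposition, OpenMP threads, or after the compiler reassociates operations at higher optimization levels) can change the final bits. A small self-contained illustration:

```python
# Floating-point addition is not associative: summing the same values in a
# different order typically changes the last few bits of the result.
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.standard_normal(100_000)

forward = 0.0
for v in x:
    forward += v

backward = 0.0
for v in x[::-1]:
    backward += v

print(f"forward  sum: {forward!r}")
print(f"backward sum: {backward!r}")
print("bitwise identical:", forward == backward)  # usually False
```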
@kieranricardo following up on your earlier comment, my ...
@kieranricardo well I'm stumped. It's not reproducible with 1 CPU for all components. I expect I'm missing something obvious...
Is there a NUOPC option to save all the coupling fields to a file (preferably both sides of the interpolation)? Dave added this to CM2 and it's been very useful.
@aekiss what?! That's bizarre... Could MOM or CICE be using threads anywhere? I'll have a closer look and see if CMEPS or CDEPS is.
@aekiss can you run for less than one coupling time step (not 100% sure if this is possible) just to verify that it's the coupling causing the issues? Hopefully that'll be reproducible. It might also be worth logging the number of OMP threads; if something is setting them > 1 then CICE at least will be parallel, which might make the cap non-reproducible.
@kieranricardo only 1 thread is being used
not sure if we can run for less than 1 coupling timestep, but I'll look into it
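For reference, a trivial sketch of the kind of OpenMP-environment logging suggested above (hypothetical; not necessarily how it was checked here):

```python
# Print the OpenMP-related environment seen by the job; values > 1 (or an
# unset variable, depending on the runtime default) mean threaded code paths
# could run in parallel and change summation order.
import os

for var in ("OMP_NUM_THREADS", "OMP_DYNAMIC", "KMP_AFFINITY"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")
```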
I tried running with this ...
Thanks for the suggestion @MartinDix - I enabled writing some ATM->MED coupler output every timestep with this commit. The resulting files span the full single-precision range and look like nonsense (though not random), e.g. ...

Is this some sort of type conversion error? Or uninitialised arrays? Though it doesn't look random enough (similar patterns appear for all variables and time steps) and I'm using the debug executable ...
In any case, these files are identical when I re-run, so they aren't related to the non-reproducibility.
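A quick sanity check on files like these is to print per-variable ranges and NaN counts, so values spanning the full single-precision range stand out (a sketch; the path is hypothetical and assumes xarray):

```python
# Sketch: print min/max/NaN counts for each floating-point variable in a
# mediator history file to spot garbage or uninitialised-looking values.
# The file path is hypothetical.
import numpy as np
import xarray as xr

ds = xr.open_dataset("archive/output000/med_hist.nc")

for name, da in ds.data_vars.items():
    if not np.issubdtype(da.dtype, np.floating):
        continue
    vals = da.values
    print(f"{name:40s} min={np.nanmin(vals):12.4e} "
          f"max={np.nanmax(vals):12.4e} nans={int(np.isnan(vals).sum())}")
```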
Suggestions from today's TWG: ...
Warm start also yields non-bitwise reproducible runs.
Explicitly setting the ...
I think @penguian mentioned a third flag we should try?
I was able to get identical restarts from two 2-time-step runs using a new executable built entirely with CIME. I'm not sure how I could muck this up, but it's always possible with me, so it would be good if someone could try to replicate. The new executable should be usable by anyone on https://github.com/dougiesquire/MOM6-CICE6/tree/om2_grid_iss36 and the restarts are here:
... which returns nothing.
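The comparison command itself isn't shown; below is a sketch of one way to confirm two restart directories match byte-for-byte (hypothetical paths). Note that a byte-level comparison is stricter than a data-level one, so differing netCDF history attributes or timestamps would also be flagged.

```python
# Sketch: compare two restart directories file-by-file; prints nothing when
# every file matches byte-for-byte. Directory paths are hypothetical.
import filecmp
from pathlib import Path

dir_a = Path("run1/restart000")
dir_b = Path("run2/restart000")

for path_a in sorted(p for p in dir_a.rglob("*") if p.is_file()):
    path_b = dir_b / path_a.relative_to(dir_a)
    if not path_b.is_file() or not filecmp.cmp(path_a, path_b, shallow=False):
        print(f"differs or missing: {path_a.relative_to(dir_a)}")
```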
@dougiesquire That is actually very good news, as it means the issue is very likely in ESMF. I'll try using the same ESMF build as you and see what happens.
(I tried building the executable to ...)
@dougiesquire you should have write access to ...
Sure, that might be helpful in the near future. I've rebuilt the CIME executable to ...
ok - you'll need to apply for the ...
A quick inspection of the differences between the Spack-built ESMF and the ESMF built by Martin shows that the latter was built in debug mode, while the former was built with optimizations. In practice, the latter sets ...

I'm struggling to use the same ESMF as the CIME build, as the netCDF version used there is not the same as the one used for the other dependencies built with Spack, so instead I'm recompiling ESMF with Spack in debug mode.
Okay, so that's confirmed: compiling ESMF in debug mode leads to reproducible runs. I'm not sure how critical ESMF is for performance, but it might be worth finding out which optimization level can be safely used to compile it.
I was also able to get reproducible runs using 48 cores.
The production executable generated with CMake is also bit-wise reproducible 🎉
I can confirm both.
Maybe ...
My ESMF build had (from ...):
I think all the work is done in C++ routines and so the F90 options are unlikely to affect the reproducibility. The Intel compiler default is -O2, so the C++ options here don't seem very restrictive. Did the Spack build use -O3?
@MartinDix I'm afraid this is not the case here, as the ...
This holds for both the Fortran and C/C++ compilers. Here are the options used by the Spack build, taken from the ...
In this case, I suspect the non-bitwise reproducibility will get fixed by setting the floating-point model to ...
I can confirm that adding ... I would say we now have a solution to this problem, so I'm closing the issue.