
not bitwise reproducible #40

Closed
aekiss opened this issue Jun 21, 2023 · 54 comments
Labels
bug Something isn't working build system Build system

Comments

@aekiss
Contributor

aekiss commented Jun 21, 2023

The current MOM6-CICE6 config (and presumably others) is not reproducible - compare these

/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1
/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2
@aekiss
Contributor Author

aekiss commented Jun 21, 2023

For strict bit-for-bit reproducibility srcTermProcessing=1 and termOrder=srcseq are required in nuopc.runseq. See details here and here.

[edit: setting this in nuopc.runseq actually isn't necessary for reproducibility]
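For reference, these are NUOPC connection options, given after the colon on the connector lines of nuopc.runseq. A hypothetical fragment (component names, coupling interval and remap methods are placeholders, not our actual run sequence):

```text
runSeq::
@3600
  ATM -> MED :remapMethod=redist:srcTermProcessing=1:termOrder=srcseq
  MED -> OCN :remapMethod=redist:srcTermProcessing=1:termOrder=srcseq
  OCN
  MED -> ATM :remapMethod=redist:srcTermProcessing=1:termOrder=srcseq
  ATM
@
::
```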

@aekiss changed the title from "bitwise reproducibility" to "not bitwise reproducible" on Jun 21, 2023
@aekiss
Contributor Author

aekiss commented Jun 21, 2023

These runs have identical initial conditions (cold start), identical inputs, parameters and executables:

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/output000/manifests/ /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/output000/manifests/
$

but the resulting restarts for cice, coupler and mom6 differ (whereas the datm and drof restarts are identical):

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000 /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc differ

@aekiss
Contributor Author

aekiss commented Jun 21, 2023

Details on which individual variables differ are here.

@aekiss
Contributor Author

aekiss commented Jun 23, 2023

repro_test_3 and repro_test_4 confirm the lack of reproducibility with the latest debug build /g/data/ik11/inputs/access-om3/bin/access-om3-MOM6-CICE6-Debug-ce8d88e

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000 /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc differ

aekiss added a commit to ACCESS-NRI/access-om3-configs that referenced this issue Jun 23, 2023
@aekiss
Contributor Author

aekiss commented Jun 23, 2023

repro_test_5 and repro_test_6 also don't reproduce, despite using srcTermProcessing=1:termOrder=srcseq in nuopc.runseq as described here, which is supposed to provide the strictest bit-for-bit reproducibility in remapping.

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000 /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc differ

I'm not sure what to try next.

  • I'm not certain that the model has picked up these settings - where would I check that? I've checked the logs and other outputs and found nothing. Maybe I need to turn up the verbosity somewhere? Should I put these settings somewhere other than nuopc.runseq?
  • Are the individual model components reproducible? Maybe I should try standalone configs?
  • Are the CIME-built exes reproducible?
  • @kieranricardo are your AMIP configs reproducible?

@aekiss
Contributor Author

aekiss commented Jun 26, 2023

Couldn't run for 1 timestep with these settings in nuopc.runconfig

     restart_n = 1
     restart_option = nsteps
...
     stop_n = 1
     stop_option = nsteps

due to a segmentation fault.

[gadi-cpu-clx-0432:475851:0:475851] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6630)
==== backtrace (tid: 475877) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000001bb12ee mom_mp_mom_state_is_synchronized_()  /g/data/v45/aek156/access-om3-build/access-om3/MOM6/MOM6/src/core/MOM.F90:3830
 2 0x0000000001a7965d mom_ocean_model_nuopc_mp_ocean_model_restart_()  /g/data/v45/aek156/access-om3-build/access-om3/MOM6/MOM6/config_src/drivers/nuopc_cap/mom_ocean_model_nuopc.F90:723
 3 0x0000000001a0b784 mom_cap_mod_mp_modeladvance_()  /g/data/v45/aek156/access-om3-build/access-om3/MOM6/MOM6/config_src/drivers/nuopc_cap/mom_cap.F90:1690
 4 0x0000000000fd0938 ESMCI::MethodElement::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
 5 0x0000000000fd089a ESMCI::MethodTable::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
 6 0x0000000000fcf462 c_esmc_methodtableexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
 7 0x00000000007be7e2 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
 8 0x00000000069db2cd nuopc_modelbase_mp_routine_run_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2220
 9 0x00000000007ccd66 ESMCI::FTable::callVFuncPtr()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
10 0x00000000007d0e6f ESMCI_FTableCallEntryPointVMHop()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
11 0x0000000000d565aa ESMCI::VMK::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
12 0x0000000001117c72 ESMCI::VM::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
13 0x00000000007ce1ea c_esmc_ftablecallentrypointvm_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
14 0x000000000070d81d esmf_compmod_mp_esmf_compexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1222
15 0x00000000009e2e71 esmf_gridcompmod_mp_esmf_gridcomprun_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
16 0x0000000000695ea7 nuopc_driver_mp_routine_executegridcomp_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3329
17 0x00000000006956fc nuopc_driver_mp_executerunsequence_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3622
18 0x0000000000fd0938 ESMCI::MethodElement::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
19 0x0000000000fd089a ESMCI::MethodTable::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
20 0x0000000000fcf462 c_esmc_methodtableexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
21 0x00000000007be7e2 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
22 0x0000000000692052 nuopc_driver_mp_routine_run_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3250
23 0x00000000007ccd66 ESMCI::FTable::callVFuncPtr()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
24 0x00000000007d0e6f ESMCI_FTableCallEntryPointVMHop()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
25 0x0000000000d565aa ESMCI::VMK::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
26 0x0000000001117c72 ESMCI::VM::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
27 0x00000000007ce1ea c_esmc_ftablecallentrypointvm_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
28 0x000000000070d81d esmf_compmod_mp_esmf_compexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1222
29 0x00000000009e2e71 esmf_gridcompmod_mp_esmf_gridcomprun_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
30 0x0000000000695ea7 nuopc_driver_mp_routine_executegridcomp_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3329
31 0x00000000006956fc nuopc_driver_mp_executerunsequence_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3622
32 0x0000000000fd0938 ESMCI::MethodElement::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
33 0x0000000000fd089a ESMCI::MethodTable::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
34 0x0000000000fcf462 c_esmc_methodtableexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
35 0x00000000007be7e2 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
36 0x0000000000692052 nuopc_driver_mp_routine_run_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3250
37 0x00000000007ccd66 ESMCI::FTable::callVFuncPtr()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
38 0x00000000007d0e6f ESMCI_FTableCallEntryPointVMHop()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
39 0x0000000000d565aa ESMCI::VMK::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
40 0x0000000001117c72 ESMCI::VM::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
41 0x00000000007ce1ea c_esmc_ftablecallentrypointvm_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
42 0x000000000070d81d esmf_compmod_mp_esmf_compexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1222
43 0x00000000009e2e71 esmf_gridcompmod_mp_esmf_gridcomprun_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
44 0x0000000000431bb4 MAIN__()  /g/data/v45/aek156/access-om3-build/access-om3/CMEPS/CMEPS/cesm/driver/esmApp.F90:141
45 0x0000000000430d62 main()  ???:0
46 0x000000000003ad85 __libc_start_main()  ???:0
47 0x0000000000430c6e _start()  ???:0
=================================

@aekiss
Contributor Author

aekiss commented Jun 26, 2023

I was able to do a 2-timestep run. The same 3 restarts still differ:

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/ /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/GMOM_JRA.cice.r.0001-01-01-07200.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/GMOM_JRA.cice.r.0001-01-01-07200.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/GMOM_JRA.cpl.r.0001-01-01-07200.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/GMOM_JRA.cpl.r.0001-01-01-07200.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/GMOM_JRA.mom6.r.0001-01-01-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/GMOM_JRA.mom6.r.0001-01-01-00000.nc differ

@kieranricardo
Collaborator

@aekiss I haven't checked my AMIP runs but I'd be very surprised if they were reproducible! Can you share your nuopc.runseq and nuopc.runconfig?

I think the nuopc.runseq flags only affect the transfer of data between the components and the mediator. These transfers just copy data (no regridding involved), so they should be bitwise reproducible anyway. All the actual regridding happens in CMEPS, so we'd have to convince CMEPS to do bitwise-reproducible regridding. Not sure if this is possible or not; I'll have a bit of a dig around later today.
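As a side note (an illustration of the general floating-point issue, not code from any of these components): regridding sums are order-sensitive because floating-point addition is not associative, so any change in the order in which the terms are accumulated can flip the last bits of the result:

```python
# Floating-point addition is not associative: accumulating the same terms
# in a different order can change the low-order bits of the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one accumulation order
right = a + (b + c)  # the same mathematical sum, different order

print(left == right)      # False - the two orders round differently
print(abs(left - right))  # a one-ulp discrepancy, ~1.1e-16
```

This is why remapping libraries offer options to fix the term order (at some cost in parallel efficiency).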

@aekiss
Contributor Author

aekiss commented Jun 26, 2023

Thanks @kieranricardo, I hadn't realised that about the nuopc.runseq flags. Bitwise reproducibility is really important, e.g. so we can re-run sections of an experiment with different outputs or do regression testing.

@aekiss
Contributor Author

aekiss commented Jun 26, 2023

Although MOM6 and CICE6 have the same grid dimensions (lon x lat = 320x384), regridding is needed because MOM6 is C-grid and we are using B-grid CICE6 (at present). The JRA55 data stream has different grid dimensions (640x320) so is also regridded.

When we switch to C-grid CICE6 there will be no need to regrid to couple with MOM6, so any reproducibility issues there should disappear.

@aekiss
Contributor Author

aekiss commented Jun 26, 2023

The mediator log med.log shows the regridding method used for each field:

$ grep '^ mapping' /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/output000/log/med.log
 mapping atm->ocn Sa_u via patch_uv3d with one normalization
 mapping atm->ocn Sa_v via patch_uv3d with one normalization
 mapping atm->ocn Sa_z via bilnr with one normalization
 mapping atm->ocn Sa_tbot via bilnr with one normalization
 mapping atm->ocn Sa_pbot via bilnr with one normalization
 mapping atm->ocn Sa_shum via bilnr with one normalization
 mapping atm->ocn Sa_ptem via bilnr with one normalization
 mapping atm->ocn Sa_dens via bilnr with one normalization
 mapping atm->ocn Faxa_swnet via consf with one normalization
 mapping atm->ocn Faxa_rainc via consf with one normalization
 mapping atm->ocn Faxa_rainl via consf with one normalization
 mapping atm->ocn Faxa_snowc via consf with one normalization
 mapping atm->ocn Faxa_snowl via consf with one normalization
 mapping atm->ocn Faxa_lwdn via consf with one normalization
 mapping atm->ocn Faxa_swndr via consf with one normalization
 mapping atm->ocn Faxa_swvdr via consf with one normalization
 mapping atm->ocn Faxa_swndf via consf with one normalization
 mapping atm->ocn Faxa_swvdf via consf with one normalization
 mapping atm->ocn Sa_pslv via bilnr with one normalization
 mapping atm->ice Sa_u via patch_uv3d with one normalization
 mapping atm->ice Sa_v via patch_uv3d with one normalization
 mapping atm->ice Sa_z via bilnr with one normalization
 mapping atm->ice Sa_tbot via bilnr with one normalization
 mapping atm->ice Sa_pbot via bilnr with one normalization
 mapping atm->ice Sa_shum via bilnr with one normalization
 mapping atm->ice Sa_ptem via bilnr with one normalization
 mapping atm->ice Sa_dens via bilnr with one normalization
 mapping atm->ice Faxa_swnet via consf with one normalization
 mapping atm->ice Faxa_rainc via consf with one normalization
 mapping atm->ice Faxa_rainl via consf with one normalization
 mapping atm->ice Faxa_snowc via consf with one normalization
 mapping atm->ice Faxa_snowl via consf with one normalization
 mapping atm->ice Faxa_lwdn via consf with one normalization
 mapping atm->ice Faxa_swndr via consf with one normalization
 mapping atm->ice Faxa_swvdr via consf with one normalization
 mapping atm->ice Faxa_swndf via consf with one normalization
 mapping atm->ice Faxa_swvdf via consf with one normalization
 mapping atm->ice Faxa_bcph via consf with one normalization
 mapping atm->ice Faxa_dstwet via consf with one normalization
 mapping atm->ice Faxa_dstdry via consf with one normalization
 mapping atm->ice Sa_pslv via bilnr with one normalization
 mapping ocn->ice So_omask via fcopy
 mapping ocn->ice So_t via fcopy
 mapping ocn->ice So_s via fcopy
 mapping ocn->ice So_u via fcopy
 mapping ocn->ice So_v via fcopy
 mapping ocn->ice So_dhdx via fcopy
 mapping ocn->ice So_dhdy via fcopy
 mapping ocn->ice Fioo_q via fcopy
 mapping ice->ocn Faii_swnet via fcopy
 mapping ice->ocn Si_ifrac via fcopy
 mapping ice->ocn Fioi_swpen via fcopy
 mapping ice->ocn Fioi_swpen_vdr via fcopy
 mapping ice->ocn Fioi_swpen_vdf via fcopy
 mapping ice->ocn Fioi_swpen_idr via fcopy
 mapping ice->ocn Fioi_swpen_idf via fcopy
 mapping ice->ocn Fioi_melth via fcopy
 mapping ice->ocn Fioi_taux via fcopy
 mapping ice->ocn Fioi_tauy via fcopy
 mapping ice->ocn Fioi_meltw via fcopy
 mapping ice->ocn Fioi_salt via fcopy
 mapping rof->ocn Forr_rofl via rof2ocn_liq with none normalization
 mapping rof->ocn Forr_rofi via rof2ocn_ice with none normalization

This seems odd - I expected some regridding from the C grid to the B grid for these:

 mapping ocn->ice So_u via fcopy
 mapping ocn->ice So_v via fcopy

@kieranricardo
Collaborator

@aekiss annoyingly CMEPS only supports one grid/mesh per component. For the UM cap we only export fields on the density/pressure points, and the mapping from the velocity points to the density points happens inside the cap. Obviously this is a little suboptimal, with some fields going UM v points -> UM p points -> MOM p points -> MOM v points. I think the MOM and CICE caps must be doing the same thing, although I haven't found this in the code.

@aekiss
Contributor Author

aekiss commented Jun 26, 2023

Thanks for clarifying. That is consistent with what I understood, that the MOM6-CICE6 coupling takes place on the A grid. I had the impression that work is underway to support direct MOM6-CICE6 coupling on the C grid, but that would involve supporting more than one grid per component.

@aekiss
Contributor Author

aekiss commented Jun 26, 2023

Info on bitwise reproducibility in MOM6: https://github.com/NOAA-GFDL/MOM6-examples/wiki/Developers-guide#debugging

@aekiss
Contributor Author

aekiss commented Jun 28, 2023

@aekiss
Contributor Author

aekiss commented Jun 28, 2023

In CICE6 we use kdyn=1 (EVP rheology), which is the only dynamics option that's bit-for-bit reproducible.

CICE6 supports reproducible sums depending on the setting of bfbflag, but this only affects global diagnostics written to the CICE log file, not the prognostic variables, which are bit-for-bit identical with any bfbflag. We use bfbflag = "off", so global diagnostics won't be reproducible.
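To illustrate why a dedicated algorithm is needed for order-independent global sums (hypothetical values, not CICE code): a naive sum depends on where cancellations happen in the accumulation order, whereas an exact summation such as Python's math.fsum does not:

```python
import math

# The same three values summed in two different orders:
order_a = [1e16, 1.0, -1e16]
order_b = [1e16, -1e16, 1.0]

print(sum(order_a))  # 0.0 - the 1.0 is absorbed into 1e16 before the cancellation
print(sum(order_b))  # 1.0 - the big terms cancel first, so the 1.0 survives

# An exact (correctly rounded) sum is independent of the ordering:
print(math.fsum(order_a))  # 1.0
print(math.fsum(order_b))  # 1.0
```

Decomposition-independent sums matter for global diagnostics because the accumulation order changes with the number of MPI ranks.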

@aekiss
Contributor Author

aekiss commented Jun 28, 2023

@kieranricardo following up on your earlier comment, my nuopc.runseq and nuopc.runconfig are in the access_exe branch here.

@aekiss
Contributor Author

aekiss commented Jun 30, 2023

@kieranricardo well I'm stumped. It's not reproducible with 1 CPU for all components. I expect I'm missing something obvious...

@MartinDix

Is there a NUOPC option to save all the coupling fields to a file (preferably on both sides of the interpolation)?

Dave added this to CM2 and it's been very useful.

@kieranricardo
Collaborator

@aekiss what?! That's bizarre.... Could MOM or CICE be using threads anywhere? I'll have a closer look and see if CMEPS or CDEPS is.

@kieranricardo
Collaborator

kieranricardo commented Jul 3, 2023

@aekiss can you run for less than one coupling time step (not 100% sure if this is possible) just to verify that it's the coupling causing the issues? Hopefully that'll be reproducible.

It might also be worth logging the number of OMP threads, if something is setting them > 1 then CICE at least will be parallel which might make the cap non-reproducible.

@aekiss
Contributor Author

aekiss commented Jul 3, 2023

@kieranricardo only 1 thread is being used

$ grep OMP_ /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_9/output000/env.yaml
OMP_NUM_THREADS: '1'

Not sure if we can run for less than 1 coupling timestep, but I'll look into it.

@aekiss
Contributor Author

aekiss commented Jul 3, 2023

I tried running with this nuopc.runseq in the hope that it would be an uncoupled run, but it aborted after apparently initialising all components

runSeq::
@3600
  ICE
  ROF
  OCN
  ATM
@
::

@micaeljtoliveira micaeljtoliveira added the bug Something isn't working label Jul 4, 2023
@aekiss
Contributor Author

aekiss commented Jul 4, 2023

Thanks for the suggestion @MartinDix - I enabled writing some ATM->MED coupler output every timestep with this commit.

The resulting files

/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3-26da3bf/output000/GMOM_JRA.cpl.hx.atm.1step.avrg.0001-01-01-03600.nc
/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3-26da3bf/output000/GMOM_JRA.cpl.hx.atm.1step.inst.0001-01-01-03600.nc

span the full single-precision range and look like nonsense (though not random), e.g.

[Screenshot, 2023-07-04: plotted ATM->MED coupler field showing structured but nonsensical values]
(I've restricted the range to [-1, 1] for plotting purposes)

Is this some sort of type conversion error?

Or uninitialised arrays? Though it doesn't look random enough (similar patterns appear for all variables and time steps) and I'm using the debug executable /g/data/ik11/inputs/access-om3/bin/access-om3-MOM6-CICE6-Debug-ce8d88e which I'd hope would prevent access to uninitialised memory.

@aekiss
Contributor Author

aekiss commented Jul 4, 2023

In any case, these files are identical when I re-run, so they aren't related to the non-reproducibility.

@aekiss
Contributor Author

aekiss commented Jul 12, 2023

Suggestions from today's TWG

  1. run without payu
  2. try warm starts
  3. Intel compiler flags to abort on access to uninitialised memory
  4. set fp-model precise, align, etc. compiler flags explicitly
  5. MOM checksums every timestep
  6. output all mediator fields, not just DATM-related
  7. compile with gcc
  8. use debug flags etc. on dependencies in Spack
  9. try rebuilding with CIME the same way @dougie did (i.e. not with Spack-built ESMF etc.) to see if that's reproducible
  10. use NCI builds of dependencies instead of Spack (need newer ESMF)
  11. debugging tool: https://opus.nci.org.au/display/Help/TotalView
  12. check MOM6 standalone (e.g. MOM6-SIS2 configs)
  13. check CICE6 standalone
  14. CICE6 debug flags
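For items 3 and 4, a sketch of the kind of Intel compiler flags meant (the flag selection is illustrative, not our actual build settings):

```shell
# Item 3: trap reads of uninitialised memory (ifort)
DEBUG_FFLAGS="-g -traceback -check uninit -init=snan,arrays"
# Item 4: set the floating-point model and alignment behaviour explicitly
FP_FFLAGS="-fp-model precise -align array64byte -qno-opt-dynamic-align"
```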

@micaeljtoliveira
Contributor

A warm start also yields non-bitwise-reproducible runs.

@micaeljtoliveira
Contributor

Explicitly setting the -qno-opt-dynamic-align -fp-model=precise flags does not solve the problem.

@aekiss
Contributor Author

aekiss commented Jul 12, 2023

I think @penguian mentioned a third flag we should try?

@dougiesquire
Collaborator

dougiesquire commented Jul 12, 2023

I was able to get identical restarts from two 2-time-step runs using a new executable built entirely with CIME. I'm not sure how I could muck this up, but it's always possible with me, so it would be good if someone could try to replicate this. The new executable should be usable by anyone on tm70. The config is here:

https://github.com/dougiesquire/MOM6-CICE6/tree/om2_grid_iss36

and the restarts are here:

$ diff -r /scratch/tm70/ds0092/access-om3/archive/MOM6-CICE6-0/restart000 /scratch/tm70/ds0092/access-om3/archive/MOM6-CICE6-1/restart000

which returns nothing

@micaeljtoliveira
Contributor

@dougiesquire That's actually very good news, as it means the issue is very likely in ESMF. I'll try using the same ESMF build as you and see what happens.

@dougiesquire
Collaborator

(I tried building the executable to /g/data/ik11/inputs/cime/bin/MOM6-CICE6 but I don't have permission)

@aekiss
Contributor Author

aekiss commented Jul 12, 2023

@dougiesquire you should have write access to /g/data/ik11/inputs/cime now.
While we're at it, do you want write access to all of /g/data/ik11/?

@dougiesquire
Collaborator

dougiesquire commented Jul 13, 2023

While we're at it, do you want write access to all of /g/data/ik11/?

Sure, that might be helpful in the near future.

I've rebuilt the CIME executable to /g/data/ik11/inputs/cime/bin/MOM6-CICE6/2023-07-13 and updated the path in https://github.com/dougiesquire/MOM6-CICE6/tree/om2_grid_iss36 accordingly (UPDATE: use this commit if trying to do the repro test: https://github.com/dougiesquire/MOM6-CICE6/tree/db12aefdd9dfaac283abb1d0cf3c9cf517005ae5)

@aekiss
Contributor Author

aekiss commented Jul 13, 2023

ok - you'll need to apply for the ik11_w subgroup of ik11 on mancini

@micaeljtoliveira
Contributor

A quick inspection of the differences between the Spack-built ESMF and the ESMF built by Martin shows that the latter was built in debug mode, while the former was built with optimizations. In practice, the latter sets -g, which implies -O0, while the other one sets -O (which is equivalent to -O2). This is the main difference. Other differences include the use of internal vs external LAPACK and internal vs external PIO.

I'm struggling to use the same ESMF as the CIME build, as the netCDF version used there is not the same as that of the other dependencies built with Spack, so instead I'm recompiling ESMF with Spack in debug mode.

@micaeljtoliveira
Contributor

Okay, so that's confirmed: compiling ESMF in debug mode leads to reproducible runs.

I'm not sure how critical ESMF is for performance, but it might be worth finding out which optimization level can be safely used to compile it.

@dougiesquire
Collaborator

dougiesquire commented Jul 13, 2023

For strict bit-for-bit reproducibility srcTermProcessing=1 and termOrder=srcseq are required in nuopc.runseq. See details here and here.

Note, I was able to also get reproducible runs without these set (with debug ESMF)

@dougiesquire
Collaborator

I was also able to get reproducible runs using 48 cores

@micaeljtoliveira
Contributor

The production executable generated with CMake is also bit-wise reproducible 🎉

@micaeljtoliveira
Contributor

Note, I was able to also get reproducible runs without these set (with debug ESMF)

I was also able to get reproducible runs using 48 cores

I can confirm both.

@aekiss
Contributor Author

aekiss commented Jul 13, 2023

Maybe srcTermProcessing=1 and termOrder=srcseq are set by default somewhere?

@MartinDix

Okay, so that's confirmed: compiling ESMF in debug mode leads to reproducible runs.

I'm not sure how critical ESMF is for performance, but it might be worth finding out which optimization level can be safely used to compile it.

My ESMF build had (from /scratch/tm70/mrd599/esmf-8.3.0/lib/libg/Linux.intel.x86_64_medium.openmpi.default/esmf.mk)

ESMF_F90COMPILEOPTS=-g -traceback -check arg_temp_created,bounds,format,output_conversion,stack,uninit -fPIC -debug minimal -assume realloc_lhs -m64 -mcmodel=medium -pthread -threads  -qopenmp
ESMF_CXXCOMPILEOPTS=-std=c++11 -g -traceback -Wcheck -fPIC -debug minimal -m64 -mcmodel=medium -pthread  -qopenmp

I think all the work is done in C++ routines, so the F90 options are unlikely to affect the reproducibility. The Intel compiler default is -O2, so the C++ options here don't seem very restrictive. Did the Spack build use -O3?

@micaeljtoliveira
Contributor

The Intel compiler default is -O2

@MartinDix I'm afraid this is not the case here, as the -g option is set. According to the Intel compiler manual:

This option turns off option -O2 and makes option -O0 the default unless option -O2 (or higher) is explicitly specified in the same command line.

This holds for both the Fortran and C/C++ compilers.

Here are the options used by the Spack build, taken from the esmf.mk file:

ESMF_F90COMPILEOPTS= -O -fPIC -debug minimal -assume realloc_lhs -m64 -mcmodel=small -pthread -threads  -qopenmp
ESMF_CXXCOMPILEOPTS= -std=c++11 -O -DNDEBUG -fPIC -debug minimal -m64 -mcmodel=small -pthread  -qopenmp

In this case, -O is equivalent to -O2.

I suspect the non-bitwise reproducibility will be fixed by setting the floating-point model to precise or strict.

@micaeljtoliveira
Contributor

I can confirm that adding -fp-model=precise to both C++ and Fortran flags along with -O2 when compiling ESMF yields bitwise reproducible runs.
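For the record, extra flags can be passed to an ESMF build via its standard environment variables; a sketch (the rebuild here actually went through Spack, so the exact incantation differed):

```shell
# Sketch only: build ESMF at -O2 with a value-safe FP model
export ESMF_COMPILER=intel
export ESMF_F90COMPILEOPTS="-O2 -fp-model=precise"
export ESMF_CXXCOMPILEOPTS="-O2 -fp-model=precise"
```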

I would say we now have a solution to this problem, so I'm closing the issue.
