
Add maximum depth for grounding scheme #325

Merged: 4 commits into CICE-Consortium:master from phil-blain:maxdepthGrounding on Jul 19, 2019

Conversation

@phil-blain (Member)

PR checklist

  • Short (1 sentence) summary of your PR:
    Add a maximum depth at which the basal stress parameterization is active
  • Developer(s):
    J-F Lemieux
  • Suggest PR reviewers from list in the column to the right. @eclare108213 @apcraig
  • Please copy the PR test results link or provide a summary of testing completed below.
    I have not had time to fully port to our clusters yet (I'm working on it), but I expect the base_suite not to be BFB, since these changes will change the ice velocity at some grid points if the basal stress parameterization is active (probably at only very few grid points, though). I've tried to run the quality control run (5 years) on my local workstation, but that was too much to ask: it made my computer shut down...
  • How much do the PR code changes differ from the unmodified code?
    • bit for bit
    • different at roundoff level
    • more substantial -> probably, not verified
  • Does this PR create or have dependencies on Icepack or any other models?
    • Yes
    • No
  • Does this PR add any new test cases?
    • Yes
    • No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/.)
    • Yes
    • No, does the documentation need to be updated at a later time?
      • Yes
      • No
  • Please provide any additional information or relevant details below:

Would it be possible for someone to run the base_suite, as well as the quality control suite? JF wanted this to make it into the release. I understand if you have other things to do, and I'll keep working on porting (I'm having trouble getting help with using our custom tools...). If I succeed I will update the PR with the test results.

What would be needed to test this:

  1. run the base_suite (it will probably not be BFB for the tests where the basalstress is active, i.e. any tests with options alt01, alt03 or boxrestore)
  2. run the quality control test (baseline + test), modifying the namelist for both as follows:
    use_bathymetry=.true.
    basalstress=.true.
    
    so that we check whether the new code changes affect the statistical response when the basal stress is active (see the namelist sketch below).
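
For reference, the namelist change might look like the following sketch. The group placement is an assumption on my part (use_bathymetry normally sits in grid_nml and basalstress in dynamics_nml); check it against the ice_in actually used by the tests:

&grid_nml
    use_bathymetry = .true.   ! read the bathymetry file (assumed group placement)
/
&dynamics_nml
    basalstress = .true.      ! activate the basal stress parameterization (assumed group placement)
/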

@phil-blain (Member, Author)

The UCAR ftp is not responding...

@dabail10 (Contributor)

Our systems guys are working on the machines currently. They should be back up in a couple hours.

Dave

@apcraig (Contributor) commented Jun 20, 2019

I triggered Travis a couple of times today and eventually it worked, so it's passing.

@apcraig requested reviews from @eclare108213 and @dabail10 on June 20, 2019 at 00:03
@dabail10 (Contributor) left a comment

This looks good to me.

@eclare108213 (Contributor) left a comment

I don't completely understand this change. The documentation says the max grounding depth is 25 meters, but this new threshold is 30 meters. Why are they different?

Also, if there is a landfast/grounded ice / basal stress test in our test suites (there should be!), I would like to see the effect on the test. As @phil-blain points out, this will likely trigger a QC test. I don't expect problems with either one, but would like to have the tests documented as part of this PR.

@eclare108213 (Contributor)

@phil-blain @JFLemieux73
Please answer the documentation question above.
Also, it looks like there is one test with basal=T and use_bathymetry=T (alt03, which is included in the base_suite). It looks like a short test, though. Is this sufficient? What did you use to test this change?

@phil-blain (Member, Author) commented Jun 27, 2019

JF will answer regarding the documentation. I'm still working on porting. Right now I'm waiting for the UCAR ftp to work so that I can download the gx1 forcing to run the quality control...
I will try to run just this alt03 test and post the result.

@phil-blain (Member, Author)

OK, so I was able to run the base_suite. Here are the results:
base_suite:

212 of 214 tests PASSED
2 of 214 tests FAILED
0 of 214 tests PENDING

The failing tests are the ones with the alt03 option active:

FAIL brooks_intel_restart_gx3_4x2_alt03 compare master-20190626 34.52 25.35 5.65 different-data
FAIL brooks_intel_restart_gx3_4x2_alt03_debug_short compare master-20190626 252.12 185.40 45.51 different-data

I'm having a Python error in numpy when running the QC test:

INFO:__main__:Running QC test on the following directories:
INFO:__main__:  /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626/
INFO:__main__:  /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626/
INFO:__main__:Number of files: 1825
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object maenumerate at 0x2b1260d7fd70> ignored
Traceback (most recent call last):
  File "./configuration/scripts/tests/QC/cice.t-test.py", line 546, in <module>
    main()
  File "./configuration/scripts/tests/QC/cice.t-test.py", line 498, in main
    PASSED, H1_array = two_stage_test(data_base, nfiles, data_diff, files_base[0], dir_a)
  File "./configuration/scripts/tests/QC/cice.t-test.py", line 188, in two_stage_test
    n_eff, H1, r1, t_crit = stage_one(data_d, num_files, mean_d, variance_d)
  File "./configuration/scripts/tests/QC/cice.t-test.py", line 177, in stage_one
    t_crit[x] = t_crit_table[idx]
  File "/fs/home/fs1/ords/cmdd/cmde/phb001/logiciels/miniconda3/envs/cice-qc/lib/python2.7/site-packages/numpy/ma/core.py", line 3329, in __setitem__
    _data[indx] = dval
ValueError: setting an array element with a sequence.

Has any of you seen that before? It may be that my numpy version is too recent...
I created a clean conda environment to install the needed packages (conda create -n cice-qc python=2.7 netCDF4 numpy matplotlib basemap) and it installed the following versions:

$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib'
basemap                   1.2.0            py27hf62cb97_3    conda-forge
libnetcdf                 4.6.2             h056eaf5_1002    conda-forge
matplotlib                2.2.4                    py27_0    conda-forge
matplotlib-base           2.2.4            py27hfd891ef_0    conda-forge
netcdf4                   1.5.1.2          py27h73a1b54_1    conda-forge
numpy                     1.16.4           py27h95a1406_0    conda-forge
python                    2.7.15            h721da81_1008    conda-forge
python-dateutil           2.8.0                      py_0    conda-forge

Can anyone with a working Python environment for the QC test tell me their versions of python, numpy and netcdf, so that I can try to create a working environment?
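
As an aside, here is a minimal, hypothetical sketch of how this exact ValueError can arise (the real trigger inside cice.t-test.py was not pinned down in this thread): numpy's masked arrays raise it when a single element is assigned a sequence, for example when an index that should be a scalar is actually an array, or the tuple returned by np.where:

import numpy as np

t_crit = np.ma.zeros(5)             # masked array, as in stage_one()
t_crit_table = np.arange(10.0)      # hypothetical lookup table of critical t values

idx = np.where(np.arange(10) > 7)   # a tuple of index arrays, not a scalar
t_crit[0] = t_crit_table[idx]       # ValueError: setting an array element with a sequence.

If idx only ends up non-scalar for certain inputs or numpy versions, that would also explain why the same script runs cleanly in other environments.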

@apcraig (Contributor) commented Jul 2, 2019

@phil-blain @mattdturner I haven't run the qc test for a little while, but at some point I had some issues with memory use, which is noted in the user guide. I don't think you are running into that problem. On conrad, where I've done this before, I had to load a version of python (module load python/gnu/2.7.9) that includes pip, and then pip install the packages. If I do that now and then do pip list, I get:

conrad03 home/apcraig> pip list
Package                       Version
----------------------------- -------
attrs 18.1.0
backports.functools-lru-cache 1.5
cftime 1.0.1
cycler 0.10.0
Cython 0.23.4
funcsigs 1.0.2
kiwisolver 1.0.1
matplotlib 2.2.3
more-itertools 4.1.0
netCDF4 1.4.1
nose 1.3.7
numpy 1.15.1
pip 10.0.1
pluggy 0.6.0
py 1.5.3
pyparsing 2.3.0
pytest 3.5.1
python-dateutil 2.7.5
pytz 2018.7
setuptools 18.8
six 1.10.0
subprocess32 3.5.3
wheel 0.26.0

I have not run the qc test for a while, so I am not sure if the above versions are working on conrad right now. I don't know as much about Python as I should, but I wonder if there is a more robust way to specify the environment, reduce the memory use, and carry out the testing. @mattdturner do you have any thoughts?

@mattdturner (Contributor)

Can anyone with a working Python environment for the QC test tell me their versions of python, numpy and netcdf, so that I can try to create a working environment?

It's been a while since I have run the QC test as well, but I will go ahead and make sure it works on one of the systems that I have access to and let you know what versions of modules I have.

@mattdturner (Contributor)

The QC script works for me on the two systems that I have tried.

System 1:

$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib'
basemap                   1.0.7               np110py27_0    https://repo.continuum.io/pkgs/free
libnetcdf                 4.3.3.1                       3    https://repo.continuum.io/pkgs/free
matplotlib                1.5.1               np110py27_0    https://repo.anaconda.com/pkgs/free
netcdf4                   1.2.4               np110py27_1    conda-forge
numpy                     1.10.4                   py27_2    https://repo.anaconda.com/pkgs/free
python                    2.7.13                        0    https://repo.continuum.io/pkgs/free
python-dateutil           2.4.2                    py27_0    https://repo.continuum.io/pkgs/free

System 2:

$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib'
basemap                   1.2.0            py27h705c2d8_0  
libnetcdf                 4.6.1                h11d0813_2  
matplotlib                2.2.3            py27hb69df0a_0  
netcdf4                   1.4.2            py27h808af73_0  
numpy                     1.15.4           py27h1d66e8a_0  
numpy-base                1.15.4           py27h81de0dd_0  
python                    2.7.16               h9bab390_0  
python-dateutil           2.8.0                    py27_0

I will try reproducing the error that you are getting to see if I can change the scripts to make them more universally usable.

@mattdturner (Contributor)

I just created a clean environment using the same command that you used (conda create -n cice-qc python=2.7 netCDF4 numpy matplotlib basemap).

$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib'
basemap                   1.2.0            py27h705c2d8_0  
libnetcdf                 4.6.1                h11d0813_2  
matplotlib                2.2.3            py27hb69df0a_0  
netcdf4                   1.4.2            py27h808af73_0  
numpy                     1.16.4           py27h7e9f1db_0  
numpy-base                1.16.4           py27hde5b4d6_0  
python                    2.7.16               h9bab390_0  
python-dateutil           2.8.0                    py27_0

The QC script ran without error.

I then updated the packages to the most recent versions in conda-forge and was still unable to replicate the crash.

$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib'
basemap                   1.2.0            py27h673bf1a_2    conda-forge
libnetcdf                 4.6.2             h056eaf5_1002    conda-forge
matplotlib                2.2.3            py27hb69df0a_0  
netcdf4                   1.5.1.2          py27h73a1b54_1    conda-forge
numpy                     1.16.4           py27h95a1406_0    conda-forge
python                    2.7.15            h5a48372_1009    conda-forge
python-dateutil           2.8.0                      py_0    conda-forge

The only difference between our two environments appears to be the version of matplotlib.

Even if I force anaconda to install matplotlib v2.2.4 (via conda install matplotlib=2.2.4), I am still unable to replicate the crash.

$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib'
basemap                   1.2.0            py27hf62cb97_3    conda-forge
libnetcdf                 4.6.2             h056eaf5_1002    conda-forge
matplotlib                2.2.4                    py27_0    conda-forge
matplotlib-base           2.2.4            py27hfd891ef_0    conda-forge
netcdf4                   1.5.1.2          py27h73a1b54_1    conda-forge
numpy                     1.16.4           py27h95a1406_0    conda-forge
python                    2.7.15            h5a48372_1009    conda-forge
python-dateutil           2.8.0                      py_0    conda-forge

I'm not sure what else could be the problem. I ran the exact cases that you ran (based on the directory names output by the QC script) and have what appears to be an identical environment, yet I am unable to reproduce the error.

@eclare108213 (Contributor)

Thanks @mattdturner! Have you been running @phil-blain's modified code from this PR? If so, please post the QC results, so we can keep this moving while @phil-blain figures out his environment issues.

@phil-blain (Member, Author)

Thanks for your help in trying to debug this, Matt. Unfortunately, I still get the same error with both of the environments you suggested above... this is very weird...

$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib|mkl'
basemap                   1.0.7               np110py27_0    defaults
blas                      1.0                         mkl    defaults
libnetcdf                 4.4.1                         1    defaults
matplotlib                1.5.1               np110py27_0    defaults
mkl                       11.3.3                        0    defaults
netcdf4                   1.2.4               np110py27_2    conda-forge
numpy                     1.10.4                   py27_2    defaults
python                    2.7.13                        0    defaults
python-dateutil           2.6.1                    py27_0    defaults

$ configuration/scripts/tests/QC/cice.t-test.py ~/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626 ~/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626
INFO:__main__:Running QC test on the following directories:
INFO:__main__:  /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626
INFO:__main__:  /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626
INFO:__main__:Number of files: 1825
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object maenumerate at 0x2ac84f532aa0> ignored
Traceback (most recent call last):
  File "configuration/scripts/tests/QC/cice.t-test.py", line 546, in <module>
    main()
  File "configuration/scripts/tests/QC/cice.t-test.py", line 498, in main
    PASSED, H1_array = two_stage_test(data_base, nfiles, data_diff, files_base[0], dir_a)
  File "configuration/scripts/tests/QC/cice.t-test.py", line 188, in two_stage_test
    n_eff, H1, r1, t_crit = stage_one(data_d, num_files, mean_d, variance_d)
  File "configuration/scripts/tests/QC/cice.t-test.py", line 177, in stage_one
    t_crit[x] = t_crit_table[idx]
  File "/fs/home/fs1/ords/cmdd/cmde/phb001/logiciels/miniconda3/envs/cice-qc-matt-1-free/lib/python2.7/site-packages/numpy/ma/core.py", line 3210, in __setitem__
    _data[indx] = dval
ValueError: setting an array element with a sequence.
$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib|mkl'
basemap                   1.2.0            py27h705c2d8_0    defaults
blas                      1.0                         mkl    defaults
libnetcdf                 4.6.1                h11d0813_2    defaults
matplotlib                2.2.3            py27hb69df0a_0    defaults
mkl                       2019.4                      243    defaults
mkl_fft                   1.0.12           py27ha843d7b_0    defaults
mkl_random                1.0.2            py27hd81dba3_0    defaults
netcdf4                   1.4.2            py27h808af73_0    defaults
numpy                     1.15.4           py27h7e9f1db_0    defaults
numpy-base                1.15.4           py27hde5b4d6_0    defaults
python                    2.7.16               h9bab390_0    defaults
python-dateutil           2.8.0                    py27_0    defaults
$ configuration/scripts/tests/QC/cice.t-test.py ~/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626 ~/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626
INFO:__main__:Running QC test on the following directories:
INFO:__main__:  /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626
INFO:__main__:  /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626
INFO:__main__:Number of files: 1825
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object maenumerate at 0x2b6c45d5f780> ignored
Traceback (most recent call last):
  File "configuration/scripts/tests/QC/cice.t-test.py", line 546, in <module>
    main()
  File "configuration/scripts/tests/QC/cice.t-test.py", line 498, in main
    PASSED, H1_array = two_stage_test(data_base, nfiles, data_diff, files_base[0], dir_a)
  File "configuration/scripts/tests/QC/cice.t-test.py", line 188, in two_stage_test
    n_eff, H1, r1, t_crit = stage_one(data_d, num_files, mean_d, variance_d)
  File "configuration/scripts/tests/QC/cice.t-test.py", line 177, in stage_one
    t_crit[x] = t_crit_table[idx]
  File "/fs/home/fs1/ords/cmdd/cmde/phb001/logiciels/miniconda3/envs/cice-qc-matt-2/lib/python2.7/site-packages/numpy/ma/core.py", line 3330, in __setitem__
    _data[indx] = dval
ValueError: setting an array element with a sequence.

I added the MKL version above; maybe this is related, I really don't know...

@mattdturner (Contributor)

I just ran the base suite for both the master branch in the CICE-Consortium repository and the maxdepthGrounding branch in this pull request. The results.csh script reports that the cases in question are bit-for-bit on my system:

PASS conrad_intel_restart_gx3_4x2_alt03 compare master_20190703 79.70 66.96 6.96
PASS conrad_intel_restart_gx3_4x2_alt03_debug_short compare master_20190703 334.47 263.68 30.03

However, I did get 11 failures in the decomp tests using this branch:

FAIL conrad_intel_decomp_gx3_4x2x25x29x5_squareice bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_slenderX2 bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_slenderX1 bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_roundrobin bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_sectcart bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_sectrobin bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_spacecurve bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_rakeX2 bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_rakeX1 bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_rakepop bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop
FAIL conrad_intel_decomp_gx3_4x2x25x29x5_rakeice bfbcomp conrad_intel_decomp_gx3_4x2x25x29x5_squarepop

The git hash for my master branch is a6b6c3c417590ee54348525fc04605a30526d562, and the hash for the branch in this PR is cf2585e87d687cc2e3d06847fdb5706e68261754, so both branches are up to date. I'm not familiar enough with the various bfbcomp tests to comment on what might be causing the failures. However, the failures did not occur on the master branch.

@phil-blain (Member, Author)

I also had failures in the decomp test, but upon rerunning the suite they passed. I wonder if this has to do with the timing of when each job in this test is run.
Regarding the tests with alt03, I don't really understand why they would pass on your system but not on ours...

@apcraig (Contributor) commented Jul 4, 2019

There are sometimes issues with the order in which cases run during bfbcomp testing that lead to failures only because the baseline test has not completed first. The test suites try to order tests so that the baselines are set up and submitted well before the ones that will be comparing against them, to minimize these issues, but it's still not perfect. The decomp test should not experience that, though: it is a single job that does multiple runs sequentially, comparing results against the first output. It could be that there are some delays in the file system, so the baseline file seems to be missing when it isn't. But I'm not sure why two different machines fail this test on the branch when it doesn't happen on master. I have looked at the conrad results from @mattdturner, and when I manually compare the runs, the files match, so the tests should pass. I may try to see if I can reproduce this problem, but I test on conrad a lot and have not seen a false negative for the decomp test before.

@apcraig (Contributor) commented Jul 4, 2019

I ran the conrad_intel_decomp_gx3_4x2x29x29x5 test on conrad manually using @phil-blain's maxdepthGrounding branch and it passed. I'm going to run a full suite next on conrad with 4 compilers to see if I can get the decomp test to error in that case.

Separately, I think we still need to address @eclare108213's original question of validation. How can it be that the alt03 qc test from @mattdturner is bit-for-bit with the master? Or is that expected?

@phil-blain (Member, Author)

I was not expecting the alt03 test to be bit-for-bit, since the ice velocity is changed at the grid points that satisfy the condition that JF added... The only thing I can think of is that maybe Matt and I are not using the same version of the bathymetry file (I'm using the most recent one)...

@apcraig (Contributor) commented Jul 4, 2019

I ran a full test suite on conrad comparing against the current master:

https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks

#cf2585e87d687cc2e3d06

The decomp tests ran fine for me on all compilers; not sure why that test is failing intermittently. I also had alt03 pass the master comparison for 3 out of 4 compilers. It failed on the cray compiler. I think we expect it to fail on all 4 compilers. Are we not adequately testing the grounding scheme? There is one other cray failure, for a boxrestore test, that I think we can ignore. I know why that one failed, and it has nothing to do with this PR.

@phil-blain
Copy link
Member Author

OK, well maybe the condition that JF added was already implicitly satisfied with the current bathymetry file. Maybe @JFLemieux73 you can tell us if you think that is what is happening?
Even so, in that case I think it should pass on all compilers.

@apcraig (Contributor) commented Jul 4, 2019

If a new dataset is needed, then we need to make sure we're using it. If we don't do so already, we need to make sure new datasets are given new, unique filenames. It is NOT OK to overwrite an existing dataset with a new version, because then we can never be sure what we've got. If there is a new file, we need to give it a new name (just append v01 or a yymmdd string or something), make sure it gets dropped onto the ftp site and migrated to all our test machines, and update the filename in the run scripts.

@phil-blain (Member, Author)

I do not think a new dataset is needed, sorry if I was unclear. I just meant that the case checked by the new if clause that JF added was probably never encountered, in which case it is normal for the alt03 test to be bit-for-bit.

@apcraig (Contributor) commented Jul 5, 2019

@phil-blain I understand there may not be a new dataset, but just wanted to point out if there is, then we need to deal with it properly. I think we're on the same page.

The fact that the if test might not be invoked in our alt03 case suggests we may not be exercising the grounding scheme adequately. I also don't understand why some machines/compilers trigger a non-bit-for-bit result while others are bit-for-bit. That's something I think we should understand. I guess I could instrument the new code to write to the log when it's triggered, then run on a couple of different compilers and see what happens. Does that make sense? Or I could do a longer run or something. Any thoughts?

@JFLemieux73 (Contributor)

Hi,

Here is some background on why we are making this change. We limit grounding to a depth of 30 m based on observations of keels: the thickest keels do not exceed 30 m (see Fig. 5 of Amundrud et al. 2004). Note that grounding for water depths larger than 30 m rarely happens in simulations (for a realistic sea ice thickness field), but I think this check should be included in the code to prevent potential unrealistic grounding events. By the way, I will update the documentation...

I just made a 1-year test with gx3 comparing the current grounding scheme (no limit on water depth) and the new one (with the 30 m limit). The results are BFB (because there is no grounding for water deeper than 30 m with the current scheme).

I had a quick look at Phil's gx1 results. The results are not BFB. I think this is due to the fact that gx1 has a much thicker initial sea ice field than gx3 (too thick compared to observations, I would say). There is therefore some grounding for water depths larger than 30 m, which explains the differences. I will investigate the gx1 case further.

@apcraig (Contributor) commented Jul 13, 2019

Based on some brief discussions on the monthly telecon, we would like to try to understand this better. We would like to set up a clean bathymetry test case to check whether results are bit-for-bit and, if not, run a qc test on conrad. The namelist will be provided by @phil-blain or @JFLemieux73.

Separately, we probably need to decide how much testing the grounding scheme needs moving forward. Do we need to setup a special configuration that triggers the grounding scheme frequently? Does this need to be done with gx1 or does gx3 work? How long do we need to run? Does it depend highly on the initial conditions?

I am also trying to have a look at why only 1 of the 4 compilers on conrad produces non-bit-for-bit answers with the gx3 alt03 test compared to the current master.

@apcraig (Contributor) commented Jul 13, 2019

I ran a bunch of tests on conrad at gx3 for 5 days with different compilers with alt03, with some print statements in the model at the Tbu computation in ice_dyn_shared. First, we are triggering the Tbu computation at several gridpoints at multiple timesteps at gx3. None of those seem to be at a "depth" greater than 30 m (threshold_hw). That is why gx3 is bit-for-bit for our tests of this change: the change only affects the computation when hwu >= 30.

I also looked at the non-bit-for-bit result with the cray compiler (vs bit-for-bit with intel, pgi, and gnu). That seems to come from a compiler optimization issue. The updated code generates slightly different results occasionally due to compiler optimization. I have verified the code is hitting the Tbu computation at exactly the same gridcells and timesteps as the other compilers. If I run the same tests with debug flags (reduced optimization among other compiler settings), the cray compiler is bit-for-bit with the master. So I'm fairly confident we are just seeing a compiler optimization issue with the cray compiler on conrad and that the science is identical.

I also think this means we are adequately testing the grounding scheme / basalstress option with our alt03 gx3 case. It is triggered often.

Finally, in order to qc test these code changes, we will have to find (or create) a case where hwu >= 30 at some point. And if we find that case, I don't know whether we expect the qc to pass or not. It seems the standard "alt03" gx1 qc test on conrad does not have a case where hwu >= 30; @mattdturner indicated his qc test vs master was bit-for-bit. So if we want to do the qc test, we need a case that will produce a non-bit-for-bit result. Alternatively, given that we are able to demonstrate bit-for-bit for short and long cases on different grids, and that we now understand why results are bit-for-bit so often and why the cray compiler wasn't (not related to science), maybe we should treat this PR as a minor correction/improvement that is almost always bit-for-bit and merge it as is.
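
For reference, the logic under discussion looks roughly like the following sketch. This is a hypothetical Python translation, not the actual Fortran in ice_dyn_shared; hu, au, hwu, and threshold_hw follow the names used above, while k1, k2, and alpha_b are placeholder parameters loosely based on the Lemieux et al. (2015) seabed stress formulation:

import math

threshold_hw = 30.0  # new maximum water depth (m) at which grounding can occur

def basal_stress_factor(hu, au, hwu, k1=8.0, k2=15.0, alpha_b=20.0):
    # hu: mean ice thickness (m), au: ice concentration, hwu: water depth (m)
    if hwu >= threshold_hw:
        return 0.0                 # the new check: no grounding in deep water
    hcu = au * hwu / k1            # critical thickness for keels to reach bottom
    return k2 * max(0.0, hu - hcu) * math.exp(-alpha_b * (1.0 - au))

On gx3 the hwu >= threshold_hw branch is apparently never reached where the stress is nonzero, which is consistent with the bit-for-bit results reported above.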

@JFLemieux73 (Contributor)

@apcraig I think we should just go ahead and merge this PR. It is just there to prevent unrealistic grounding events (not seen in the gx3 case). The parameterization is now more physically realistic (based on observations of ice keels), and we get BFB. I just need to update the documentation (will do it today or tomorrow).

@phil-blain (Member, Author)

Thanks for taking a closer look @apcraig. I agree with JF.

@apcraig (Contributor) commented Jul 15, 2019

@eclare108213, are you comfortable merging this change at this point? I ran a bunch of tests over the weekend and now understand better what the code change is doing and why we get bit-for-bit most of the time. I'm not sure there is an obvious qc test to do.

Separately, I still think we need to understand the problems others are having with running qc on their systems. We probably need to create an issue for that, but I think it's a separate matter at this point.

The qc test on conrad was bit-for-bit and I think that is not unexpected.

@eclare108213 (Contributor)

I am comfortable with this. Let's make the QC a separate issue. Thanks!
e

@eclare108213 (Contributor)

But let's make sure JF has fixed the documentation to his satisfaction.

@apcraig (Contributor) commented Jul 15, 2019

Sounds good. @JFLemieux73, please let us know when the documentation is done and we will merge the PR.

@JFLemieux73 (Contributor)

OK, the documentation has been updated.

@phil-blain mentioned this pull request on Jul 16, 2019
@apcraig merged commit 647aa1f into CICE-Consortium:master on Jul 19, 2019
@phil-blain deleted the maxdepthGrounding branch on August 30, 2019