Add maximum depth for grounding scheme #325
Conversation
The UCAR ftp is not responding...
Our systems guys are currently working on the machines. They should be back up in a couple of hours. Dave
I triggered Travis a couple of times today and eventually it worked, so it's passing.
This looks good to me.
I don't completely understand this change. The documentation says the max grounding depth is 25 meters, but this new threshold is 30 meters. Why are they different?
Also, if there is a landfast/grounded ice / basal stress test in our test suites (there should be!), I would like to see the effect on the test. As @phil-blain points out, this will likely trigger a QC test. I don't expect problems with either one, but would like to have the tests documented as part of this PR.
@phil-blain @JFLemieux73
JF will answer the documentation question. I'm still working on porting. Right now I'm waiting for the UCAR ftp to work so that I can download the forcing for gx1 to run the quality control...
OK, so I was able to run the base_suite. Here are the results:
The failing tests are the ones with the option alt03 active:
I'm having a Python error in numpy when running the QC test:
INFO:__main__:Running QC test on the following directories:
INFO:__main__: /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626/
INFO:__main__: /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626/
INFO:__main__:Number of files: 1825
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object maenumerate at 0x2b1260d7fd70> ignored
Traceback (most recent call last):
File "./configuration/scripts/tests/QC/cice.t-test.py", line 546, in <module>
main()
File "./configuration/scripts/tests/QC/cice.t-test.py", line 498, in main
PASSED, H1_array = two_stage_test(data_base, nfiles, data_diff, files_base[0], dir_a)
File "./configuration/scripts/tests/QC/cice.t-test.py", line 188, in two_stage_test
n_eff, H1, r1, t_crit = stage_one(data_d, num_files, mean_d, variance_d)
File "./configuration/scripts/tests/QC/cice.t-test.py", line 177, in stage_one
t_crit[x] = t_crit_table[idx]
File "/fs/home/fs1/ords/cmdd/cmde/phb001/logiciels/miniconda3/envs/cice-qc/lib/python2.7/site-packages/numpy/ma/core.py", line 3329, in __setitem__
_data[indx] = dval
ValueError: setting an array element with a sequence.
Has any of you seen that before? It may be that my numpy version is too recent...
$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib'
basemap 1.2.0 py27hf62cb97_3 conda-forge
libnetcdf 4.6.2 h056eaf5_1002 conda-forge
matplotlib 2.2.4 py27_0 conda-forge
matplotlib-base 2.2.4 py27hfd891ef_0 conda-forge
netcdf4 1.5.1.2 py27h73a1b54_1 conda-forge
numpy 1.16.4 py27h95a1406_0 conda-forge
python 2.7.15 h721da81_1008 conda-forge
python-dateutil 2.8.0 py_0 conda-forge
Can anyone with a working Python environment for the QC test tell me their versions of python, numpy and netcdf so that I can try to create a working environment?
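For reference, a minimal way to reproduce this exact traceback outside the QC script is to assign an array (rather than a scalar) to a single element of a masked array. The values below are hypothetical, not the actual cice.t-test.py code:

    import numpy as np

    # Hypothetical critical-value lookup table and masked result array.
    t_crit_table = np.array([2.571, 2.447, 2.365, 2.306])
    t_crit = np.ma.masked_all(10)

    # np.where returns a tuple of index arrays, so indexing with it
    # yields a sequence, not a scalar.
    idx = np.where(t_crit_table > 2.4)

    # Assigning that sequence to a single element fails inside
    # numpy.ma.core.MaskedArray.__setitem__ with:
    # ValueError: setting an array element with a sequence.
    t_crit[0] = t_crit_table[idx]

If that is what is happening in stage_one, then idx would be coming back as an array instead of a scalar index on some numpy versions.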
@phil-blain @mattdturner I haven't run the qc test for a little while, but at some point I had some issues with memory use, which is noted in the user guide. I don't think you are running into that problem. On conrad, where I've done this before, I had to load a version of python (module load python/gnu/2.7.9) that includes pip, and then pip install the packages. If I do that now and then do pip list, I get:
I have not run the qc test for a while, so I am not sure if the above versions are working on conrad right now. I don't know as much about python as I should, but I wonder if there is a more robust way to specify the environment, reduce the memory use and carry out the testing. @mattdturner do you have any thoughts?
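For example, maybe pinning explicit versions at environment creation time and then exporting the solved environment would make this more reproducible. The pins below are only illustrative, not a combination anyone has verified against the QC script:

    $ conda create -n cice-qc -c conda-forge python=2.7 numpy netcdf4 matplotlib basemap
    $ conda activate cice-qc
    $ conda env export > cice-qc-environment.yml   # record the exact solved versions for reuse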
It's been a while since I have run the QC test as well, but I will go ahead and make sure it works on one of the systems that I have access to and let you know what versions of modules I have.
The QC script works for me on 2 systems that I have tried. System 1:
System 2:
I will try reproducing the error that you are getting to see if I can change the scripts to make them more universally usable.
I just created a clean environment using the same command that you used.
The QC script ran without error. I then updated the packages to the most recent versions available.
The only difference between our 2 environments appears to be the version of one package. Even if I force anaconda to install that same version, the script still runs fine for me.
I'm not sure what else could be the problem. I ran the exact cases that you ran (based on the directory names output by the QC script) and have what appears to be an identical environment, yet I am unable to reproduce the error.
Thanks @mattdturner! Have you been running @phil-blain's modified code from this PR? If so, please post the QC results, so we can keep this moving while @phil-blain figures out his environment issues.
Thanks for your help in trying to debug this, Matt... unfortunately I still get the same error with both of the environments you suggested above. This is very weird.
$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib|mkl'
basemap 1.0.7 np110py27_0 defaults
blas 1.0 mkl defaults
libnetcdf 4.4.1 1 defaults
matplotlib 1.5.1 np110py27_0 defaults
mkl 11.3.3 0 defaults
netcdf4 1.2.4 np110py27_2 conda-forge
numpy 1.10.4 py27_2 defaults
python 2.7.13 0 defaults
python-dateutil 2.6.1 py27_0 defaults
$ configuration/scripts/tests/QC/cice.t-test.py ~/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626 ~/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626
INFO:__main__:Running QC test on the following directories:
INFO:__main__: /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626
INFO:__main__: /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626
INFO:__main__:Number of files: 1825
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object maenumerate at 0x2ac84f532aa0> ignored
Traceback (most recent call last):
File "configuration/scripts/tests/QC/cice.t-test.py", line 546, in <module>
main()
File "configuration/scripts/tests/QC/cice.t-test.py", line 498, in main
PASSED, H1_array = two_stage_test(data_base, nfiles, data_diff, files_base[0], dir_a)
File "configuration/scripts/tests/QC/cice.t-test.py", line 188, in two_stage_test
n_eff, H1, r1, t_crit = stage_one(data_d, num_files, mean_d, variance_d)
File "configuration/scripts/tests/QC/cice.t-test.py", line 177, in stage_one
t_crit[x] = t_crit_table[idx]
File "/fs/home/fs1/ords/cmdd/cmde/phb001/logiciels/miniconda3/envs/cice-qc-matt-1-free/lib/python2.7/site-packages/numpy/ma/core.py", line 3210, in __setitem__
_data[indx] = dval
ValueError: setting an array element with a sequence.
$ conda list | grep -E 'numpy|basemap|python|netcdf|matplotlib|mkl'
basemap 1.2.0 py27h705c2d8_0 defaults
blas 1.0 mkl defaults
libnetcdf 4.6.1 h11d0813_2 defaults
matplotlib 2.2.3 py27hb69df0a_0 defaults
mkl 2019.4 243 defaults
mkl_fft 1.0.12 py27ha843d7b_0 defaults
mkl_random 1.0.2 py27hd81dba3_0 defaults
netcdf4 1.4.2 py27h808af73_0 defaults
numpy 1.15.4 py27h7e9f1db_0 defaults
numpy-base 1.15.4 py27hde5b4d6_0 defaults
python 2.7.16 h9bab390_0 defaults
python-dateutil 2.8.0 py27_0 defaults
$ configuration/scripts/tests/QC/cice.t-test.py ~/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626 ~/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626
INFO:__main__:Running QC test on the following directories:
INFO:__main__: /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_base_20190626
INFO:__main__: /home/phb001/data/ppp2/cice/runs/brooks_intel_smoke_gx1_44x1_medium_qc.qc_test_20190626
INFO:__main__:Number of files: 1825
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object maenumerate at 0x2b6c45d5f780> ignored
Traceback (most recent call last):
File "configuration/scripts/tests/QC/cice.t-test.py", line 546, in <module>
main()
File "configuration/scripts/tests/QC/cice.t-test.py", line 498, in main
PASSED, H1_array = two_stage_test(data_base, nfiles, data_diff, files_base[0], dir_a)
File "configuration/scripts/tests/QC/cice.t-test.py", line 188, in two_stage_test
n_eff, H1, r1, t_crit = stage_one(data_d, num_files, mean_d, variance_d)
File "configuration/scripts/tests/QC/cice.t-test.py", line 177, in stage_one
t_crit[x] = t_crit_table[idx]
File "/fs/home/fs1/ords/cmdd/cmde/phb001/logiciels/miniconda3/envs/cice-qc-matt-2/lib/python2.7/site-packages/numpy/ma/core.py", line 3330, in __setitem__
_data[indx] = dval
ValueError: setting an array element with a sequence.
I added the mkl version above; maybe this is related, I really don't know...
I just ran the base suite for both the current master and this branch.
However, I did get 11 failures for the tests using this branch:
The git hash for my
I also had failures in the decomp test, but upon rerunning the suite they passed. I wonder if this has to do with the timing of when each job in this test is run.
There are sometimes issues with the order in which cases run with bfbcomp testing that lead to failures only because the baseline test has not completed first. The test suites try to order tests such that the baselines are set up and submitted well before the ones that compare against them, to minimize these issues, but it's still not perfect. But the decomp test should not experience that: it is a single job that does multiple runs sequentially, comparing results against the first output. It could be that there are some delays in the file system, so the baseline file seems to be missing when it isn't. But I'm not sure why two different machines fail this test on the branch when it doesn't happen on master. I have looked at the conrad results from @mattdturner, and when I manually compare the runs, the files match, so they should pass. I may try to see if I can reproduce this problem, but I test on conrad a lot and have not seen the false negative before for the decomp test.
I ran the conrad_intel_decomp_gx3_4x2x29x29x5 test on conrad manually using @phil-blain's maxdepthGrounding branch and it passed. I'm going to run a full suite next on conrad with 4 compilers to see if I can get the decomp test to error in that case. Separately, I think we still need to address @eclare108213's original question of validation. How can it be that the alt03 qc test from @mattdturner is bit-for-bit with the master? Or is that expected?
I was not expecting the alt03 test to be bit-for-bit, since the ice velocity is changed at the grid points that satisfy the condition that JF added... The only thing I can think of is that maybe Matt and I are not using the same version of the bathymetry file (I'm using the most recent one)...
I ran a full test suite on conrad comparing against the current master: https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#cf2585e87d687cc2e3d06. The decomp tests ran fine for me on all compilers; not sure why that test is failing intermittently. I also had alt03 pass the master comparison for 3 out of 4 compilers. It failed on the cray compiler. I think we expect it should fail on all 4 compilers. Are we not adequately testing the grounding scheme? There is one other cray failure, for a boxrestore test, that I think we can ignore; I know why that failed and it has nothing to do with the PR.
OK, well maybe the condition that JF added was already implicitly satisfied with the current bathymetry file. Maybe @JFLemieux73 you can tell us if you think that is what is happening?
If a new dataset is needed, then we need to make sure we're using it. If we don't do so already, we need to make sure new datasets are given new, unique filenames. It is NOT OK to overwrite an existing dataset with a new version, because then we can never be sure what we've got. If there is a new file, we need to give it a new name (just append v01 or a yymmdd string or something), we need to make sure that gets dropped onto the ftp site and migrated to all our test machines, and the filename needs to be updated in the run scripts.
I do not think a new dataset is needed, sorry if I was unclear. I just meant that probably the case that is checked by the new if test never occurs with the current bathymetry file.
@phil-blain I understand there may not be a new dataset, but just wanted to point out that if there is, then we need to deal with it properly. I think we're on the same page. The fact that the if test might not be invoked in our alt03 case suggests we may not be exercising the grounding scheme adequately. I also don't understand why some machines/compilers trigger a non bit-for-bit result, but others are bit-for-bit. That's something I think we should understand. I guess I could instrument the new code and have it write to the log when it's triggered, then run on a couple of different compilers and see what happens. Does that make sense? Or I could do a longer run or something. Any thoughts?
Hi, here is some background on why we are making this change. We limit grounding to a depth of 30 m based on observations of keels: the thickest keels do not exceed 30 m (see Fig 5, Amundrud et al. 2004). Note that it rarely happens in simulations that we get grounding for water depths larger than 30 m (for a realistic sea ice thickness field), but I think this limit should be included in the code to prevent potential unrealistic grounding events. By the way, I will update the documentation...

I just made a 1 year test with gx3 comparing the current grounding scheme (no limit on water depth) and the new one (with the 30 m limit). The results are BFB (because there is no grounding for water deeper than 30 m with the current scheme).

I had a quick look at Phil's gx1 results. The results are not BFB. I think this is due to the fact that gx1 has an initial sea ice field much thicker than gx3 (I would say too thick compared to obs). There is therefore some grounding for water depths larger than 30 m, which explains the differences. I will investigate the gx1 case further.
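Conceptually the change is just a depth guard on the basal stress parameter. A rough numpy sketch of the logic (illustrative only; the actual change is in the Fortran Tbu computation, and the variable names here are borrowed from the discussion):

    import numpy as np

    threshold_hw = 30.0  # maximum water depth (m) at which keels can ground

    hwu = np.array([5.0, 20.0, 35.0])  # water depth at velocity points (m)
    Tbu = np.array([1.2, 0.4, 0.7])    # hypothetical unguarded basal stress parameter

    # No grounding, hence no basal stress, where the water column is
    # deeper than the 30 m threshold.
    Tbu = np.where(hwu < threshold_hw, Tbu, 0.0)  # -> [1.2, 0.4, 0.0]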
Based on some brief discussions on the monthly telecon, we would like to try to understand this better. We would like to set up a clean bathymetry test case to check whether results are bit-for-bit and, if not, run a qc test on conrad. The namelist will be provided by @phil-blain or @JFLemieux73. Separately, we probably need to decide how much testing the grounding scheme needs moving forward. Do we need to set up a special configuration that triggers the grounding scheme frequently? Does this need to be done with gx1, or does gx3 work? How long do we need to run? Does it depend highly on the initial conditions? I am also trying to have a look at why only 1 of the 4 compilers on conrad produces non bit-for-bit answers with the gx3 alt03 test compared to the current master.
I ran a bunch of tests on conrad at gx3 for 5 days with different compilers with alt03, with some print statements in the model at the Tbu computation in ice_dyn_shared.

First, we are triggering the Tbu computation at several gridpoints at multiple timesteps at gx3. None of those seem to be at "depth" greater than 30 m (threshold_hw). That is why gx3 is bit-for-bit for our tests with this change: this change only impacts the computation when hwu >= 30.

I also looked at the non bit-for-bit result with the cray compiler (vs bit-for-bit with intel, pgi, and gnu). That seems to come from a compiler optimization issue: the updated code occasionally generates slightly different results due to compiler optimization. I have verified the code is hitting the Tbu computation at exactly the same gridcells and timesteps as the other compilers. If I run the same tests with debug flags (reduced optimization, among other compiler settings), the cray compiler is bit-for-bit with the master. So I'm fairly confident we are just seeing a compiler optimization issue with the cray compiler on conrad and that the science is identical. I also think this means we are adequately testing the grounding scheme / basalstress option with our alt03 gx3 case; it is triggered often.

Finally, in order to test the qc of these code changes, we will have to find (or create) a case where hwu >= 30 at some point. And if we find that case, I don't know whether we expect the qc to pass or not. It seems the standard "alt03" gx1 qc test on conrad does not have a case where hwu >= 30; @mattdturner indicated his qc test vs master was bit-for-bit. So if we want to do the qc test, we need a case that will result in a non bit-for-bit result. Alternatively, given that we are able to demonstrate bit-for-bit for short and long cases on different grids, and that we now understand why results are bit-for-bit so often and why the cray compiler wasn't (not related to science), maybe we should treat this PR as a minor correction/improvement that is almost always bit-for-bit and merge it as is.
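As an illustration of why optimization alone can flip a bit-for-bit comparison without changing the science: floating-point addition is not associative, so a compiler that reorders a sum can legitimately change the low-order bits (or, in this deliberately extreme example, the whole result):

    # Two mathematically identical sums, evaluated in different orders.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0  (b + c rounds back to -1e16 in double precision)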
@apcraig I think we should just go ahead and merge this PR. It is just to prevent unrealistic grounding events (not seen in the gx3 case). The parameterization is now more physically realistic (based on observations of ice keels) and we get BFB. I just need to update the documentation (will do it today or tomorrow).
Thanks for taking a closer look @apcraig. I agree with JF.
@eclare108213, are you comfortable merging this change at this point? I ran a bunch of tests over the weekend and now understand better what the code change is doing and why we get bit-for-bit most of the time. I'm not sure there is an obvious qc test to do. Separately, I still think we need to understand the problems others are having with running qc on their systems. We probably need to create an issue for that, but I think it's separate at this point. The qc test on conrad was bit-for-bit, and I think that is not unexpected.
I am comfortable with this. Let's make the QC a separate issue. Thanks!
But let's make sure JF has fixed the documentation to his satisfaction. |
Sounds good. @JFLemieux73, please let us know when the documentation is done and we will merge the PR. |
OK, the documentation has been updated.
PR checklist
Add a maximum depth at which the basal stress parameterization is active
J-F Lemieux
I have not had the time to completely port to our clusters yet (I'm working on that), but I expect the base_suite not to be BFB, since these changes will change the ice velocity at some grid points if the basal stress parameterization is active (though it will probably only change at very few grid points). I've tried to run the quality control run (5 years) on my local workstation, but that was too much to ask: it made my computer shut down...
Would it be possible for someone to run the base_suite, as well as the quality control suite? JF wanted this to make it into the release... I understand if you have other stuff to do, and I'll keep working on porting (I'm having trouble getting help with using our custom tools...). If I succeed I will update the PR with the test results.
What would be needed to test this: