Fix restart and clean up pcgsoi #108

Closed
jderber-NOAA opened this issue Feb 1, 2021 · 6 comments · Fixed by #125

@jderber-NOAA
Contributor

This update is to fix the restart process, optimize it and clean up pcgsoi.

With the inclusion of ensembles, the restart option was broken. It was relatively straightforward to get it working again, but the cost of the straightforward fix was too high: both the added run time and the size of the restart file (gesfile) were very large.

I realized that the equality By=x is maintained by the GSI. For that reason, it was only necessary to save yhat and not xhat. In the previous code, both xhat and yhat were saved and distributed. Saving only yhat therefore cuts the file size in half and halves the amount of data distributed to the processors.
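
The saving can be illustrated with a small NumPy sketch. It assumes that the equality means xhat = B yhat for a symmetric positive-definite background error covariance B, and that xhat is rebuilt by applying B when the restart is read. The matrix, the sizes, and the `write_restart`/`read_restart` helpers are illustrative stand-ins, not GSI routines.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Stand-in for the background error covariance B (symmetric positive definite).
a = rng.standard_normal((n, n))
B = a @ a.T + n * np.eye(n)

# State held by the minimization at restart time; by construction xhat = B yhat.
yhat = rng.standard_normal(n)
xhat = B @ yhat


def write_restart(yhat):
    """Hypothetical writer: keep only yhat, so half as many values are stored."""
    return yhat.copy()


def read_restart(saved_yhat, B):
    """Hypothetical reader: rebuild xhat from the saved yhat by applying B."""
    return B @ saved_yhat, saved_yhat


saved = write_restart(yhat)
xhat_restored, yhat_restored = read_restart(saved, B)

print("xhat recovered:", np.allclose(xhat_restored, xhat))
print("values stored:", saved.size, "instead of", xhat.size + yhat.size)
```

The trade-off in this sketch is one extra application of B on read in exchange for writing and distributing half as much data.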

Also, the bias correction coefficients are not saved, because for the gfs to gdas process, the radiance bias correction does not have the same number of values because of the monitored data. Also, the aircraft bias correction may not be in the same order or have the same number of values due to the later cut-off of the gdas. Note it is easy to turn on the radiance bias correction coefficients by setting writebias = .true. in jfunc.f90. If the number of bias correction coefficients is different between the write and read, the bias correction coefficient guess will be turned off.
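
The fallback described in the last sentence can be sketched as a simple count check. This is only an illustration of the rule; `bias_guess_or_none` and its arguments are hypothetical names, not the GSI implementation, and the coefficient values are made up.

```python
import numpy as np


def bias_guess_or_none(saved_coeffs, n_expected):
    """Return the saved coefficients only if their count matches the current run.

    Hypothetical helper: a mismatch means the coefficient guess is not used.
    """
    saved_coeffs = np.asarray(saved_coeffs, dtype=float)
    if saved_coeffs.size != n_expected:
        return None  # guess turned off
    return saved_coeffs


# The writing run saved 4 coefficients but the reading run expects 6: no guess.
print(bias_guess_or_none([0.1, -0.2, 0.05, 0.0], n_expected=6))

# The counts agree, so the saved values can serve as the guess.
print(bias_guess_or_none([0.1, -0.2, 0.05, 0.0], n_expected=4))
```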

In pcgsoi.f90, the restart guess file is used to define a search direction for the first iteration. When no bias correction coefficients are used in the guess, the original grad values are used. It is not yet clear how best to use this option. It may be best to add an additional outer iteration, with the first outer iteration being just a few iterations to update the guess and use it for the quality control. Or it may be best just to use the guess to reduce the number of iterations necessary.

The code was also updated to remove much of the code that was duplicated between the preconditioned and non-preconditioned paths selected by the diag_precon flag. The results should be the same (within roundoff) if diag_precon is false and the diagonal preconditioning is equal to a constant. Note that changes to this constant will only result in a commensurate change to the stepsize. When diag_precon is false, the default is to use the step_start value. By choosing the step_start value well, the stepsizes should be kept close to 1.
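
The claim that changing the constant only rescales the stepsize can be checked on a toy quadratic. The sketch below is a generic preconditioned conjugate gradient, not the pcgsoi code. The scalar `precon` plays the role of a constant diagonal preconditioner, and the matrix, sizes, and names are all illustrative assumptions.

```python
import numpy as np


def pcg(A, b, precon, n_iter):
    """Preconditioned conjugate gradient for J(x) = 0.5 x^T A x - b^T x.

    `precon` is a scalar c standing in for a constant preconditioner c*I.
    Returns the final iterate and the list of step sizes used.
    """
    x = np.zeros_like(b)
    g = A @ x - b          # gradient of J at x
    z = precon * g         # preconditioned gradient
    d = -z                 # search direction
    steps = []
    for _ in range(n_iter):
        Ad = A @ d
        alpha = (g @ z) / (d @ Ad)   # exact line search for a quadratic
        x = x + alpha * d
        g_new = g + alpha * Ad
        z_new = precon * g_new
        beta = (g_new @ z_new) / (g @ z)
        d = -z_new + beta * d
        g, z = g_new, z_new
        steps.append(alpha)
    return x, steps


rng = np.random.default_rng(1)
m = rng.standard_normal((8, 8))
A = m @ m.T + 8 * np.eye(8)   # symmetric positive definite
b = rng.standard_normal(8)

x1, steps1 = pcg(A, b, precon=1.0, n_iter=4)
x2, steps2 = pcg(A, b, precon=0.25, n_iter=4)

print("same iterates:", np.allclose(x1, x2))
print("steps scale by 1/c:", np.allclose(np.array(steps2) * 0.25, steps1))
```

In this toy setting the search direction scales with the constant and the step scales with its inverse, so the product, and hence every iterate, is unchanged up to roundoff.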

Changes are put into branch restart on jderber's fork.

@jderber-NOAA
Contributor Author

One GFS-GDAS experiment from Russ.

gsistat_uvtq_omi.pdf
    gsistat o-g 02 (start of second outer loop) rms for sonde uv, t, and q
    black is the control (00_no_restart).   red is the experiment (00_restart)
gsistat_uvtq_oma.pdf
    gsistat o-g 03 (obs - anl) rms for sonde uv, t, and q
    black is the control (00_no_restart).   red is the experiment (00_restart) 
gfs_gdas_restart.xlsx
    plot reduction in gradient and total penalty as function of iteration count

gsistat_uvtq_omi(2).pdf
gsistat_uvtq_oma.pdf

@jderber-NOAA
Contributor Author

Without restart:

< The total amount of wall time = 2213.345014
< The total amount of time in user mode = 8996.492462
< The total amount of time in sys mode = 803.270299
< The maximum resident set size (KB) = 13323260
< Number of page faults without I/O activity = 12191414

With restart:

The total amount of wall time = 2266.436881
The total amount of time in user mode = 9061.502364
The total amount of time in sys mode = 798.655362
The maximum resident set size (KB) = 13253816
Number of page faults without I/O activity = 12330620

For this run, using the restart file added about 53 seconds of wall time.

@jderber-NOAA
Contributor Author

Looking at pcgsoi.f90 a bit more, I realized that the calls to the control2state routines that create sval and sbias at the beginning of the main pcgsoi loop were unnecessary. The control2state operators are linear, so sval and sbias can instead be updated within stpcalc. This also allows us to get rid of the xhat array. There are 2 calls to control2state after the end of the loop which can also be removed.
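
The linearity argument can be sketched as follows. A plain matrix stands in for the control2state operator, and the sketch assumes the search direction is already available in state space. Every name and size here is illustrative rather than taken from the GSI source.

```python
import numpy as np

rng = np.random.default_rng(2)
n_control, n_state = 5, 7

# Stand-in for a linear control-to-state operator (a plain matrix here).
C = rng.standard_normal((n_state, n_control))


def control2state(x):
    """Toy linear operator; the real routine is far more involved."""
    return C @ x


xhat = rng.standard_normal(n_control)   # control vector
dirx = rng.standard_normal(n_control)   # search direction in control space
step = 0.37

sval = control2state(xhat)              # state-space counterpart of xhat
dval = control2state(dirx)              # state-space search direction

# Option A: update xhat, then convert again (the kind of call that was removed).
sval_recomputed = control2state(xhat + step * dirx)

# Option B: update sval directly; no extra operator call and no xhat needed.
sval_updated = sval + step * dval

print("agree within roundoff:", np.allclose(sval_recomputed, sval_updated))
print("largest difference:", np.max(np.abs(sval_recomputed - sval_updated)))
```

The two options are equal in exact arithmetic but perform the floating-point operations in a different order, which is one way small roundoff differences can arise.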

While the results should be the same, there are some small roundoff differences. These changes significantly reduce the wall time.

At the end of the first outer iteration (50 inner iterations) this is the comparison of the results.

Lines without a prefix are from the new run; lines prefixed with < are from the old run of my version.

cost,grad,step,b,step? = 1 50 1.413599369093726622E+06 7.159630185341438846E+02 1.777947204786721525E+00 5.246177039572396117E-01 good
< cost,grad,step,b,step? = 1 50 1.413599369093728252E+06 7.159630185295321780E+02 1.777947204807709847E+00 5.246177039847547130E-01 good

At the end of the second outer iteration (150 iterations) the results are still very close.

cost,grad,step,b,step? = 2 150 1.457996646935702767E+06 4.217887565227282209E+01 3.625084482845297162E-01 1.188621360912918590E+00 good
< cost,grad,step,b,step? = 2 150 1.457996897047469392E+06 4.147477282199108117E+01 3.722265780298034121E-01 1.130642034505237259E+00 good

The wall time and some of the other measures are nicely reduced.

The total amount of wall time = 3545.876811
The total amount of time in user mode = 11836.560583
The total amount of time in sys mode = 423.385103
The maximum resident set size (KB) = 18212844
Number of page faults without I/O activity = 19061030
Number of page faults with I/O activity = 13
Number of times filesystem performed INPUT = 23136528
Number of times filesystem performed OUTPUT = 133319880
Number of Voluntary Context Switches = 894285
Number of InVoluntary Context Switches = 6029

< The total amount of wall time = 4424.262244
< The total amount of time in user mode = 15382.025460
< The total amount of time in sys mode = 582.312623
< The maximum resident set size (KB) = 18214936
< Number of page faults without I/O activity = 61294098
< Number of page faults with I/O activity = 21
< Number of times filesystem performed INPUT = 23657792
< Number of times filesystem performed OUTPUT = 133319856
< Number of Voluntary Context Switches = 903802
< Number of InVoluntary Context Switches = 21300

Note that both of these runs wrote a restart file. I would expect the wall times to be reduced by the same amount for both runs if the write was removed.

Both runs were made on Orion.
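
To put the Orion numbers above on a common scale, the resource summaries can be parsed and compared as percentage changes. This is a minimal sketch; the helper names are made up, and only four of the reported lines are copied in.

```python
OLD_RUN = """\
< The total amount of wall time = 4424.262244
< The total amount of time in user mode = 15382.025460
< The total amount of time in sys mode = 582.312623
< Number of page faults without I/O activity = 61294098
"""

NEW_RUN = """\
The total amount of wall time = 3545.876811
The total amount of time in user mode = 11836.560583
The total amount of time in sys mode = 423.385103
Number of page faults without I/O activity = 19061030
"""


def parse_summary(text):
    """Turn 'label = value' lines (optionally prefixed with '<') into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.lstrip("< ").strip()
        if "=" not in line:
            continue
        key, value = line.rsplit("=", 1)
        values[key.strip()] = float(value)
    return values


def percent_change(old, new):
    """Percentage change from old to new for every label present in both."""
    return {k: 100.0 * (new[k] - old[k]) / old[k] for k in old if k in new}


changes = percent_change(parse_summary(OLD_RUN), parse_summary(NEW_RUN))
for label, pct in changes.items():
    print(f"{label}: {pct:+.1f}%")
```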

Russ made a test on the operational machine (Venus) without writing a restart file. Results from Russ below:

The suggested test has been run on Venus for 2021020912. The gsi was run in the proposed operational v16 configuration: 1000 pe, 250 nodes, ptile=4, threads=7. The restart global_gsi.x was run with iguess = -1 (no restart).

The v16 gfs gsi runs 2 outer loops, 50 & 100 iterations. Here are the wall times for the gfs gsi.

master global_gsi.x:  1667.197777 seconds = 27.79 minutes
restart global_gsi.x:   1439.389504 seconds = 23.99 minutes

The wall time reduction is impressive. Here is the final "cost,grad" printout from each of the two runs:

master: cost,grad,step,b,step? = 2 100 1.218883268288697116E+06 1.066994509267928493E+01 2.022357024387213986E+00 8.657558371966026511E-01 good
restart: cost,grad,step,b,step? = 2 100 1.218883282529345481E+06 1.066984752137133796E+01 2.022368247684526388E+00 8.657532670552655629E-01 good

These terms are very similar after 150 total iterations.

The same cycle was repeated but using gdas dumps and the gdas configuration (50, 150 iterations). Here are the results:

master:  2176.425668 seconds = 36.27 minutes
restart:  1919.704149 seconds = 32.00 minutes

master: cost,grad,step,b,step? = 2 150 1.452479838349013124E+06 1.828125936910788285E+00 1.132785455596231117E+00 9.418740811004355784E-01 good
restart: cost,grad,step,b,step? = 2 150 1.452479847812221386E+06 1.274530025797174781E+00 1.556014467633248577E+00 7.985897869397546867E-01 good

Another impressive reduction in wall time with very similar fort.220 numbers on the last iteration.

To put these numbers in perspective, below are the min,max range of 12Z gfs and gdas analysis gsi wall times from v15 operations over the first 9 days of February 2021.

gfs_analysis:  1550.990224 to 1569.103925 seconds = 25.85 to 26.15 minutes
gdas_analysis:  1689.172432 to 1709.690575 seconds   = 28.15 to 28.49 minutes

The restart branch gets the v16 gdas analysis wall time within 5 minutes of v15. The v15 gfs analysis runs 50, 150 iterations, hence the longer v15 wall time with respect to v16. If we were to run the v16 gfs analysis with 50, 150 iterations, we'd likely also see the restart branch come within 5 minutes of v15.

@jderber-NOAA
Contributor Author

Additional changes were made to optimize the reading of the ensemble. Several changes were made to reduce unnecessary data motion. Also, it was noted that the pole point values were calculated only after reading in the fields, distributing them to the processors, and converting them to real. This can be done more efficiently by creating the pole point values right after the fields are read in. The calculation is done on the single precision numbers as read in (using double precision arithmetic).

Small differences are noted in the pole points (first line north pole, second south pole). e.g.,

208.156585693359 208.156583815813
204.876083374023 204.876090387503

1.512769813416526E-006 1.512769825721497E-006
-8.304822910433360E-020 -8.304822910433360E-020

4.136569131674150E-008 4.136569120571920E-008
6.392134821453510E-008 6.392135071253691E-008

U,V (U north pole, V north pole, U south pole, V south pole)

-75.6385803222656 8.52435111999512 -5.14269161224365 5.76587772369385
-75.6385836263041 8.52435155020824 -5.14269148635113 5.76587759163175

Comparisons are for the north and south poles and for different variables.
Note that the small changes occur only at the poles.
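
The size of these differences is consistent with single precision rounding. The sketch below rests on assumptions the issue does not state: that a pole value is the mean of one latitude row, and that one code path ends up holding the result in single precision. The row length and field values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
nlon = 1536  # hypothetical number of longitudes in the row next to the pole

# Stand-in for one latitude row of a field as read from file in single precision.
row32 = (200.0 + 10.0 * rng.random(nlon)).astype(np.float32)

# Single precision input, double precision arithmetic.
pole_dp = float(np.sum(row32, dtype=np.float64) / nlon)

# Same value after a round trip through single precision.
pole_sp = float(np.float32(pole_dp))

rel_diff = abs(pole_dp - pole_sp) / abs(pole_dp)

print(f"{pole_sp:.12f} {pole_dp:.12f}")
print("relative difference:", rel_diff)
```

Under these assumptions the relative difference is bounded by the single precision rounding error, which is the same order as the differences listed above.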

Also, I received a message from Manuel Pondeca saying that the Gram-Schmidt code was no longer necessary. It has been removed from the branch.

@jderber-NOAA
Contributor Author

Regression tests were performed on Hera. Results were identical only for
Test #17: netcdf_fv3_regional
Test #16: global_enkf_T62
The differences for the remaining tests were non-zero, but relatively small and expected.

@RussTreadon-NOAA
Contributor

WCOSS_D Regression Tests

Ran the authoritative master at 9c1fc15 and the forked restart branch at 5618311 on Mars. The following tests Passed.

  • global_C96_fv3aero
  • netcdf_fv3_regional
  • global_enkf_T62

The remaining tests returned Failed. Examination of the failed cases shows that, for each, the master and restart global_gsi.x compute total cost functions that agree in their leading digits. Agreement here refers to the cost function value printed in stdout. Below are the final restart and master total cost functions printed to stdout (also fort.220) for each test. Identical digits for each test are in bold font.

| test | restart | master | iteration (outer x inner) |
| --- | --- | --- | --- |
| arw_binary | **1.37774**1172865899898E+04 | **1.37774**7426331427596E+04 | 2 x 50 |
| arw_netcdf | **7.54**8777363998239161E+04 | **7.54**2587106288851646E+04 | 2 x 50 |
| global_4denvar_T126 | **1.543064699663871434E+06** | **1.543064699663871434E+06** | 1 x 5 |
| global_4dvar_T62 | **1.042547841994065**791E+06 | **1.042547841994065**559E+06 | lsqrtb= T |
| global_C96_fv3aerorad | **9.1915293**67441907525E+05 | **9.1915293**73517655768E+05 | 1 x 100 |
| global_fv3_4denvar_C192 | **4.6489609754**75465530E+05 | **4.6489609754**69609257E+05 | 2 x 5 |
| global_fv3_4denvar_T126 | **7.1966340190994**77446E+05 | **7.1966340190994**41357E+05 | 2 x 5 |
| global_lanczos_T62 | **9.77483**1188391845208E+05 | **9.77483**0356387289939E+05 | lsqrtb= T |
| global_T62_ozonly | **2.36096343**5394903627E+03 | **2.36096343**4939456874E+03 | (100, 150) |
| global_T62 | **1.41579**0636887133587E+06 | **1.41579**1106965345098E+06 | (100, 150) |
| hwrf_nmm_d2 | **2.780204**552725836402E+04 | **2.780204**049984450103E+04 | 2 x 50 |
| hwrf_nmm_d3 | **2.2327**18047840750842E+03 | **2.2327**48150726719359E+03 | 2 x 50 |
| nmm_binary | **2.773**286235806519398E+05 | **2.773**099902247575228E+05 | 2 x 50 |
| nmmb_nems_4denvar | **5.661190**307638054946E+05 | **5.661190**528588995803E+05 | 2 x 50 |
| nmm_netcdf | **3.564**990573501897597E+04 | **3.564**617700362903270E+04 | 2 x 50 |
| rtma | **1.2884105**89426600782E+05 | **1.2884105**77620275872E+05 | 2 x 10 |

While the final total penalties for global_4denvar_T126 remain identical to 19 printed digits, the gradient values differ as shown below

| test | master | restart | iteration (outer x inner) |
| --- | --- | --- | --- |
| global_4denvar_T126 | 4.088137297942034365E+03 | 4.088137297942035730E+03 | 1 x 5 |

As John noted, differences are expected and are relatively small.
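
One way to quantify "relatively small" is to count the leading digits two printed values share and to compute their relative difference. The helpers below are an illustrative sketch, not part of the regression test suite; the sample values are copied from the comparison of total cost functions above.

```python
from decimal import Decimal


def relative_difference(a: str, b: str) -> float:
    """Relative difference between two numbers given as printed strings."""
    x, y = Decimal(a), Decimal(b)
    return float(abs(x - y) / max(abs(x), abs(y)))


def matching_leading_digits(a: str, b: str) -> int:
    """Count leading digits shared by two values printed in the same E format."""
    mant_a, exp_a = a.upper().split("E")
    mant_b, exp_b = b.upper().split("E")
    if exp_a != exp_b:
        return 0
    count = 0
    for ca, cb in zip(mant_a, mant_b):
        if ca != cb:
            break
        if ca.isdigit():
            count += 1
    return count


pairs = {
    "global_T62": ("1.415790636887133587E+06", "1.415791106965345098E+06"),
    "global_4denvar_T126": ("1.543064699663871434E+06", "1.543064699663871434E+06"),
    "arw_netcdf": ("7.548777363998239161E+04", "7.542587106288851646E+04"),
}

for name, (restart, master) in pairs.items():
    digits = matching_leading_digits(restart, master)
    rel = relative_difference(restart, master)
    print(f"{name}: {digits} matching digits, relative difference {rel:.2e}")
```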

MichaelLueken added a commit to MichaelLueken/GSI that referenced this issue Mar 16, 2021
MichaelLueken added a commit to MichaelLueken/GSI that referenced this issue Mar 23, 2021
MichaelLueken added a commit to MichaelLueken/GSI that referenced this issue Mar 26, 2021
MichaelLueken added a commit to MichaelLueken/GSI that referenced this issue Mar 30, 2021
MichaelLueken added a commit to MichaelLueken/GSI that referenced this issue Mar 31, 2021
MichaelLueken added a commit that referenced this issue Apr 8, 2021
GitHub Issue #108. Fix restart and clean up pcgsoi.