Fix bit for bit error in OpenMP IEVA #1442

davegill · 2021-03-25T21:37:14Z

TYPE: bug fix

KEYWORDS: IEVA, OpenMP

SOURCE: internal

DESCRIPTION OF CHANGES:
Problem:
With IEVA activated, differences appear between OpenMP results with OMP_NUM_THREADS > 1 and any of the following:

1. serial results
2. MPI results
3. OpenMP results with a single thread

Solution:
Working backwards: the pre-IEVA WRF code computed the full MU field only over the mass-point tile extent:

DO j = jts, jte-1
DO i = its, ite-1

We extend the computation by one grid cell on the lower side of each horizontal index (the loops now start at its-1 and jts-1):

DO j = jts-1, jte-1
DO i = its-1, ite-1

Because WRF previously did not use those values, having an additional row and column of valid data inside the halo region is not a problem.
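To see why the extra row and column matter, here is a minimal one-dimensional Python sketch (not WRF code). It assumes that a later step averages the full MU field to a staggered point with `0.5*(mu(i) + mu(i-1))`, so each tile reads the mass point just below its own lower edge. The stencil, the stale value `-999`, the names `run` and `differing_points`, and the reverse tile order (standing in for one possible thread schedule) are all illustrative assumptions.

```python
import random

HALO = 1        # toy halo width: memory index 0 sits just outside the domain
STALE = -999.0  # stand-in for whatever an uncomputed cell happens to hold


def run(n_tiles, extend, nx=12):
    """Compute a staggered-point average on a 1-D domain split into n_tiles tiles.

    extend=False mimics `DO i = its, ite-1`; extend=True mimics `DO i = its-1, ite-1`.
    """
    rng = random.Random(0)
    mu_pert = [rng.random() for _ in range(nx + HALO)]  # inputs valid everywhere
    mu_base = [rng.random() for _ in range(nx + HALO)]
    mu_full = [STALE] * (nx + HALO)  # shared, memory-sized work array
    result = [0.0] * (nx + HALO)

    # Tile bounds (its, ite) with ite exclusive, so range(its, ite) covers its..ite-1.
    edges = [HALO + (nx * t) // n_tiles for t in range(n_tiles + 1)]
    tiles = list(zip(edges[:-1], edges[1:]))

    # Reverse order: each tile runs before its lower neighbour has filled mu_full.
    for its, ite in reversed(tiles):
        start = its - 1 if extend else its
        for i in range(start, ite):
            mu_full[i] = mu_pert[i] + mu_base[i]
        for i in range(its, ite):
            # Assumed stencil: reads i-1, which crosses the tile edge at i = its.
            result[i] = 0.5 * (mu_full[i] + mu_full[i - 1])
    return result[HALO:]


def differing_points(n_tiles, extend):
    """Indices where the n_tiles result differs from the single-tile result."""
    one, many = run(1, extend), run(n_tiles, extend)
    return [i for i, (a, b) in enumerate(zip(one, many)) if a != b]


if __name__ == "__main__":
    print("original bounds:", differing_points(3, extend=False))
    print("extended bounds:", differing_points(3, extend=True))
```

In this toy setup, the original bounds give one-tile and three-tile results that differ only at the two interior tile edges. With the extended bounds the results are identical, because every tile fills the one neighbouring cell it reads.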

This is a follow-on PR to 4412521 #1373 "Implicit Explicit Vertical Advection (IEVA)".

LIST OF MODIFIED FILES:
M dyn_em/module_big_step_utilities_em.F

TESTS CONDUCTED:

I used a simple Jan 2000 case, with 60 levels, 30-km resolution, and a 20*dx time step. This caused calls to advect_u_implicit and advect_v_implicit in the first time step. Without the mods, the code generated different results depending on the number of OpenMP threads. With the mods, the results are bit-for-bit for OpenMP with the standard y-only decomposition and with a manual x-only decomposition.

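Below is a figure of the differences in the V field after the first time step (before the modification). The plot shows the difference between two runs of the same executable using different OMP_NUM_THREADS values. After the mod, the results are bit-for-bit.

<img width="1152" alt="Screen Shot 2021-03-25 at 4 05 09 PM" src="https://user-images.githubusercontent.com/12666234/112549911-291e8e80-8d84-11eb-8b03-1e1ea50ef731.png">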

Before the mods, during the first time step, the following diffs were apparent along the OpenMP boundaries:

Diffing np=1/wrfout_d01_2000-01-24_12:10:00 np=6/wrfout_d01_2000-01-24_12:10:00
 Next Time 2000-01-24_12:10:00
     Field   Ndifs    Dims       RMS (1)            RMS (2)     DIGITS    RMSE     pntwise max
         U     49384    3   0.2158843112E+02   0.2158843113E+02   9   0.5344E-05   0.3589E-05
         V     61738    3   0.1834835473E+02   0.1834835712E+02   6   0.1045E-03   0.2183E-03
         W    139132    3   0.4977466348E-01   0.4977466098E-01   7   0.3382E-05   0.4809E-03
        PH     66955    3   0.2327166773E+04   0.2327166753E+04   8   0.1078E-02   0.7572E-05
         T      4838    3   0.7925254902E+02   0.7925254902E+02  12   0.9349E-05   0.2484E-05
       THM      4812    3   0.7921679023E+02   0.7921679023E+02  12   0.9289E-05   0.2484E-05
        MU      1286    2   0.1460135950E+04   0.1460135956E+04   8   0.1203E-02   0.5148E-05
         P      6737    3   0.6512715435E+03   0.6512716390E+03   6   0.2086E-01   0.8162E-03
    QVAPOR     26582    3   0.2913825518E-02   0.2913825518E-02   9   0.4536E-09   0.5671E-05
    QCLOUD       429    3   0.6474288021E-05   0.6474289263E-05   6   0.3257E-09   0.3024E-03
      QICE       715    3   0.4136477606E-05   0.4136463263E-05   5   0.1303E-09   0.1757E-03
     QNICE       676    3   0.4164261806E+06   0.4164261805E+06   9   0.1341E+00   0.1125E-05
    RAINNC        94    2   0.3158246772E-02   0.3158239178E-02   5   0.9447E-07   0.1558E-03
    SNOWNC        94    2   0.3158246772E-02   0.3158239178E-02   5   0.9447E-07   0.1558E-03
        SR         1    2   0.3353836226E+00   0.3353836226E+00   9   0.9006E-09   0.5960E-07

Wei successfully tested a separate case with 1x16 and 16x1 OpenMP decompositions; without the mods, that case showed bit-for-bit differences.
All Jenkins tests pass.
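For readers unfamiliar with the diff table in the first test, the following Python sketch computes a few comparable statistics for two versions of a field. The column meanings are inferred from their names (a count of differing points, the RMS of each field, and the RMSE between them); the exact definitions used by the comparison tool are an assumption here, and the DIGITS and "pntwise max" columns are not reproduced. The function name `compare_fields` is hypothetical.

```python
import math


def compare_fields(a, b):
    """Summarize how two equally sized sequences of floats differ."""
    if len(a) != len(b):
        raise ValueError("fields must have the same number of points")
    if not a:
        raise ValueError("fields must not be empty")
    n = len(a)
    return {
        # Number of points whose values compare unequal.
        "ndifs": sum(1 for x, y in zip(a, b) if x != y),
        # Root-mean-square of each field on its own.
        "rms1": math.sqrt(sum(x * x for x in a) / n),
        "rms2": math.sqrt(sum(y * y for y in b) / n),
        # Root-mean-square of the pointwise difference.
        "rmse": math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n),
    }


if __name__ == "__main__":
    print(compare_fields([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]))
```

A bit-for-bit result corresponds to `ndifs == 0` for every field, which is a stricter requirement than a small RMSE.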

modified: dyn_em/module_big_step_utilities_em.F

davegill · 2021-03-25T21:52:29Z

@weiwangncar @dudhia @louiswicker
This works for a couple of small cases for me, and for a more reasonable case for Wei. We are heading towards optimism.

davegill · 2021-03-25T22:03:02Z

@Plantain
For the IEVA code, here's a small bug fix that addresses the OpenMP bit-for-bit differences.

weiwangncar · 2021-03-25T22:09:31Z

@louiswicker You may want to see if this helps with the 'bug' you were chasing a week or two ago.

davegill · 2021-03-25T22:11:32Z

jenkins results:

Please find result of the WRF regression test cases in the attachment. This build is for Commit ID: 7361a031b36e9f887cfc9941dda823363491871c, requested by: davegill for PR: https://github.com/wrf-model/WRF/pull/1442. For any query please send e-mail to David Gill.

    Test Type              | Expected  | Received |  Failed
    = = = = = = = = = = = = = = = = = = = = = = = =  = = = =
    Number of Tests        : 19           18
    Number of Builds       : 48           46
    Number of Simulations  : 163           161        0
    Number of Comparisons  : 103           102        0

    Failed Simulations are: 
    None
    Which comparisons are not bit-for-bit: 
    None

louiswicker · 2021-03-25T22:37:22Z

Wow - thanks Dave - will try it tonight!

davegill · 2021-03-26T18:16:33Z

@weiwangncar
Wei,
This is ready for review. It is a small change that fixes the OpenMP problem. The solution introduces valid data into the halo of the memory-sized "mu" fields.

louiswicker · 2021-03-26T20:10:11Z

Is there a way for me to pull this fixed code since it's not merged?

NVM - I figured out how to do this from davegill's repo

weiwangncar · 2021-03-26T20:16:58Z

@louiswicker Yes, Lou. Do this
git clone https://github.com/davegill/WRF.git
Once you have it, do 'git checkout ieva_bf' to see Dave's change.

louiswicker · 2021-03-26T20:18:15Z

Yes, thanks - I need to think before posting. got it and compiling now. Lou




davegill added bug Dynamics ARW Develop Branch labels Mar 25, 2021

weiwangncar approved these changes Mar 26, 2021


davegill merged commit c25b4d9 into wrf-model:develop Mar 27, 2021
