Temporal fix for slow nudging with FMS2 #7

Merged (2 commits) on May 16, 2024

@yichengt900 commented May 14, 2024

It has been reported from our seasonal prediction workflow that the current MOM6 nudging cycle is much slower with FMS2:

With FMS1:

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
(Ocean sponges)                         48      1.046542      3.557714      1.843401      0.315346  0.008    31     0  1645

With FMS2:

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
(Ocean sponges)                         48    197.205940    199.744052    198.060822      0.332912  0.402    31     0  1645

Through deeper investigation, I found that, for reasons not yet understood, the get_external_field_info function called by horiz_interp_and_extrap_tracer_fms_id is significantly slower with FMS2. Since the field name and axis sizes are static, there is no need to re-fetch this information at every time step.

This PR introduces a new logical flag, first_time, in the get_external_field_info function so that it fetches the axes info only at the first step.
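The first_time idea can be sketched as follows. This is a minimal Python illustration of the pattern, not the Fortran change made in this PR. The class `CachedAxes`, the function `slow_fetch`, the field name "temp", and the axis sizes are all hypothetical stand-ins, and the sketch assumes (as stated above) that the field name and axis sizes never change during a run.

```python
class CachedAxes:
    """Fetch axes info on the first call only, then reuse the stored result."""

    def __init__(self, fetch_axes):
        self.fetch_axes = fetch_axes  # expensive lookup (hypothetical stand-in)
        self.first_time = True        # plays the role of the first_time flag
        self.axes = None

    def get(self, field_name):
        if self.first_time:
            # Only the first call pays the cost of the expensive lookup.
            self.axes = self.fetch_axes(field_name)
            self.first_time = False
        return self.axes


# Count how often the expensive lookup actually runs.
calls = {"n": 0}


def slow_fetch(field_name):
    calls["n"] += 1
    return (10, 20, 5)  # made-up axis sizes, for illustration only


sponge_axes = CachedAxes(slow_fetch)
for _ in range(48):  # 48 calls, like the "hits" column in the short runs
    axes = sponge_axes.get("temp")

print(calls["n"])  # the lookup ran once, not 48 times
```

The trade-off is that the cached value is never refreshed, which is only safe under the static-axes assumption above.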

Please note that this is only a temporary fix (which is why I chose to merge it into our dev/cefi branch first); we still need to investigate why the get_external_field_info function is so much slower with FMS2. In the meantime, this PR lets the model run with FMS2 at almost the same speed as with FMS1 (about 0.27% slower in total runtime for a one-year run; see the profiling below).

FMS2 with the new fix, one-year run:

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
Total runtime                            1   9425.780744   9425.898127   9425.820889      0.036968  1.000     0     0  1645
(Ocean sponges)                      17520    445.104094   1139.163714    580.620263     63.931211  0.062    31     0  1645

FMS1 with the new fix, one-year run:

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
Total runtime                            1   9400.495175   9400.592062   9400.525483      0.031271  1.000     0     0  1645
(Ocean sponges)                      17520    446.976279   1156.951049    594.715797     65.370447  0.063    31     0  1645
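
For reference, the relative numbers can be recomputed from the tavg and hits columns of the tables in this comment. The short Python sketch below only does arithmetic on values copied from those tables; the variable names are illustrative.

```python
# Average wall-clock times (tavg, in seconds) copied from the one-year tables.
total_fms2 = 9425.820889
total_fms1 = 9400.525483
slowdown_pct = 100.0 * (total_fms2 - total_fms1) / total_fms1

# Per-call cost of the "(Ocean sponges)" clock, computed as tavg / hits.
before_fms1 = 1.843401 / 48       # FMS1, before the fix
before_fms2 = 198.060822 / 48     # FMS2, before the fix
after_fms1 = 594.715797 / 17520   # FMS1, with the fix
after_fms2 = 580.620263 / 17520   # FMS2, with the fix

print(f"total runtime difference (FMS2 vs FMS1): {slowdown_pct:.2f}%")
print(f"sponge cost per call before fix: FMS1 {before_fms1:.3f} s, FMS2 {before_fms2:.3f} s")
print(f"sponge cost per call after fix:  FMS1 {after_fms1:.3f} s, FMS2 {after_fms2:.3f} s")
```

With these inputs, the total runtime difference is roughly 0.27%. The per-call sponge cost under FMS2 drops from about 4.1 s to about 0.03 s, which is in line with FMS1.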

Once this is merged, we can start using the new diag manager and sub-regional output offered by FMS2.

@andrew-c-ross

Can confirm that this PR solves the slowness in my NWA12 COBALT setup 🥳

@yichengt900
Author

Thanks, @andrew-c-ross! I'm glad to hear it solves the slowness issue (temporarily). If the modeled results also make sense (which I believe should be the case because I have 5 years of results and they are identical to the ones with FMS1), it would be great if either of you could approve this PR. That way, I can merge it and start updating our CEFI-regional-MOM6 repo accordingly.

@andrew-c-ross

Okay, approved. I wasn't sure if you wanted to merge this or not since it's a temporary fix.

@yichengt900
Author

Thanks, @andrew-c-ross. I would prefer to merge this temporary fix into our dev/cefi branch so that we can begin using FMS2's new features in our experiments. I will conduct further investigation to determine the root cause later.

@yichengt900 merged commit ca45eae into dev/cefi on May 16, 2024

@andrew-c-ross

I think the difference comes from MOM_interp_infra.F90, which is different for FMS1 and 2.

  • FMS2 ultimately uses get_var_axes_info from MOM_io.f90. This opens and closes a netCDF file on every call, and both operations are slow.
  • FMS1 uses get_extern_field_axes from time_interp_external.f90. This does not open or close (or read) netcdf files, as far as I can tell.
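
The difference between the two access patterns described in these bullets can be illustrated with a small, self-contained Python sketch. It is not the MOM6 or FMS code: a JSON file stands in for the netCDF file so that only the standard library is needed, and `read_axes_from_file`, `make_in_memory_lookup`, and the field name "temp" are hypothetical. The sketch counts file opens rather than measuring time.

```python
import json
import os
import tempfile

open_count = 0  # number of times the metadata file is opened


def read_axes_from_file(path, field_name):
    """Per-call pattern: open, read, and close the file on every request."""
    global open_count
    open_count += 1
    with open(path) as f:
        return tuple(json.load(f)[field_name])


def make_in_memory_lookup(path):
    """Load the metadata once, then answer every request from memory."""
    global open_count
    open_count += 1
    with open(path) as f:
        table = {name: tuple(axes) for name, axes in json.load(f).items()}
    return table.__getitem__


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "axes.json")
    with open(path, "w") as f:
        json.dump({"temp": [10, 20, 5]}, f)  # made-up axis sizes

    for _ in range(48):
        per_call_axes = read_axes_from_file(path, "temp")
    per_call_opens = open_count

    open_count = 0
    lookup = make_in_memory_lookup(path)
    for _ in range(48):
        cached_axes = lookup("temp")
    cached_opens = open_count

print(per_call_opens, cached_opens)  # 48 opens versus 1
```

Both patterns return the same axes; only the number of file opens differs, which is the cost the bullets above point to.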

@clouden90

Thanks, @andrew-c-ross. I also found these differences and wonder why FMS2 removed get_extern_field_axes from time_interp_external2.f90.

@yichengt900 deleted the fix/fms2_nudge branch on May 24, 2024, 14:08