Bugfix: Fix memory management issues by replacing variable length arrays with STL vectors and arrays #3075

Closed
30 of 32 tasks
georgemccabe opened this issue Feb 3, 2025 · 3 comments · Fixed by #3076, #3078, #3079 or #3082
Labels
MET: PreProcessing Tools (Point), priority: high (High Priority), requestor: METplus Team (METplus Development Team), type: bug (Fix something that is not working)

Comments

@georgemccabe
Collaborator

georgemccabe commented Feb 3, 2025

Describe the Problem

Multiple runtime problems were discovered when MET was compiled and tested with the -O2 level of optimization in support of creating a METplus build through Conda. This issue is to resolve these runtime problems related to optimization. It will only be complete when (1) all MET unit tests run to completion with no error and (2) no significant differences remain between the output from MET compiled with optimization and without.

Additional recommendations:

  • Update the Dockerfile to routinely test MET with optimization (GNU -O2) enabled to more closely match how it is run in production.
  • GNU unit testing.
    • Compiled and tested with -O1, -O2, and -O3.
    • Unit tests run to completion and output is compared to main_v12.0 NB20250204 nightly build output.
    • For -O1, 3 files have size diffs > 1% and real diffs are flagged in 1 NetCDF file:
file1: ../../test_output/pb2nc_indy/nam.20210311.t00z.prepbufr.tm00.pbl.nc
file2: /d1/projects/MET/MET_regression/main_v12.0/NB20250204/MET-main_v12.0/test_output/pb2nc_indy/nam.20210311.t00z.prepbufr.tm00.pbl.nc
ERROR: found     4 differences in var obs_val                                - max abs: 0.05004883
  • For -O2, same 4 diff files as -O1 plus 15 more related to BCLP derivation in TC-Pairs.
  • For -O3, same 19 diff files as -O2.
  • So not optimizing with FFLAGS did NOT solve the BCLP derivation diffs.
  • Intel unit testing.
    • Compiled and tested with -O1, -O2, and -O3.
    • Note that unit_modis.xml and unit_lidar2nc.xml are skipped because those tools are not compiled.
    • Also note that unit_tc_diag.xml and unit_python.xml are skipped because of issues running Python.
    • unit_point2grid.xml segfaults on the TEST: point2grid_GOES_16_AOD_TO_G212_GAUSSIAN test:
DEBUG 1: Reading data file: /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M3_G16_s20181341702215_e20181341704588_c20181341711418.nc
terminate called after throwing an instance of 'netCDF::exceptions::NcAttMeta'
  what():  NetCDF: Can't open HDF5 attribute
file: ncCheck.cpp  line:92
FATAL ERROR (SEGFAULT): Process 3478583 got signal 11 @ local time = 2025-02-05 18:09:28Z
  • For -O1, ...
  • For -O2, ...
  • For -O3, ...

The runtime problems, along with their fixes, are described below:

The point2grid_GOES_16_ADP unit test fails to run when installed via conda. I suspect that this is due to optimization being turned on, similar to the bug from #3054.
Here is an example command that fails on seneca:

point2grid \
/d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20192662141196_e20192662143569_c20192662145547.nc \
G212 \
/d1/personal/mccabe/met_test_output/point2grid/point2grid_GOES_16_ADP.nc \
-field 'name="AOD_Smoke"; level="(*,*)";' \
-adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20192662141196_e20192662143569_c20192662144526.nc \
-goes_qc 0,1 -method MAX -v 1

This produces a segfault when compiled with the -O2 optimization flag.
This problem has been fixed on the bugfix branch for this issue. It was caused by a buffer overflow: more than 512 characters were read into a 512-character buffer.

  • unit_tc_pairs.xml produces bad output in the TCDIAG lines, where the diagnostic names contain garbage strings.

While tc_pairs completes without error, the unit test validation logic fails when reading the corrupted strings. This was fixed by updating the TrackInfo::diag_name() accessor function to return a string copy of the diagnostic name rather than a const char * pointer to temporary memory.
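To illustrate the accessor change described above, here is a minimal sketch. It is not the MET source: `TrackSketch` and its members are hypothetical stand-ins, and this thread does not show how the temporary actually arose inside `TrackInfo`. The risky pattern is kept as a comment so the block stays safe to compile and run.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a class that stores diagnostic names.
class TrackSketch {
public:
    void add_diag_name(const std::string &name) { names_.push_back(name); }

    // Risky pattern (comment only): returning a pointer into a temporary.
    // The temporary string is destroyed when the function returns, so the
    // caller is left holding a dangling pointer.
    //
    //   const char *diag_name_ptr(std::size_t i) const {
    //       std::string tmp = names_[i];
    //       return tmp.c_str();   // dangling once tmp is destroyed
    //   }

    // Safer pattern: return by value so the caller owns an independent copy.
    std::string diag_name(std::size_t i) const {
        assert(i < names_.size());
        return names_[i];
    }

private:
    std::vector<std::string> names_;
};
```

Returning by value costs one string copy per call, which is usually negligible next to the risk of reading freed memory. Unoptimized builds often appear to work because the freed memory has not been reused yet, which may be why the corruption only showed up with -O2.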

  • unit_point2grid.xml aborts on the point2grid_2D_time test on line 442 of nc_cf_file.cc. The debugger indicates that the NetCDF groupId value being read is corrupted:
p *_yDim
$31 = {nullObject = false, myId = 0, groupId = 1167840000}

When it should have the same groupId as _xDim (confirmed by debugging the non-optimized version):

p *_xDim
$32 = {nullObject = false, myId = 1, groupId = 131072}

The problem lies in the read_netcdf_grid() function in nc_cf_file.cc where the _xDim and _yDim pointers were pointing to a local variable. The fix is to search for and point them to class variables that won't go out of scope.
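A minimal sketch of the pointer lifetime issue described above, under stated assumptions: `DimSketch` and `GridFileSketch` are hypothetical types, not the NetCDF or MET classes, and the member container stands in for whatever class variables the real fix points to.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a NetCDF dimension handle.
struct DimSketch {
    std::string name;
    int id;
};

class GridFileSketch {
public:
    explicit GridFileSketch(std::vector<DimSketch> dims) : dims_(std::move(dims)) {}

    // Copies would leave the pointers aimed at another object's storage.
    GridFileSketch(const GridFileSketch &) = delete;
    GridFileSketch &operator=(const GridFileSketch &) = delete;

    // Risky pattern (comment only): pointing a member at a local variable.
    //   DimSketch local = dims_[i];
    //   x_dim_ = &local;            // dangles as soon as the function returns

    // Safer pattern: search the member container and point at its elements,
    // which live as long as this object does.
    bool find_grid_dims(const std::string &x_name, const std::string &y_name) {
        x_dim_ = nullptr;
        y_dim_ = nullptr;
        for (const DimSketch &d : dims_) {
            if (d.name == x_name) x_dim_ = &d;
            if (d.name == y_name) y_dim_ = &d;
        }
        return x_dim_ != nullptr && y_dim_ != nullptr;
    }

    int x_id() const { assert(x_dim_ != nullptr); return x_dim_->id; }
    int y_id() const { assert(y_dim_ != nullptr); return y_dim_->id; }

private:
    std::vector<DimSketch> dims_;   // must not be resized while pointers are held
    const DimSketch *x_dim_ = nullptr;
    const DimSketch *y_dim_ = nullptr;
};
```

The sketch only holds up if `dims_` is never resized after the pointers are set, since a reallocation would invalidate them just as a returning function invalidates a local.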

  • All unit tests now run to completion when compiled using -g -O2. However, comparing to the output from the unoptimized nightly build runs reveals differences, including differences in the number of TCST lines written by tc_pairs:
file1: test_output/tc_pairs/al022013_interp12_fill.tcst
file2: /d1/projects/MET/MET_regression/main_v12.0/NB20250204/MET-main_v12.0/test_output/tc_pairs/al022013_interp12_fill.tcst
ERROR: differing number of rows 1162 vs. 1198 for row type TCST_TCMPR between versions 12_0 vs. 12_0 

However, all the differences in these 2 files are for the BCLP model, which is computed using NHC-provided FORTRAN code. So it's possible an optimization problem also resides in the FORTRAN code within MET. For the time being, I'll retest WITHOUT setting FFLAGS='-O2' to see if differences remain.

  • unit_tc_diag.xml segfaults the same way on Intel builds for all optimization levels. The problem occurs in the Python diagnostic computation step. unit_python.xml also fails on the python_plot_point_obs_met_nc_to_pandas test after completing many successful tests.

These Python failures could very well be related to us linking to this /nrit/ral/met-python3/bin/python3.10 version of Python built with GNU. Perhaps we need to link to an Intel Python build instead? For the time being, I've just skipped over unit_tc_diag.xml and unit_python.xml.

Expected Behavior

All MET unit tests should run to completion and create numerically equivalent output regardless of the optimization level.

Environment

Describe your runtime environment:
Reproduced on the MET development machine named seneca.

To Reproduce

Describe the steps to reproduce the behavior:

  • Compile the MET main_v12.0 branch with CFLAGS, CXXFLAGS, and FFLAGS set to include -O2.
  • Run bin/unit_test.sh to run all of the MET unit tests.

Relevant Deadlines

List relevant project deadlines here or state NONE.

Funding Source

2702701 to support conda builds.

Define the Metadata

Assignee

  • Select engineer(s) or no engineer required
  • Select scientist(s) or no scientist required

Labels

  • Review default alert labels
  • Select component(s)
  • Select priority
  • Select requestor(s)

Milestone and Projects

  • Select Milestone as the next bugfix version
  • Select Coordinated METplus-X.Y Support project for support of the current coordinated release
  • Select MET-X.Y.Z Development project for development toward the next official release

Define Related Issue(s)

Consider the impact to the other METplus components.

Bugfix Checklist

See the METplus Workflow for details.

  • Complete the issue definition above, including the Time Estimate and Funding Source.
  • Fork this repository or create a branch of main_<Version>.
    Branch name: bugfix_<Issue Number>_main_<Version>_<Description>
  • Fix the bug and test your changes.
  • Add/update log messages for easier debugging.
  • Add/update unit tests.
  • Add/update documentation.
  • Push local changes to GitHub.
  • Submit a pull request to merge into main_<Version>.
    Pull request: bugfix <Issue Number> main_<Version> <Description>
  • Define the pull request metadata, as permissions allow.
    Select: Reviewer(s) and Development issue
    Select: Milestone as the next bugfix version
    Select: Coordinated METplus-X.Y Support project for support of the current coordinated release
  • Iterate until the reviewer(s) accept and merge your changes.
  • Delete your fork or branch.
  • Complete the steps above to fix the bug on the develop branch.
    Branch name: bugfix_<Issue Number>_develop_<Description>
    Pull request: bugfix <Issue Number> develop <Description>
    Select: Reviewer(s) and Development issue
    Select: Milestone as the next official version
    Select: MET-X.Y.Z Development project for development toward the next official release
  • Close this issue.
@georgemccabe georgemccabe added alert: NEED ACCOUNT KEY Need to assign an account key to this issue alert: NEED CYCLE ASSIGNMENT Need to assign to a release development cycle alert: NEED MORE DEFINITION Not yet actionable, additional definition required type: bug Fix something that is not working labels Feb 3, 2025
@JohnHalleyGotway JohnHalleyGotway added requestor: METplus Team METplus Development Team MET: PreProcessing Tools (Point) priority: high High Priority and removed alert: NEED MORE DEFINITION Not yet actionable, additional definition required alert: NEED CYCLE ASSIGNMENT Need to assign to a release development cycle labels Feb 3, 2025
@github-project-automation github-project-automation bot moved this to 🩺 Needs Triage in METplus-6.1.0 Development Feb 3, 2025
@JohnHalleyGotway JohnHalleyGotway moved this from 🩺 Needs Triage to 🏗 In progress in METplus-6.1.0 Development Feb 3, 2025
@JohnHalleyGotway JohnHalleyGotway moved this from 📖 Backlog to 🏗 In progress in Coordinated METplus-6.0 Support Feb 3, 2025
@JohnHalleyGotway JohnHalleyGotway added this to the MET-12.0.2 (bugfix) milestone Feb 3, 2025
@JohnHalleyGotway JohnHalleyGotway removed the alert: NEED ACCOUNT KEY Need to assign an account key to this issue label Feb 3, 2025
@JohnHalleyGotway
Collaborator

This problem is due to a buffer overflow when using a fixed length character array.

Here are the contents of the DQF:flag_meanings NetCDF variable attribute in $MET_TEST_INPUT/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20192662141196_e20192662143569_c20192662144526.nc:

DQF:flag_meanings = "good_smoke_detection_retrieval_qf invalid_smoke_detection_due_to_snow_ice_clouds_or_degraded_source_data_qf good_dust_detection_retrieval_qf invalid_dust_detection_due_to_snow_ice_clouds_or_bad_source_data_qf low_confidence_smoke_detection_qf medium_confidence_smoke_detection_qf high_confidence_smoke_detection_qf low_confidence_dust_detection_qf medium_confidence_dust_detection_qf high_confidence_dust_detection_qf out_of_sun_glint_qf within_sun_glint_qf within_valid_solar_and_satellite_zenith_angle_range_qf outside_valid_solar_or_satellite_zenith_angle_range_qf" ;

That long string consists of 568 characters. However, line 170 of this get_att_value_chars(...) function only uses a buffer with length 512.

So this array overflow is very likely the source of the segfault!
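As a rough illustration of the overflow and the string-based fix, here is a minimal sketch. Both helpers are hypothetical and take the attribute text as plain arguments rather than reading it through the NetCDF API, so this is not the actual `get_att_value_chars()` code. The fixed-buffer version is bounded here so that it truncates instead of overflowing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the old pattern: a fixed 512-byte buffer.
// An unchecked copy of a 568-character attribute would write past the end.
// This version bounds the copy, so it silently truncates instead.
std::string read_attribute_fixed(const char *src) {
    assert(src != nullptr);
    constexpr std::size_t buf_len = 512;
    char buf[buf_len];
    std::strncpy(buf, src, buf_len - 1);
    buf[buf_len - 1] = '\0';   // strncpy does not terminate on truncation
    return std::string(buf);
}

// Hypothetical stand-in for the fix: size the string from the known
// attribute length, so no fixed limit exists.
std::string read_attribute_string(const char *src, std::size_t len) {
    assert(src != nullptr);
    return std::string(src, len);
}
```

With a 568-character input, the fixed-buffer helper returns only the first 511 characters while the string-based helper returns all 568. Neither outcome of a fixed buffer (overflow or truncation) is acceptable for attribute values of unknown length, which is the argument for letting the string manage its own storage.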

JohnHalleyGotway added a commit that referenced this issue Feb 3, 2025
… attribute value in a string rather than a fixed-length character array for which overflow may occur.
@JohnHalleyGotway
Collaborator

JohnHalleyGotway commented Feb 3, 2025

I am testing a fix for this on seneca in /d1/personal/johnhg/MET/MET_development/MET-bugfix_3075_main_v12.0_point2grid.

  • Updated get_att_value_chars() with this commit to eliminate the fixed-length character array on a bugfix branch.
  • Compiling with -O2 optimization enabled.
  • Running the full set of unit tests to look for additional runtime errors (see /d1/personal/johnhg/MET/MET_development/MET-bugfix_3075_main_v12.0_point2grid/internal/test_unit/unit_test.log).
  • If all are successful, diff the test_output data with the main_v12.0 nightly build output on seneca.

If all the tests run without error and no differences are flagged, I'll submit a PR to merge this change into the main_v12.0 branch and a second one to fix the develop branch.

Note that we could also consider updating this line of the development.docker environment file to replace -g with -O2 so that we're routinely testing the optimized version of the code in which we've found a handful of issues. Also recommend adding a definition for FFLAGS. The upside is that we'd be testing a compilation of MET more similar to the deployed versions, and those tests would likely run somewhat faster. The downside is that we wouldn't be able to debug well within the container, but we haven't actually been doing that in practice.

JohnHalleyGotway added a commit that referenced this issue Feb 3, 2025
…g to using -O2 optimization since that's how we configure installations on supported platforms. This makes the testing environment more similar to the deployed versions. And we've found some bugs due to unexpected behavior when compiled with -O2 optimization.
@JohnHalleyGotway
Collaborator

JohnHalleyGotway commented Feb 4, 2025

Note that running the unit tests on seneca with -O2 optimization reveals another problem in the ASCII output from TC-Pairs:

/d1/personal/johnhg/MET/MET_development/MET-bugfix_3075_main_v12.0_point2grid/internal/test_unit/../../share/met/../../bin/tc_pairs \
-adeck /d1/projects/MET/MET_test_data/unit_test/tc_data/adeck/aal092022_OFCL_SHIP_AVNO.dat \
-bdeck /d1/projects/MET/MET_test_data/unit_test/tc_data/bdeck/bal092022.dat \
-diag CIRA_DIAG_RT /d1/projects/MET/MET_test_data/unit_test/tc_data/diag/cira_diag_rt/2022/sal092022_avno_doper_20220926*_diag.dat \
-diag SHIPS_DIAG_RT /d1/projects/MET/MET_test_data/unit_test/tc_data/diag/ships_diag_rt/2022/220926*AL0922_lsdiag.dat \
-config /d1/personal/johnhg/MET/MET_development/MET-bugfix_3075_main_v12.0_point2grid/internal/test_unit/config/TCPairsConfig_DIAGNOSTICS \
-out /d1/personal/johnhg/MET/MET_development/MET-bugfix_3075_main_v12.0_point2grid/internal/test_unit/../../test_output/tc_pairs/al092022_20220926_DIAGNOSTICS \
-log /d1/personal/johnhg/MET/MET_development/MET-bugfix_3075_main_v12.0_point2grid/internal/test_unit/../../test_output/tc_pairs/tc_pairs_DIAGNOSTICS.log \
-v 4

The output file contains corrupted diagnostics names, whereas the non-optimized diagnostic names look fine:
Without and with -O2:

< V12.0.1 OFCL   BEST   NA   AL092022 AL    09      IAN        20220926_000000 000000  20220926_000000 NA        NA         TCDIAG       53    27 SHIPS_DIAG_RT  SHIPS_TRK GFS_0p50  3       DTL    311      SHRD      2       PW01          63.9
> V12.0.1 OFCL   BEST   NA   AL092022 AL    09      IAN        20220926_000000 000000  20220926_000000 NA        NA         TCDIAG       53    27 SHIPS_DIAG_RT  SHIPS_TRK GFS_0p50  3   hE V    311   hE V      2       hE V          63.9

@JohnHalleyGotway JohnHalleyGotway changed the title Bugfix: point2grid crashes with optimization Bugfix: Fix remaining runtime issues when MET is compiled with optimization Feb 4, 2025
JohnHalleyGotway added a commit that referenced this issue Feb 4, 2025
…an a pointer to temporary memory to solve the problem with diagnostic names in unit_tc_pairs.xml when compiled with optimization enabled.
JohnHalleyGotway added a commit that referenced this issue Feb 4, 2025
…ers rather than local variables which go out of scope.
JohnHalleyGotway added a commit that referenced this issue Feb 11, 2025
@JohnHalleyGotway JohnHalleyGotway linked a pull request Feb 11, 2025 that will close this issue
17 tasks
JohnHalleyGotway added a commit that referenced this issue Feb 12, 2025
…ignment operator is used when the == comparison operator is needed.
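The class of bug named in this commit can be sketched as follows. The actual code in vx_bool_calc/tokenizer.cc is not shown in this thread, so `parens_balanced` is a hypothetical example chosen only to show a comparison that must not be written as an assignment.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns true when parentheses in text are balanced.
//
// Buggy form (comment only): `if (c = '(')` assigns '(' to c, and the
// condition is then always true because '(' is nonzero. Every character
// would be counted as an opening parenthesis.
bool parens_balanced(const std::string &text) {
    int depth = 0;
    for (char c : text) {
        assert(depth >= 0);               // invariant: never negative here
        if (c == '(') {                   // comparison, not assignment
            ++depth;
        } else if (c == ')') {
            if (--depth < 0) return false;
        }
    }
    return depth == 0;
}
```

GCC's -Wall enables -Wparentheses, which warns when an assignment is used as a truth value, so building with warnings enabled is one way to catch this class of bug early.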
JohnHalleyGotway added a commit that referenced this issue Feb 12, 2025
* Per #3075, update get_att_value_chars() utility function to store the attribute value in a string rather than a fixed-length character array for which overflow may occur.

* Per #3075, switch from compiling MET in Docker using the -g debug flag to using -O2 optimization since that's how we configure installations on supported platforms. This makes the testing environment more similar to the deployed versions. And we've found some bugs due to unexpected behavior when compiled with -O2 optimization.

* Per #3075, remove accidentally committed log file

* Per #3075, update TrackInfo::diag_name() to return a string rather than a pointer to temporary memory to solve the problem with diagnostic names in unit_tc_pairs.xml when compiled with optimization enabled.

* Per #3075, update read_netcdf_logic() to store pointers to class members rather than local variables which go out of scope.

* Per #3075, don't need to use local variables at all.

* Per #3075, switch to using STL vectors for memory management

* Per #3075, reimplement month_name_to_m() with stl strings to avoid variable length arrays.

* Per #3075, update MetNcFile::readFile() to use stl vectors instead of variable length arrays

* Per #3075, update NcCfFile member functions to use stl vectors instead of variable length arrays

* Per #3075, update is_netcdf_file() to use stl vectors instead of variable length arrays

* Per #3075, update 3d_conv.cc to use stl vectors instead of variable length arrays

* Per #3075, update the vx_util library to use stl vectors instead of variable length arrays

* Per #3075, update ensemble_stat to use stl vectors instead of variable length arrays

* Per #3075, update decode_lat_lon() to use stl vectors instead of variable length arrays

* Per #3075, update grid_diag to use stl vectors instead of variable length arrays

* Per #3075, update ioda2nc to use stl vectors instead of variable length arrays

* Per #3075, update madis2nc to use stl vectors instead of variable length arrays

* Per #3075, update mode_graphics to use stl vectors instead of variable length arrays

* Per #3075, update the vx_nc_obs library to use stl vectors instead of variable length arrays

* Per #3075, update plot_point_obs to use stl vectors instead of variable length arrays

* Per #3075, update point_stat to use stl vectors instead of variable length arrays

* Per #3075, update wavelet_stat to use stl vectors instead of variable length arrays

* Per #3075, no real code change, just whitespace

* Per #3075, removing commented out code

* Per #3075, need to add 2 to account for time_count being initialized to -1. An array of length 0 is different from a vector of length 0.

* Per #3075, can't use 2D vectors to read data from NetCDF files into a contiguous block of memory.

* Per #3075, can't use 2D vectors to read data from NetCDF files into a contiguous block of memory.

* Per #3075, update looping logic

* Per #3075, eliminate all instances of vector<vector<type>> since it's not stored in contiguous memory and therefore not useful for reading data from the NetCDF files.

* Per #3075, bit more madis2nc changes.

* Per #3075, fix Nx typo

* Per #3075, fix chaNetCDF attribute character type

* Per #3075, minor changes to satisfy SonarQube findings.

* Per #3075, when sizing vectors of type <char> add one for the trailing null.

* Per #3075, remove debugging code.

* Per #3075, unit_ioda2nc.xml fails when compiled with Intel since there are issues parsing NC_STRING attribute types. Reverting back to the previous logic from main_v12.0 since that works.

* Per #3075, back out the change to using -O2 in development.docker. With it, differences are flagged by GHA. I'd like to make sure all the changes on this branch cause NO differences before switching to using -O2... most likely in the develop branch rather than main_v12.0.

* Per #3075, getting segfault from point2grid. Null terminating character vectors after reading NetCDF attributes just to be safe.

* Unrelated to #3075, only whitespace changes.

* Per #3075, fix logic of the write_nc(...) function so that all variable attributes are added and defined prior to writing the data for that variable. Writing attributes AFTER the data, as we had been doing, causes unexpected failures, as found when compiled with Intel.

* Per #3075, update args to write_nc(...) to minimize regression test diffs.

* Per #3075, fix madis2nc i_buf definition problem.

* Per #3075, more refinement of i_buf definition in madis2nc for acars and raob inputs.

* Per #3075, remove FFLAGS from development.docker because there's no good reason to add it.

* replace raw array with std array

* replace raw array with std array in 2 other places

* Per #3075, fix clear bug in vx_bool_calc/tokenizer.cc where the = assignment operator is used when the == comparison operator is needed.

---------

Co-authored-by: George McCabe <[email protected]>
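Two of the commit notes above concern buffer layout: nested `vector<vector<type>>` storage is not contiguous, and character vectors need one extra element for the trailing null. A minimal sketch of both points follows. `fill_buffer`, `Grid2D`, and `copy_c_string` are hypothetical names, and `fill_buffer` merely stands in for a library read that expects one contiguous block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a library call that fills a caller-supplied
// contiguous buffer of n values.
void fill_buffer(float *dest, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dest[i] = static_cast<float>(i);
}

// Row-major 2D data kept in one flat vector, so data() is a single block.
// A vector of vectors allocates each row separately and cannot be passed
// to fill_buffer() in one call.
class Grid2D {
public:
    Grid2D(std::size_t ny, std::size_t nx) : ny_(ny), nx_(nx), values_(ny * nx) {}

    float *data() { return values_.data(); }
    std::size_t size() const { return values_.size(); }

    float at(std::size_t y, std::size_t x) const {
        assert(y < ny_ && x < nx_);
        return values_[y * nx_ + x];
    }

private:
    std::size_t ny_;
    std::size_t nx_;
    std::vector<float> values_;
};

// Character buffer sized with one extra element for the trailing null.
std::string copy_c_string(const char *src, std::size_t len) {
    assert(src != nullptr);
    std::vector<char> buf(len + 1, '\0');
    std::memcpy(buf.data(), src, len);
    buf[len] = '\0';               // explicit terminator, just to be safe
    return std::string(buf.data());
}
```

The index arithmetic `y * nx + x` replaces the double subscript of a nested vector, at the cost of slightly less readable call sites.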
@github-project-automation github-project-automation bot moved this from 🏗 In progress to 🏁 Done in METplus-6.1.0 Development Feb 12, 2025
@github-project-automation github-project-automation bot moved this from 🏗 In progress to 🏁 Done in Coordinated METplus-6.0 Support Feb 12, 2025
JohnHalleyGotway added a commit that referenced this issue Feb 12, 2025
@JohnHalleyGotway JohnHalleyGotway linked a pull request Feb 13, 2025 that will close this issue
17 tasks
@JohnHalleyGotway JohnHalleyGotway moved this from 🏁 Done to 🔎 In review in METplus-6.1.0 Development Feb 13, 2025
@JohnHalleyGotway JohnHalleyGotway linked a pull request Feb 13, 2025 that will close this issue
JohnHalleyGotway added a commit that referenced this issue Feb 13, 2025
* use custom GitHub Action to trigger METplus use cases

* Updating values

* Bugfix #3020 main_v12.0 grid_stat_seeps (#3022)

* Per #3020, add missing GridStatNcOutInfo::do_seeps flag and use it to determine if SEEPS information should be written to the Grid-Stat NetCDF matched pairs output file.

* Unrelated to #3020, fix broken NetCDF cf-conventions links in the User's Guide.

* Per #3020, no real changes. Just whitespace

* Update to reflect usage of oneAPI compilers

* Updating file to reflect usage of oneAPI compilers

* Hotfix to the main_v12.0 branch after PR #3022 fixed a SEEPS bug. The GridStatConfig_SEEPS config file needs to be updated with nc_pairs_flag.seeps = TRUE in order for the same output to be produced by the unit tests.

* Adding In Memoriam

* Feature #3032 main_v12.0 docs data type (#3040)

* Per #3032, add data type column to all of the output tables

* Per #3032, remove the first row from each output table since its info is repeated from the table name. Additional changes for consistency and accuracy in column names.

* Update docs/Users_Guide/gsi-tools.rst

Co-authored-by: Julie Prestopnik <[email protected]>

---------

Co-authored-by: Julie Prestopnik <[email protected]>

* Making a superficial change in the main_v12.0 branch to trigger GHA to create and push an updated test output image.

* Feature #3033 v12.0.0 (#3042)

* Per #3033, update version info, consolidate release notes, and add upgrade instructions.

* Per #3033, remove all instances of 'Bugfix: ' from the release notes since it's redundant with the dropdown name

* Per #3030, based on request from Randy Pierce, also add MTD header columns to met_header_columns_v12.0.txt to make it easier to parse the output from MET.

* Per #3033, fix typo and correct alignment in table

* Update install_met_env.acorn

Removing reference to beta version

* Update install_met_env.cactus

Remove references to beta version

* Update install_met_env.cactus

Update paths for eckit and atlas

* Update install_met_env.wcoss2

Remove beta references

* Fix typo, missing one * to make SciPy bold in appendixF.rst

* Per #3051, update unit tests so that installed files are found relative to MET_BASE (<install_loc>/share/met) and other files that are only in the MET repo are found relative to MET_TEST_BASE (MET/internal/test_unit). Also remove MET_BUILD_BASE env var (#3052)

* Bugfix #3054 main_v12.0 parusr (#3068)

* Per #3054, fix PARUSR BUFRLIB error by solving the upstream reference to temporary memory returned by c_str(). Store a copy of the temporary variable name in a string rather than a pointer to temporary memory. Note that I checked all other calls to c_str() in pb2nc.cc and found these 2 instances to be the only problematic ones. All others are used as arguments to functions for which a copy is made.

* Unrelated to #3054, but discovered while investigating the dtcenter/METplus#2875 discussion, the PairBase::calc_obs_summary() function loops over map entries and attempts to update the mapped 'summary_val' value. However, the current version only updates it in a copy and not what's actually in the map. This changes how we loop over the map to actually update its contents. Note that the only impact is fixing a log file to accurately report the 'summary_val'. So this is really a logging bug.

* Per #3054, revert emplace_back() to its original push_back() to make the bugfix diffs as limited as possible.

* Per #3054, correct bugfix in PairBase::calc_obs_summary() in pair_base.cc

---------

Co-authored-by: MET Tools Test Account <[email protected]>

* Per #3070, updates for the 12.0.1 bugfix release. (#3071)

* Updating file for 12.0.1 installation for NCO

* Updating to 12.0.1 for NCO

* Update and rename 12.0.0_acorn to 12.0.1_acorn for NCO

* Rename 12.0.0.lua_wcoss2 to 12.0.1.lua_wcoss2 for NCO

* Update 12.0.0_hercules

* Update install_met_env.hercules

* Update compiler and MET version in install_met_env.orion

* Update  compiler and MET version in 12.0.0_orion

* Bugfix #3075 main_v12.0 optimization (#3076)

---------

Co-authored-by: George McCabe <[email protected]>
Co-authored-by: Julie Prestopnik <[email protected]>
Co-authored-by: John Halley Gotway <[email protected]>
Co-authored-by: MET Tools Test Account <[email protected]>
Co-authored-by: metplus-bot <[email protected]>
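One entry in the merge log above describes a loop over map entries that updated a copy rather than the stored value. The thread does not show the loop itself, so the following is only one common way that symptom arises. `ObsSummarySketch` and the helper functions are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the mapped summary record.
struct ObsSummarySketch {
    double summary_val = 0.0;
};

using SummaryMap = std::map<std::string, ObsSummarySketch>;

// Buggy form: `auto entry` copies each key/value pair, so the assignment
// changes the copy and the map is left untouched.
void update_copy(SummaryMap &m, double v) {
    for (auto entry : m) {
        entry.second.summary_val = v;
    }
}

// Fixed form: `auto &entry` binds to the pair stored in the map.
void update_in_place(SummaryMap &m, double v) {
    for (auto &entry : m) {
        entry.second.summary_val = v;
    }
}

// Small lookup helper used to inspect the stored value.
double summary_of(const SummaryMap &m, const std::string &key) {
    auto it = m.find(key);
    assert(it != m.end());
    return it->second.summary_val;
}
```

After `update_copy(m, 5.0)` the stored values are unchanged, while `update_in_place(m, 5.0)` sets every entry to 5.0.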
JohnHalleyGotway added a commit that referenced this issue Feb 13, 2025
…ed in the #3076 pull request. The logic for converting to camel case was in the wrong order and the details of this are described in this comment: #3078 (comment)
JohnHalleyGotway added a commit that referenced this issue Feb 13, 2025
* Per #3075, port changes from PR #3076 over for the develop branch.

* Per #3075, update ioda.cc to switch from using variable length arrays to using STL vectors.

* Per #3075, correct logic in unit_to_mdyhms.cc for switching a string to camel case.
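The corrected code in unit_to_mdyhms.cc is not shown in this thread, so the sketch below is only a plausible illustration of why the order of case conversions matters when matching month names with STL strings. `to_capitalized` and `month_name_to_number` are hypothetical names, and the three-letter English abbreviations are an assumption.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: normalize "JAN", "jan", or "jAN" to "Jan".
// Lowercase everything first, then capitalize the first letter. In the
// reverse order the lowercase pass would undo the capitalization.
std::string to_capitalized(std::string s) {
    for (char &c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    if (!s.empty()) {
        s[0] = static_cast<char>(std::toupper(static_cast<unsigned char>(s[0])));
    }
    // Post-condition: the first character is never left in lowercase.
    assert(s.empty() || !std::islower(static_cast<unsigned char>(s[0])));
    return s;
}

// Hypothetical lookup: returns 1-12 for a known abbreviation, 0 otherwise.
int month_name_to_number(const std::string &name) {
    static const std::array<std::string, 12> months = {{
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}};
    const std::string key = to_capitalized(name);
    for (std::size_t i = 0; i < months.size(); ++i) {
        if (key == months[i]) return static_cast<int>(i) + 1;
    }
    return 0;
}
```

Comparing `std::string` values avoids the scratch character arrays that a C-style implementation would need for the case-normalized copy.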
@JohnHalleyGotway JohnHalleyGotway moved this from 🔎 In review to 🏁 Done in METplus-6.1.0 Development Feb 13, 2025
@JohnHalleyGotway JohnHalleyGotway linked a pull request Feb 13, 2025 that will close this issue
JohnHalleyGotway added a commit that referenced this issue Feb 14, 2025
…owing for an empty input in the statement_list.
JohnHalleyGotway added a commit that referenced this issue Feb 14, 2025
* Per #3077, back out the yystate patching logic added to Makefile.am for MET #2408 to allow for the parsing of empty configuration files. This has caused a 'shift/reduce conflict' warning message from the Intel compiler, is flagged as a problem by '-fsanitize=address', and very well may be causing the sporadic yyerror failures described in #3077. Recommend testing with this change to see if the yyerrors go away, but also re-testing MET #2408 to assess the handling of empty configuration files.

* Per #3075, update config.tab.yy/.cc to allow for empty inputs by allowing for an empty input in the statement_list.
@JohnHalleyGotway JohnHalleyGotway changed the title Bugfix: Fix remaining runtime issues when MET is compiled with optimization Bugfix: Fix memory management issues by replacing variable length arrays with STL vectors and arrays Feb 14, 2025
JohnHalleyGotway added a commit that referenced this issue Feb 14, 2025
* Per #3081, roll the MET version number from 12.0.1 to 12.0.2 and add release notes.

* Per #3081, improve the title for issue #3075
JohnHalleyGotway added a commit that referenced this issue Feb 15, 2025
* use custom GitHub Action to trigger METplus use cases

* Updating values

* Bugfix #3020 main_v12.0 grid_stat_seeps (#3022)

* Per #3020, add missing GridStatNcOutInfo::do_seeps flag and use it to determine if SEEPS information should be written to the Grid-Stat NetCDF matched pairs output file.

* Unrelated to #3020, fix broken NetCDF cf-conventions links in the User's Guide.

* Per #3020, no real changes. Just whitespace

* Update to reflect usage of oneAPI compilers

* Updating file to reflect usage of oneAPI compilers

* Hotfix to the main_v12.0 branch after PR #3022 fixed a SEEPS bug. The GridStatConfig_SEEPS config file needs to be updated with nc_pairs_flag.seeps = TRUE in order for the same output to be produced by the unit tests.

* Adding In Memoriam

* Feature #3032 main_v12.0 docs data type (#3040)

* Per #3032, add data type column to all of the output tables

* Per #3032, remove the first row from each output table since its info is repeated from the table name. Additional changes for consistency and accuracy in column names.

* Update docs/Users_Guide/gsi-tools.rst

Co-authored-by: Julie Prestopnik <[email protected]>

---------

Co-authored-by: Julie Prestopnik <[email protected]>

* Making a superficial change in the main_v12.0 branch to trigger GHA to create and push an updated test output image.

* Feature #3033 v12.0.0 (#3042)

* Per #3033, update version info, consolidate release notes, and add upgrade instructions.

* Per #3033, remove all instances of 'Bugfix: ' from the release notes since it's redundant with the dropdown name

* Per #3030, based on request from Randy Pierce, also add MTD header columns to met_header_columns_v12.0.txt to make it easier to parse the output from MET.

* Per #3033, fix typo and correct alignment in table

* Update install_met_env.acorn

Removing reference to beta version

* Update install_met_env.cactus

Remove references to beta version

* Update install_met_env.cactus

Update paths for eckit and atlas

* Update install_met_env.wcoss2

Remove beta references

* Fix typo, missing one * to make SciPy bold in appendixF.rst

* Per #3051, update unit tests so that installed files are found relative to MET_BASE (<install_loc>/share/met) and other files that are only in the MET repo are found relative to MET_TEST_BASE (MET/internal/test_unit). Also remove MET_BUILD_BASE env var (#3052)

* Bugfix #3054 main_v12.0 parusr (#3068)

* Per #3054, fix PARUSR BUFRLIB error by solving the upstream reference to temporary memory returned by c_str(). Store a copy of the temporary variable name in a string rather than a pointer to temporary memory. Note that I checked all other calls to c_str() in pb2nc.cc and found these 2 instances to be the only problematic ones. All others are used as arguments to functions for which a copy is made.

* Unrelated to #3054, but discovered while investigating the dtcenter/METplus#2875 discussion, the PairBase::calc_obs_summary() function loops over map entries and attempts to update the mapped 'summary_val' value. However, the current version only updates it in a copy and not what's actually in the map. This changes how we loop over the map to actually update its contents. Note that the only impact is fixing a log file to accurately report the 'summary_val'. So this is really a logging bug.

* Per #3054, revert emplace_back() to its original push_back() to make the bugfix diffs as limited as possible.

* Per #3054, correct bugfix in PairBase::calc_obs_summary() in pair_base.cc
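
The map-update problem described for PairBase::calc_obs_summary() can be sketched as follows. This is an illustration of the copy-versus-reference pitfall only; the functions `scale_by_copy` and `scale_in_place` and the map layout are assumptions, not the contents of pair_base.cc.

```cpp
#include <map>
#include <string>

// Hypothetical illustration, not MET code.
// Buggy: `auto entry` makes a COPY of each key/value pair, so the
// multiplication changes the copy and the map itself is untouched.
void scale_by_copy(std::map<std::string, double> &m, double factor) {
    for (auto entry : m) {
        entry.second *= factor;
    }
}

// Fixed: `auto &entry` binds to the element stored in the map,
// so the update is visible after the loop.
void scale_in_place(std::map<std::string, double> &m, double factor) {
    for (auto &entry : m) {
        entry.second *= factor;
    }
}
```

Calling `scale_by_copy` leaves every mapped value as it was, while `scale_in_place` changes the stored values, which matches the symptom described above where only a copy was being updated.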

---------

Co-authored-by: MET Tools Test Account <[email protected]>
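
The c_str() problem fixed for #3054 follows a common pattern, sketched here under stated assumptions. `make_var_name`, `VarEntry`, and `make_entry` are hypothetical names chosen for illustration and do not appear in pb2nc.cc.

```cpp
#include <string>

// Hypothetical helper that returns a temporary std::string by value.
std::string make_var_name(int index) {
    return "VAR_" + std::to_string(index);
}

// Risky pattern (shown as a comment because using it is undefined behavior):
//     const char *name = make_var_name(3).c_str();
//     // The temporary string is destroyed at the end of that statement,
//     // so `name` now points at freed memory.
//
// Safer pattern: keep an owning std::string copy for as long as it is needed.
struct VarEntry {
    std::string name;  // owns its characters
};

VarEntry make_entry(int index) {
    VarEntry entry;
    entry.name = make_var_name(index);  // copies or moves the temporary
    return entry;
}
```

Passing `make_var_name(3).c_str()` directly as a function argument is fine when the callee copies the characters before returning, because the temporary lives until the end of the full expression. That is consistent with the note above that the other c_str() calls were not problematic.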

* Per #3070, updates for the 12.0.1 bugfix release. (#3071)

* Updating file for 12.0.1 installation for NCO

* Updating to 12.0.1 for NCO

* Update and rename 12.0.0_acorn to 12.0.1_acorn for NCO

* Rename 12.0.0.lua_wcoss2 to 12.0.1.lua_wcoss2 for NCO

* Update 12.0.0_hercules

* Update install_met_env.hercules

* Update compiler and MET version in install_met_env.orion

* Update compiler and MET version in 12.0.0_orion

* Bugfix #3075 main_v12.0 optimization (#3076)

* Per #3075, update get_att_value_chars() utility function to store the attribute value in a string rather than a fixed-length character array for which overflow may occur.

* Per #3075, switch from compiling MET in Docker using the -g debug flag to using -O2 optimization since that's how we configure installations on supported platforms. This makes the testing environment more similar to the deployed versions. And we've found some bugs due to unexpected behavior when compiled with -O2 optimization.

* Per #3075, remove accidentally committed log file

* Per #3075, update TrackInfo::diag_name() to return a string rather than a pointer to temporary memory to solve the problem with diagnostic names in unit_tc_pairs.xml when compiled with optimization enabled.

* Per #3075, update read_netcdf_logic() to store pointers to class members rather than local variables which go out of scope.

* Per #3075, don't need to use local variables at all.

* Per #3075, switch to using STL vectors for memory management

* Per #3075, reimplement month_name_to_m() with stl strings to avoid variable length arrays.

* Per #3075, update MetNcFile::readFile() to use stl vectors instead of variable length arrays

* Per #3075, update NcCfFile member functions to use stl vectors instead of variable length arrays

* Per #3075, update is_netcdf_file() to use stl vectors instead of variable length arrays

* Per #3075, update 3d_conv.cc to use stl vectors instead of variable length arrays

* Per #3075, update the vx_util library to use stl vectors instead of variable length arrays

* Per #3075, update ensemble_stat to use stl vectors instead of variable length arrays

* Per #3075, update decode_lat_lon() to use stl vectors instead of variable length arrays

* Per #3075, update grid_diag to use stl vectors instead of variable length arrays

* Per #3075, update ioda2nc to use stl vectors instead of variable length arrays

* Per #3075, update madis2nc to use stl vectors instead of variable length arrays

* Per #3075, update mode_graphics to use stl vectors instead of variable length arrays

* Per #3075, update the vx_nc_obs library to use stl vectors instead of variable length arrays

* Per #3075, update plot_point_obs to use stl vectors instead of variable length arrays

* Per #3075, update point_stat to use stl vectors instead of variable length arrays

* Per #3075, update wavelet_stat to use stl vectors instead of variable length arrays

* Per #3075, no real code change, just whitespace

* Per #3075, removing commented out code

* Per #3075, need to add 2 to account for time_count being initialized to -1. An array of length 0 is different from a vector of length 0.

* Per #3075, can't use 2D vectors to read data from NetCDF files into a contiguous block of memory.

* Per #3075, can't use 2D vectors to read data from NetCDF files into a contiguous block of memory.

* Per #3075, update looping logic

* Per #3075, eliminate all instances of vector<vector<type>> since it's not stored in contiguous memory and therefore not useful for reading data from the NetCDF files.

* Per #3075, a few more madis2nc changes.

* Per #3075, fix Nx typo

* Per #3075, fix the NetCDF attribute character type

* Per #3075, minor changes to satisfy SonarQube findings.

* Per #3075, when sizing vectors of type <char> add one for the trailing null.

* Per #3075, remove debugging code.

* Per #3075, unit_ioda2nc.xml fails when compiled with Intel since there are issues parsing NC_STRING attribute types. Reverting back to the previous logic from main_v12.0 since that works.

* Per #3075, back out the change to using -O2 in development.docker. With it, differences are flagged by GHA. I'd like to make sure all the changes on this branch cause NO differences before switching to using -O2... most likely in the develop branch rather than main_v12.0.

* Per #3075, getting segfault from point2grid. Null terminating character vectors after reading NetCDF attributes just to be safe.

* Unrelated to #3075, only whitespace changes.

* Per #3075, fix logic of the write_nc(...) function so that all variable attributes are added and defined prior to writing the data for that variable. Writing attributes AFTER the data, as we had been doing, causes unexpected failures, as found when compiled with Intel.

* Per #3075, update args to write_nc(...) to minimize regression test diffs.

* Per #3075, fix madis2nc i_buf definition problem.

* Per #3075, more refinement of i_buf definition in madis2nc for acars and raob inputs.

* Per #3075, remove FFLAGS from development.docker because there's no good reason to add it.

* replace raw array with std array

* replace raw array with std array in 2 other places

* Per #3075, fix clear bug in vx_bool_calc/tokenizer.cc where the = assignment operator is used when the == comparison operator is needed.

---------

Co-authored-by: George McCabe <[email protected]>
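
Several commits above note that `vector<vector<type>>` cannot be used to read NetCDF data into a contiguous block of memory. A minimal sketch of the usual alternative, a single flat vector with manual 2D indexing, is shown below. `Grid2D` and `read_block` are hypothetical names, and `read_block` merely stands in for a library call that writes into one contiguous buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2D grid stored row-major in ONE contiguous vector.
// A vector<vector<float>> would allocate each row separately, so a
// single bulk read could not fill all rows through one pointer.
struct Grid2D {
    std::size_t nx;
    std::size_t ny;
    std::vector<float> data;

    Grid2D(std::size_t nx_in, std::size_t ny_in)
        : nx(nx_in), ny(ny_in), data(nx_in * ny_in, 0.0f) {}

    // Element at column ix, row iy (no bounds checking in this sketch).
    float &at(std::size_t ix, std::size_t iy) { return data[iy * nx + ix]; }
};

// Stand-in for a library read that writes `count` floats to one block.
void read_block(float *dest, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        dest[i] = static_cast<float>(i);
    }
}
```

A caller would construct the grid, pass `grid.data.data()` and `grid.data.size()` to the read call, and then index with `grid.at(ix, iy)`. The same idea applies to character buffers, where the commits above size the vector with one extra element for the trailing null.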

* Per #3075, adding a hotfix to main_v12.0 that should have been included in the #3076 pull request. The logic for converting to camel case was in the wrong order and the details of this are described in this comment: #3078 (comment)

* Bugfix #3077 main_v12.0 yyerror (#3083)

* Per #3077, back out the yystate patching logic added to Makefile.am for MET #2408 to allow for the parsing of an empty configuration file. This has caused a 'shift/reduce conflict' warning message from the Intel compiler, is flagged as a problem by '-fsanitize=address', and may well be causing the sporadic yyerror failures described in #3077. Recommend testing with this change to see if the yyerror failures go away, but also re-testing MET #2408 to assess the handling of empty configuration files.

* Per #3075, update config.tab.yy/.cc to allow for empty inputs by allowing for an empty input in the statement_list.

* Feature #3081 v12.0.2 (#3085)

* Per #3081, roll the MET version number from 12.0.1 to 12.0.2 and add release notes.

* Per #3081, improve the title for issue #3075

---------

Co-authored-by: George McCabe <[email protected]>
Co-authored-by: Julie Prestopnik <[email protected]>
Co-authored-by: John Halley Gotway <[email protected]>
Co-authored-by: MET Tools Test Account <[email protected]>
Co-authored-by: metplus-bot <[email protected]>
JohnHalleyGotway added a commit that referenced this issue Feb 20, 2025
* Per #3087, update logic in VarInfoNcMet::set_magic(...) to actually store the requested level string to allow for discriminating between multiple U/V vertical level matches.

* Unrelated to #3087, delete unneeded 'int errno;' local variable from temp_file.cc that caused an unexpected compilation error with GCC 9.4.0 on Ubuntu as described in the dtcenter/METplus#2897 discussion.

* Per #3087, tweak logic to handle '*' and resolve regression test differences.

* Per #3087, add regrid_data_plane and grid_stat unit tests to demonstrate creating vector pairs at multiple levels from NetCDF input files.

* Per #3087, forgot to add the Grid-Stat config file needed for the new unit test.

* Per #3075, refine the names and logic of the new tests.

* Per #3087, update ConcatString class to simplify from a pointer to a string to just a string itself. This is based on SonarQube code smells, but the implementation is much simpler and easier to maintain.

* Per #3087, drive down a few more SonarQube code smells.

* Per #3087, back out the ConcatString changes to switch from enum to enum class and the use of explicit since those had huge and wide-ranging impacts. Touching that many files is not worth it to reduce SonarQube code smells.

* Per #3087, modify the existing point2grid_pb2nc_big_input test in unit_point2grid.xml by switching from requesting the 'Z2' level to using '*', like all the other similar point2grid tests. Note that I DID actually test to confirm that 'Z2' and '*' produce the same result. So specifying Z2 does NOT actually filter the obs data as you'd expect it would. With this change, the diff of the output from the test should go away for PR #3088.

* Unrelated to #3087, but pointed out by @j-opatz, removing an outdated line from the Ensemble-Stat chapter of the MET User's Guide referencing the 'ens' dictionary which was removed at the same time Gen-Ens-Prod was created.