RELAX_COORD_BOUND should be always used and should be removed as an option #1010

edhartnett · 2018-05-31T17:59:34Z

As I have been working on this vars code I have had a chance to look into RELAX_COORD_BOUND.

From configure.ac:

AC_ARG_ENABLE([zero-length-coord-bound],
   [AS_HELP_STRING([--enable-zero-length-coord-bound],
                   [Enable a more relaxed boundary error check NC_EINVALCOORDS
                    to allow coordinate start argument equal to dimension size
                    when argument count is zero. @<:@default: disabled@:>@])],
   [enable_zero_length_coord_bound=${enableval}], [enable_zero_length_coord_bound=no]
)

if test "x$enable_zero_length_coord_bound" = xyes; then
   AC_DEFINE([RELAX_COORD_BOUND], [1], [if true, NC_EINVALCOORDS check is more relaxed])
fi

To summarize the history, the pnetcdf team and other parallel I/O programmers requested a relaxation of coordinate boundary checking to accommodate counts of 0. They submitted a PR that was made an optional change for netCDF at build time. If RELAX_COORD_BOUND is set, then coordinate boundary checking is relaxed enough to accommodate 0 counts.

The problem is, now a user can write code that works on one machine with RELAX_COORD_BOUND, but breaks on another, which does not have RELAX_COORD_BOUND. And that will be very mysterious. It breaks netCDF API compatibility for HPC users.

The reason this was made optional was to maintain compatibility with existing tests, particularly the difficult nc_test program. And I certainly understand the reluctance to attempt changes to that test code. But this is a cure worse than the disease.

The netcdf rules for coordinate boundary checking were written before parallel I/O. With parallel I/O it is quite reasonable to allow some tasks to have counts of 0. The relaxed coordinate boundary rules of pnetcdf are quite reasonable, and fine for sequential netCDF as well.

Since surely no one is depending on getting an error code, we should make RELAX_COORD_BOUND permanent, and modify our too-strict tests.

If no one objects, then I will make the changes as part of my vars work.

Some history:
https://github.com/Unidata/netcdf-c/pull/319/files
#243

The text was updated successfully, but these errors were encountered:

edhartnett · 2018-05-31T18:12:03Z

One clue that the current approach is incorrect is the fact that the build does not work if I use --enable-zero-length-coord-bound.

If an option is added, then tests must be modified in addition to library code, so that tests still pass with the configure option.

(OK. I take that back. The build is only failing because of the new vars functions. Without them, the build does pass with --enable-zero-length-coord-bound and I see that the tests have been modified to accommodate this option.)

edhartnett · 2018-05-31T21:03:05Z

@wkliao and @gsjaardema does pnetcdf offer this option only to remain compatible with netCDF? That is, would pnetcdf prefer to always use relaxed coordinate bounds checking? Or is there some use-case you know of where the stricter checking is necessary?

wkliao · 2018-06-01T04:16:18Z

This feature was developed per Greg's request. It was added to both PnetCDF and NetCDF around the same time. In #243, Greg mentioned a use case.
I am not familiar with the "new vars functions" you mentioned. I assume it is for NetCDF4 files, not classic files. For classic files, I believe both NetCDF and PnetCDF share the same m4 files in nc_test and should run fine for all vars APIs.

edhartnett · 2018-06-01T14:37:35Z

@wkliao in #243 Greg outlines a compelling use case for relaxing the coordinate bound checking.

But there is no use case for not relaxing the bounds checking. It should be relaxed always, not just on some builds.

wkliao · 2018-06-01T16:16:15Z

The decision to make it a configure time option is the back compatibility issue.
There may be some applications expecting NC_EINVALCOORDS or NC_EEDGE.
Greg is in a better position answer your question.

DennisHeimbigner · 2018-06-03T18:29:30Z

Let me propose a set of assertions about this issue.

The netcdf-3 specification
(https://cdn.earthdata.nasa.gov/conduit/upload/496/ESDS-RFC-011v2.00.pdf)
says that dimension lengths must be >= 0, but that only
the record dimension is allowed to have length == 0. [Search for "dim_length"
in the BNF and look at the comment there].
The netcdf 3 format differs from the cdf5 format in that cdf5 non-record dimensions can
have size zero.
If RELAX_COORD_BOUNDS is set, then it can produce a file whose format technically
does not conform to the netcdf 3 standard document.
The user might experience this difference in that the get/put functions would succeed
if RELAX_COORD_BOUNDS is set and zero length dimensions are present, but would
return an error if RELAX_COORD_BOUNDS is not set.
It would be more correct if at run-time zero length dimensions would be allowed
if and only if a cdf5 format file is being read/written.

edhartnett · 2018-06-03T20:14:44Z

Not sure the section of the specification you cite is relevant. Relaxing the coord bound does not change what kind of file can be created. A dimension size of 0 is still not allowed for non-record dimensions.

(It's interesting that cdf5 allows this. However, seems like you could not create a CDF5 file with with netcdf with a non-record dimension of 0 because nc_def_dim() should reject it.)

With relaxed coord bounds, the dimension does not get a length of 0. What is 0 is the count being read/written in an nc_get/put_vara_* call. It changes whether a get/put operation will return error when asked to read/write 0 elements of data, or just return NC_NOERR and do no read/write.

In terms of parallel I/O it is reasonable for some processes to use counts of 0. For collective I/O, it is required to do so, if there is not data to read/write from all processes. The processes that are not reading/writing have to do something (or the whole app will hang), so what they do is a read/write with a count of 0. This causes the netcdf-c code to create an empty HDF5 dataspace on that process, and do the read/write. (It is a perfectly valid operation in HDF5 to do a read/write for 0 elements.) Other processes in the read/write have non-zero counts, and create non-empty dataspaces for thier reads/writes.

It's extremely unlikely to the point of not worth worrying about that a sequential netCDF user would write code that depends on counts of 0 being rejected instead of ignored. What would be the reason for such code?

Furthermore, the fact that the counts of 0 are no longer rejected will not cause any change in the netCDF data operations for existing code. Whether the get/put function returns error or NC_NOERR, no data are read/written, because one of the counts is 0.

Therefore this cannot break any operational user code. If they are currently getting an error (and ignoring it), then after this change they will not get an error, but their file input/output will remain the same. If they are catching the error and handling it, then they now will not have an error to handle, but the file will still remain as they expect.

Coord bounds are going to have to be relaxed for all parallel I/O operations, not just pnetcdf. What we are discussing is whether they should also be relaxed for sequential reads/writes. Does maintaining two methods of checking coord bounds make sense? Seems like we could all switch to relaxed coord bound checking everywhere.

Although the travis runs catch some things, and my CI system catches other things, currently no one is testing with relaxed coord bounds. That's a not ideal, considering that this code is at the heart of all read/write operations.

A configure option doubles the amount of testing that should be done. Relaxing coord bounds is something that should be done always, to help accommodate HPC needs, and minimize testing, development, and maintenance efforts.

Let's eliminate this unneeded option instead of working around it for the next decade. It's doing no good.

DennisHeimbigner · 2018-06-03T20:23:38Z

I just wanted to be clear about the zero dimension issue.
The key problem I have is that I just can not conceive of a legitimate situation in which
one of count dimensions has the value zero. It just makes no sense to me.

DennisHeimbigner · 2018-06-03T20:27:36Z

Correct about the zero dimension size, but I can see no other way in which a zero
count could occur.

... If they are currently getting an error (and ignoring it), then after this change they will
not get an error, but their file input/output will remain the same.
But the point is that getting an error is the important point. It means they user is doing
something wrong. That is very different than having the code silently succeed and
mis-leading the user.

DennisHeimbigner · 2018-06-03T20:33:44Z

To be clear, we can do this without technically breaking our backward compatibility
guarantee, at least as long as we continue to disallow zero size dimensions.
The guarantee is only that we can continue to read all previous files. Unidata
has AFAIK never guaranteed that the API will be backward compatible. It is desirable,
but not required.

edhartnett · 2018-06-03T21:01:04Z

Here is how a zero count can occur.

4 tasks are doing I/O, and they are in collective mode (the default).

They all execute code like this:

nc_create_par(FILE, NC_NETCDF4|NC_MPIIO, comm, info, &ncid)
nc_def_dim(ncid, DIM_NAME, 100, &dimid)
nc_def_var(ncid, VAR_NAME, NC_INT, 1, &dimid, &varid)
nc_enddef(ncid)

So a dim of length 100 has been defined, and a var that uses it. Since metadata is always a collective operation, all 4 tasks must execute the above code.

Now task 0 wants to fill the whole variable's data. Task 0 can do this:

int data[100] = {1, 2, ...};
size_t start[1] = {0}, count[1] = {100};
nc_put_vara_int(ncid, varid, start, count, data);

But the whole application will hang if only task 0 does the nc_put_var_int(), because these are collective operations, which means that all tasks on the communicator must participate. So tasks 1, 2, and 3 also have to do a nc_put_vara_int() call, to the same variable. But they don't have any data to write, so they use a count of 0.

size_t start[1] = {0}, count[1] = {0};
nc_put_vara_int(ncid, varid, start, count, data);

Now every task has called nc_put_vara_int(), as required for a collective operation, and all get NC_NOERR. In practice this will act as an MPI_Barrier() in the code, as all tasks will wait until all other tasks get to the nc_put_vara_int(), and execute it together. Task 0 will do all the work, and the others will just wait. When the collective operation is complete, all tasks will proceed with execution.

So that is why we need counts of 0. For advanced parallel I/O, it's not unusual for only some of the tasks to have data to read/write.

edhartnett · 2018-06-03T21:12:39Z

In terms of backward compatibility, we have indeed promised API compatibility as well as binary format compatibility.

By relaxing the error code checking we do not break API backwards compatibility. Every program that used to work is guaranteed to still work. If we were making the error checking more strict, we would be breaking backward compatibility because working programs would no longer work.

The current state of affairs seems like it's flirting with API incompatibility. A HPC programmer on a system with RELAX_COORD_BOUNDS set may develop some parallel I/O code. Moving to another system (where RELAX_COORD_BOUNDS is not set) will result in getting errors unexpectedly. (And quite confusingly too.)

wkliao · 2018-06-06T07:55:48Z

The netcdf 3 format differs from the cdf5 format in that cdf5 non-record dimensions can
have size zero.

This is incorrect. CDF-5 behaves just like CDF-1 and 2.

@edhartnett, please check the PnetCDF 1.8.0 release note. Both NetCDF and PnetCDF return the error code NC_EINVALCOORDS when the value of start is larger than or equal to the defined dimension size, no matter what count value is. When --enable-relax-coord-bound is used, start can be equal to dimension size only if count is zero. It is not whenever count is zero.

I try to make PnetCDF consistent with NetCDF, including error codes. If NetCDF folks decide to change the default behavior, I can make PnetCDF do the same.

To ease the backward compatibility issue, I plan to create a new PnetCDF hint that can be set at the run time to enable/disable this feature. So users do not have to modify their programs.

DennisHeimbigner · 2018-06-06T17:02:33Z

I stand corrected.

edhartnett · 2018-06-06T17:25:47Z

OK I am not hearing any dissent in terms of making the RELAX_COORD_BOUNDS be always in effect. After we do so, pnetcdf can do the same.

WardF · 2018-06-06T20:29:48Z

I'd like to get input from @czender as well; we've inadvertently broken NCO-based workflows in the past. Assuming no problem there, we can change this behavior.

czender · 2018-06-06T20:47:22Z

Thanks for checking. I do not foresee that this will have any noticeable affect on NCO workflows. I would like to check with NCO "make test" once this patch lands in master. Please ping me then.

edhartnett · 2018-06-07T16:07:43Z

Excellent. Thanks Charlie! ;-)

This change is needed for Dennis' vars performance enhancement to work well with parallel I/O. I will package this with the vars changes.

edhartnett · 2018-06-07T16:27:49Z

Actually there is a little complexity here, so I am going to start by just changing the default so that RELAX_COORD_BOUND is enabled. We will take it one step at a time...

WardF · 2018-06-07T19:05:25Z

Talking with @DennisHeimbigner, we are good with relaxing coordinate bounds by default, but we'd like to retain the option to turn the functionality off, at least in the near term, for debugging purposes if nothing else; hedging our bets against corner cases, so to speak.

edhartnett · 2018-06-07T22:31:55Z

Yes, I agree. We will take it slow. ;-)

edhartnett · 2018-06-22T13:28:12Z

Another aspect of the current configuration that is not ideal is that the setting is changed based on the setting of pnetcdf.

So the user may do --enable-zero-length-coord-bound and then have it disabled during configure, based on the fact that pnetcdf does not have zero-length-coord-bound enabled.

This will lead to confusing errors for the user, who has every reason to believe that the build worked, and gave what was requested, which was a build with zero-length-coord-bound enabled.

I am not going to fix this problem in this case, because we are taking this option out in the next release, so changing this behavior would only cause a lot of user confusion. Both we and the pnetcdf are changing the default value of this option so for a while netcdf users will see pnetcdfs which do not have this option enabled, and then, eventually, it will be enabled everywhere.

When asked to do something, the configure script should either do it, or error out. It should not try to be smarter than the user by ignoring or changing enable options that the user has requested.

wkliao · 2018-06-22T15:59:53Z

I believe the current setting will print a warning message when a conflict is detected, but error out is a better solution.

edhartnett · 2018-11-09T16:26:53Z

OK we have had relaxed coord bounds as the default in master for a few months and no problems.

I suggest that after the 4.6.2 release, I remove this option, as discussed above, and relaxed coord bounds will always be in place, with no option to turn them off.

wkliao · 2018-11-17T19:39:57Z

Is completely removing this option necessary?

edhartnett · 2018-11-17T19:59:26Z

Is there a reason to keep it?

I think we discussed it pretty thoroughly and decided the coord bounds checking should always be relaxed. So what need for an option?

wkliao · 2018-11-17T20:04:43Z

Existing applications using older netcdf versions may still check this error.

edhartnett · 2018-11-18T01:31:55Z

Removing this error cannot affect existing applications. See the discussion above.

wkliao · 2018-11-18T03:58:21Z

App may be designed to follow the old rules and spew NC_EINVALCOORDS as a stop condition.
To me, making RELAX_COORD_BOUND the default is sufficient.

WardF · 2018-11-18T04:11:20Z

No harm is done leaving this option in imo, in the absence of a pressing reason to remove it. In the event we decide to remove it, it would be marked deprecated, for a few releases at least, so that our community (corner cases or no) aren’t blindsided.

edhartnett · 2018-11-18T16:17:16Z

@wkliao that's not possible. Error conditions are being relaxed, not tightened, so the spew of NC_EINVALCOORDS would have occurred without relaxed coord bound checking, not with it. So loosening the checking is not going to cause new errors, and no existing code could be writing data with such calls, because such code would be returning errors.

@WardF the ability to turn off relaxed coord bounds checking is a bug. It needs to be removed.

The reason to remove this option is that it breaks netcdf compatibility across installations. I can write code that works fine on my HPC, but if I move it to a machine where someone has changed this at configure, then that code will break.

So if some poor NOAA programmer is trying to run a model on a workstation, which should work just fine, the model will crash with puzzling IO errors. But there is no reason - the code that works on the HPC will work just fine on the workstation as well, if we don't let hapless installers inadvertently break it.

(Recall that most science users will not compile netCDF, or read the compile documentation. They just use what has been installed. Not only that, they also don't write the I/O code, they just use what is in the model. So the end user will have no clue what is happening.)

NetCDF guarantees API compatibility, and non-relaxed coordinate bounds break that. So relaxed coord bound checking is necessary everywhere. We should not allow the user to turn it off, ever.

edhartnett · 2018-12-03T11:51:34Z

@wkliao I see that pnetcdf 1.9.0 has relaxed coord bound checking disabled by default.

Now that relaxed coord bound checking is enabled by netCDF-4 by default (and soon will be required in all cases) I assume you are going to change the pnetcdf default as well.

wkliao · 2018-12-03T18:14:48Z

In PnetCDF 1.10.0, the default has been changed to enabled. Please check its release note.

I still think completely removing this configure-time option is not necessary, because historically all netCDF documents clearly define the rule for error code NC_EINVALCOORDS. Make this change as NetCDF default behavior has already broken that rule established there for more than 10 years.

When I created the patch, I need to modify nc_test as it is expecting NC_EINVALCOORDS in a few places. Below is one of changes.

netcdf-c/nc_test/test_put.m4

Lines 781 to 787 in 5d8f900

    
           #ifdef RELAX_COORD_BOUND 
        
                       IF (err != NC_NOERR) /* allowed when edge[j]==0 */ 
        
                           EXPECT_ERR(NC_NOERR, err) 
        
           #else 
        
                       IF (err != NC_EINVALCOORDS) /* not allowed even when edge[j]==0 */ 
        
                           EXPECT_ERR(NC_EINVALCOORDS, err) 
        
           #endif

It is possible there are existing applications that use the same practice. I agree with @WardF we should be cautious to avoid breaking existing workflows.

edhartnett · 2019-01-22T17:57:38Z

As agreed I have modified master so that being able to turn off relaxed coordinate bounds checking is deprecated and the user is told not to do it. I will remove the ability to disable relaxed coord bounds after the next release.

Turning off relaxed bounds checking is wrong and a bug. It should not be allowed under any circumstances. As already discussed at length this cannot break existing code, except code that was written to expect an error, which cannot work and effect changes on the file. (That is, it is incorrect netCDF API code).

The nc_test/test_put code should not be used as an example of what a user might do. Nor should it be used as a reason to make or not make changes. The test_put code can, should, and does get changed as we fix bugs and make changes. It also does a lot of things that no user will do.

No user code will exist which probes netCDF error conditions, like nc_test does. I want to know that such code exists in the real world before we start making decisions to protect such code.

Actually I don't get why there is so much argument here. It is clearly a bug to allow two different interpretations of how 0 count is treated. Allowing two interpretations allows the user to create code that works on one HPC installation, but mysteriously fails on another HPC installation. What a pain.

WardF · 2019-01-22T17:58:50Z

Thanks for making the changes.

WardF self-assigned this Jun 6, 2018

WardF added the type/feature request label Jun 6, 2018

WardF modified the milestones: future, 4.7.0 Jun 6, 2018

wkliao mentioned this issue Jul 17, 2018

error out when conflicts with PnetCDF are detected #1070

Merged

edhartnett mentioned this issue Dec 3, 2018

deprecate non-relax coord bounds checking, also fix parallel builds for --enable-pnetcdf #1238

Merged

WardF closed this as completed Jan 23, 2019

edhartnett mentioned this issue Feb 20, 2019

Preparing Maintenance Release #1322

Merged

edhartnett mentioned this issue Mar 8, 2019

removed unrelaxed coord bounds #1354

Closed

jrood-nrel mentioned this issue May 10, 2019

Add enable-relax-coord-bound to parallel-netcdf if version 1.8 or greater spack/spack#11420

Merged

edhartnett mentioned this issue Nov 12, 2019

can we now remove non-relaxed coord bounds? #1519

Closed

ellenjohnson mentioned this issue Feb 24, 2021

Mandatory relaxed coordinate bound checking in 4.7.4? #1951

Closed

RELAX_COORD_BOUND should be always used and should be removed as an option #1010

RELAX_COORD_BOUND should be always used and should be removed as an option #1010

Comments

edhartnett commented May 31, 2018

edhartnett commented May 31, 2018 • edited Loading

edhartnett commented May 31, 2018

wkliao commented Jun 1, 2018

edhartnett commented Jun 1, 2018

wkliao commented Jun 1, 2018

DennisHeimbigner commented Jun 3, 2018

edhartnett commented Jun 3, 2018

DennisHeimbigner commented Jun 3, 2018

DennisHeimbigner commented Jun 3, 2018

DennisHeimbigner commented Jun 3, 2018

edhartnett commented Jun 3, 2018 • edited Loading

edhartnett commented Jun 3, 2018

wkliao commented Jun 6, 2018

DennisHeimbigner commented Jun 6, 2018

edhartnett commented Jun 6, 2018

WardF commented Jun 6, 2018

czender commented Jun 6, 2018

edhartnett commented Jun 7, 2018

edhartnett commented Jun 7, 2018

WardF commented Jun 7, 2018

edhartnett commented Jun 7, 2018

edhartnett commented Jun 22, 2018

wkliao commented Jun 22, 2018

edhartnett commented Nov 9, 2018

wkliao commented Nov 17, 2018

edhartnett commented Nov 17, 2018 • edited Loading

wkliao commented Nov 17, 2018

edhartnett commented Nov 18, 2018

wkliao commented Nov 18, 2018

WardF commented Nov 18, 2018

edhartnett commented Nov 18, 2018 • edited Loading

edhartnett commented Dec 3, 2018 • edited Loading

wkliao commented Dec 3, 2018

edhartnett commented Jan 22, 2019 • edited Loading

WardF commented Jan 22, 2019

edhartnett commented May 31, 2018 •

edited

Loading

edhartnett commented Jun 3, 2018 •

edited

Loading

edhartnett commented Nov 17, 2018 •

edited

Loading

edhartnett commented Nov 18, 2018 •

edited

Loading

edhartnett commented Dec 3, 2018 •

edited

Loading

edhartnett commented Jan 22, 2019 •

edited

Loading