Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RELAX_COORD_BOUND should be always used and should be removed as an option #1010

Closed
edhartnett opened this issue May 31, 2018 · 35 comments
Closed
Assignees
Milestone

Comments

@edhartnett
Copy link
Contributor

As I have been working on this vars code I have had a chance to look into RELAX_COORD_BOUND.

From configure.ac:

AC_ARG_ENABLE([zero-length-coord-bound],
   [AS_HELP_STRING([--enable-zero-length-coord-bound],
                   [Enable a more relaxed boundary error check NC_EINVALCOORDS
                    to allow coordinate start argument equal to dimension size
                    when argument count is zero. @<:@default: disabled@:>@])],
   [enable_zero_length_coord_bound=${enableval}], [enable_zero_length_coord_bound=no]
)

if test "x$enable_zero_length_coord_bound" = xyes; then
   AC_DEFINE([RELAX_COORD_BOUND], [1], [if true, NC_EINVALCOORDS check is more relaxed])
fi

To summarize the history, the pnetcdf team and other parallel I/O programmers requested a relaxation of coordinate boundary checking to accommodate counts of 0. They submitted a PR that was made an optional change for netCDF at build time. If RELAX_COORD_BOUND is set, then coordinate boundary checking is relaxed enough to accommodate 0 counts.

The problem is, now a user can write code that works on one machine with RELAX_COORD_BOUND, but breaks on another, which does not have RELAX_COORD_BOUND. And that will be very mysterious. It breaks netCDF API compatibility for HPC users.

The reason this was made optional was to maintain compatibility with existing tests, particularly the difficult nc_test program. And I certainly understand the reluctance to attempt changes to that test code. But this is a cure worse than the disease.

The netcdf rules for coordinate boundary checking were written before parallel I/O. With parallel I/O it is quite reasonable to allow some tasks to have counts of 0. The relaxed coordinate boundary rules of pnetcdf are quite reasonable, and fine for sequential netCDF as well.

Since surely no one is depending on getting an error code, we should make RELAX_COORD_BOUND permanent, and modify our too-strict tests.

If no one objects, then I will make the changes as part of my vars work.

Some history:
https://github.com/Unidata/netcdf-c/pull/319/files
#243

@edhartnett
Copy link
Contributor Author

edhartnett commented May 31, 2018

One clue that the current approach is incorrect is the fact that the build does not work if I use --enable-zero-length-coord-bound.

If an option is added, then tests must be modified in addition to library code, so that tests still pass with the configure option.

(OK. I take that back. The build is only failing because of the new vars functions. Without them, the build does pass with --enable-zero-length-coord-bound and I see that the tests have been modified to accommodate this option.)

@edhartnett
Copy link
Contributor Author

@wkliao and @gsjaardema does pnetcdf offer this option only to remain compatible with netCDF? That is, would pnetcdf prefer to always use relaxed coordinate bounds checking? Or is there some use-case you know of where the stricter checking is necessary?

@wkliao
Copy link
Contributor

wkliao commented Jun 1, 2018

This feature was developed per Greg's request. It was added to both PnetCDF and NetCDF around the same time. In #243, Greg mentioned a use case.
I am not familiar with the "new vars functions" you mentioned. I assume it is for NetCDF4 files, not classic files. For classic files, I believe both NetCDF and PnetCDF share the same m4 files in nc_test and should run fine for all vars APIs.

@edhartnett
Copy link
Contributor Author

@wkliao in #243 Greg outlines a compelling use case for relaxing the coordinate bound checking.

But there is no use case for not relaxing the bounds checking. It should be relaxed always, not just on some builds.

@wkliao
Copy link
Contributor

wkliao commented Jun 1, 2018

The decision to make it a configure time option is the back compatibility issue.
There may be some applications expecting NC_EINVALCOORDS or NC_EEDGE.
Greg is in a better position answer your question.

@DennisHeimbigner
Copy link
Collaborator

Let me propose a set of assertions about this issue.

  1. The netcdf-3 specification
    (https://cdn.earthdata.nasa.gov/conduit/upload/496/ESDS-RFC-011v2.00.pdf)
    says that dimension lengths must be >= 0, but that only
    the record dimension is allowed to have length == 0. [Search for "dim_length"
    in the BNF and look at the comment there].
  2. The netcdf 3 format differs from the cdf5 format in that cdf5 non-record dimensions can
    have size zero.
  3. If RELAX_COORD_BOUNDS is set, then it can produce a file whose format technically
    does not conform to the netcdf 3 standard document.
  4. The user might experience this difference in that the get/put functions would succeed
    if RELAX_COORD_BOUNDS is set and zero length dimensions are present, but would
    return an error if RELAX_COORD_BOUNDS is not set.
  5. It would be more correct if at run-time zero length dimensions would be allowed
    if and only if a cdf5 format file is being read/written.

@edhartnett
Copy link
Contributor Author

Not sure the section of the specification you cite is relevant. Relaxing the coord bound does not change what kind of file can be created. A dimension size of 0 is still not allowed for non-record dimensions.

(It's interesting that cdf5 allows this. However, seems like you could not create a CDF5 file with with netcdf with a non-record dimension of 0 because nc_def_dim() should reject it.)

With relaxed coord bounds, the dimension does not get a length of 0. What is 0 is the count being read/written in an nc_get/put_vara_* call. It changes whether a get/put operation will return error when asked to read/write 0 elements of data, or just return NC_NOERR and do no read/write.

In terms of parallel I/O it is reasonable for some processes to use counts of 0. For collective I/O, it is required to do so, if there is not data to read/write from all processes. The processes that are not reading/writing have to do something (or the whole app will hang), so what they do is a read/write with a count of 0. This causes the netcdf-c code to create an empty HDF5 dataspace on that process, and do the read/write. (It is a perfectly valid operation in HDF5 to do a read/write for 0 elements.) Other processes in the read/write have non-zero counts, and create non-empty dataspaces for thier reads/writes.

It's extremely unlikely to the point of not worth worrying about that a sequential netCDF user would write code that depends on counts of 0 being rejected instead of ignored. What would be the reason for such code?

Furthermore, the fact that the counts of 0 are no longer rejected will not cause any change in the netCDF data operations for existing code. Whether the get/put function returns error or NC_NOERR, no data are read/written, because one of the counts is 0.

Therefore this cannot break any operational user code. If they are currently getting an error (and ignoring it), then after this change they will not get an error, but their file input/output will remain the same. If they are catching the error and handling it, then they now will not have an error to handle, but the file will still remain as they expect.

Coord bounds are going to have to be relaxed for all parallel I/O operations, not just pnetcdf. What we are discussing is whether they should also be relaxed for sequential reads/writes. Does maintaining two methods of checking coord bounds make sense? Seems like we could all switch to relaxed coord bound checking everywhere.

Although the travis runs catch some things, and my CI system catches other things, currently no one is testing with relaxed coord bounds. That's a not ideal, considering that this code is at the heart of all read/write operations.

A configure option doubles the amount of testing that should be done. Relaxing coord bounds is something that should be done always, to help accommodate HPC needs, and minimize testing, development, and maintenance efforts.

Let's eliminate this unneeded option instead of working around it for the next decade. It's doing no good.

@DennisHeimbigner
Copy link
Collaborator

I just wanted to be clear about the zero dimension issue.
The key problem I have is that I just can not conceive of a legitimate situation in which
one of count dimensions has the value zero. It just makes no sense to me.

@DennisHeimbigner
Copy link
Collaborator

Correct about the zero dimension size, but I can see no other way in which a zero
count could occur.

... If they are currently getting an error (and ignoring it), then after this change they will
not get an error, but their file input/output will remain the same.
But the point is that getting an error is the important point. It means they user is doing
something wrong. That is very different than having the code silently succeed and
mis-leading the user.

@DennisHeimbigner
Copy link
Collaborator

To be clear, we can do this without technically breaking our backward compatibility
guarantee, at least as long as we continue to disallow zero size dimensions.
The guarantee is only that we can continue to read all previous files. Unidata
has AFAIK never guaranteed that the API will be backward compatible. It is desirable,
but not required.

@edhartnett
Copy link
Contributor Author

edhartnett commented Jun 3, 2018

Here is how a zero count can occur.

4 tasks are doing I/O, and they are in collective mode (the default).

They all execute code like this:

nc_create_par(FILE, NC_NETCDF4|NC_MPIIO, comm, info, &ncid)
nc_def_dim(ncid, DIM_NAME, 100, &dimid)
nc_def_var(ncid, VAR_NAME, NC_INT, 1, &dimid, &varid)
nc_enddef(ncid)

So a dim of length 100 has been defined, and a var that uses it. Since metadata is always a collective operation, all 4 tasks must execute the above code.

Now task 0 wants to fill the whole variable's data. Task 0 can do this:

int data[100] = {1, 2, ...};
size_t start[1] = {0}, count[1] = {100};
nc_put_vara_int(ncid, varid, start, count, data);

But the whole application will hang if only task 0 does the nc_put_var_int(), because these are collective operations, which means that all tasks on the communicator must participate. So tasks 1, 2, and 3 also have to do a nc_put_vara_int() call, to the same variable. But they don't have any data to write, so they use a count of 0.

size_t start[1] = {0}, count[1] = {0};
nc_put_vara_int(ncid, varid, start, count, data);

Now every task has called nc_put_vara_int(), as required for a collective operation, and all get NC_NOERR. In practice this will act as an MPI_Barrier() in the code, as all tasks will wait until all other tasks get to the nc_put_vara_int(), and execute it together. Task 0 will do all the work, and the others will just wait. When the collective operation is complete, all tasks will proceed with execution.

So that is why we need counts of 0. For advanced parallel I/O, it's not unusual for only some of the tasks to have data to read/write.

@edhartnett
Copy link
Contributor Author

In terms of backward compatibility, we have indeed promised API compatibility as well as binary format compatibility.

By relaxing the error code checking we do not break API backwards compatibility. Every program that used to work is guaranteed to still work. If we were making the error checking more strict, we would be breaking backward compatibility because working programs would no longer work.

The current state of affairs seems like it's flirting with API incompatibility. A HPC programmer on a system with RELAX_COORD_BOUNDS set may develop some parallel I/O code. Moving to another system (where RELAX_COORD_BOUNDS is not set) will result in getting errors unexpectedly. (And quite confusingly too.)

@wkliao
Copy link
Contributor

wkliao commented Jun 6, 2018

The netcdf 3 format differs from the cdf5 format in that cdf5 non-record dimensions can
have size zero.

This is incorrect. CDF-5 behaves just like CDF-1 and 2.

@edhartnett, please check the PnetCDF 1.8.0 release note. Both NetCDF and PnetCDF return the error code NC_EINVALCOORDS when the value of start is larger than or equal to the defined dimension size, no matter what count value is. When --enable-relax-coord-bound is used, start can be equal to dimension size only if count is zero. It is not whenever count is zero.

I try to make PnetCDF consistent with NetCDF, including error codes. If NetCDF folks decide to change the default behavior, I can make PnetCDF do the same.

To ease the backward compatibility issue, I plan to create a new PnetCDF hint that can be set at the run time to enable/disable this feature. So users do not have to modify their programs.

@DennisHeimbigner
Copy link
Collaborator

I stand corrected.

@edhartnett
Copy link
Contributor Author

OK I am not hearing any dissent in terms of making the RELAX_COORD_BOUNDS be always in effect. After we do so, pnetcdf can do the same.

@WardF
Copy link
Member

WardF commented Jun 6, 2018

I'd like to get input from @czender as well; we've inadvertently broken NCO-based workflows in the past. Assuming no problem there, we can change this behavior.

@WardF WardF self-assigned this Jun 6, 2018
@WardF WardF modified the milestones: future, 4.7.0 Jun 6, 2018
@czender
Copy link
Contributor

czender commented Jun 6, 2018

Thanks for checking. I do not foresee that this will have any noticeable affect on NCO workflows. I would like to check with NCO "make test" once this patch lands in master. Please ping me then.

@edhartnett
Copy link
Contributor Author

Excellent. Thanks Charlie! ;-)

This change is needed for Dennis' vars performance enhancement to work well with parallel I/O. I will package this with the vars changes.

@edhartnett
Copy link
Contributor Author

Actually there is a little complexity here, so I am going to start by just changing the default so that RELAX_COORD_BOUND is enabled. We will take it one step at a time...

@WardF
Copy link
Member

WardF commented Jun 7, 2018

Talking with @DennisHeimbigner, we are good with relaxing coordinate bounds by default, but we'd like to retain the option to turn the functionality off, at least in the near term, for debugging purposes if nothing else; hedging our bets against corner cases, so to speak.

@edhartnett
Copy link
Contributor Author

Yes, I agree. We will take it slow. ;-)

@edhartnett
Copy link
Contributor Author

Another aspect of the current configuration that is not ideal is that the setting is changed based on the setting of pnetcdf.

So the user may do --enable-zero-length-coord-bound and then have it disabled during configure, based on the fact that pnetcdf does not have zero-length-coord-bound enabled.

This will lead to confusing errors for the user, who has every reason to believe that the build worked, and gave what was requested, which was a build with zero-length-coord-bound enabled.

I am not going to fix this problem in this case, because we are taking this option out in the next release, so changing this behavior would only cause a lot of user confusion. Both we and the pnetcdf are changing the default value of this option so for a while netcdf users will see pnetcdfs which do not have this option enabled, and then, eventually, it will be enabled everywhere.

When asked to do something, the configure script should either do it, or error out. It should not try to be smarter than the user by ignoring or changing enable options that the user has requested.

@wkliao
Copy link
Contributor

wkliao commented Jun 22, 2018

I believe the current setting will print a warning message when a conflict is detected, but error out is a better solution.

@edhartnett
Copy link
Contributor Author

OK we have had relaxed coord bounds as the default in master for a few months and no problems.

I suggest that after the 4.6.2 release, I remove this option, as discussed above, and relaxed coord bounds will always be in place, with no option to turn them off.

@wkliao
Copy link
Contributor

wkliao commented Nov 17, 2018

Is completely removing this option necessary?

@edhartnett
Copy link
Contributor Author

edhartnett commented Nov 17, 2018

Is there a reason to keep it?

I think we discussed it pretty thoroughly and decided the coord bounds checking should always be relaxed. So what need for an option?

@wkliao
Copy link
Contributor

wkliao commented Nov 17, 2018

Existing applications using older netcdf versions may still check this error.

@edhartnett
Copy link
Contributor Author

Removing this error cannot affect existing applications. See the discussion above.

@wkliao
Copy link
Contributor

wkliao commented Nov 18, 2018

App may be designed to follow the old rules and spew NC_EINVALCOORDS as a stop condition.
To me, making RELAX_COORD_BOUND the default is sufficient.

@WardF
Copy link
Member

WardF commented Nov 18, 2018

No harm is done leaving this option in imo, in the absence of a pressing reason to remove it. In the event we decide to remove it, it would be marked deprecated, for a few releases at least, so that our community (corner cases or no) aren’t blindsided.

@edhartnett
Copy link
Contributor Author

edhartnett commented Nov 18, 2018

@wkliao that's not possible. Error conditions are being relaxed, not tightened, so the spew of NC_EINVALCOORDS would have occurred without relaxed coord bound checking, not with it. So loosening the checking is not going to cause new errors, and no existing code could be writing data with such calls, because such code would be returning errors.

@WardF the ability to turn off relaxed coord bounds checking is a bug. It needs to be removed.

The reason to remove this option is that it breaks netcdf compatibility across installations. I can write code that works fine on my HPC, but if I move it to a machine where someone has changed this at configure, then that code will break.

So if some poor NOAA programmer is trying to run a model on a workstation, which should work just fine, the model will crash with puzzling IO errors. But there is no reason - the code that works on the HPC will work just fine on the workstation as well, if we don't let hapless installers inadvertently break it.

(Recall that most science users will not compile netCDF, or read the compile documentation. They just use what has been installed. Not only that, they also don't write the I/O code, they just use what is in the model. So the end user will have no clue what is happening.)

NetCDF guarantees API compatibility, and non-relaxed coordinate bounds break that. So relaxed coord bound checking is necessary everywhere. We should not allow the user to turn it off, ever.

@edhartnett
Copy link
Contributor Author

edhartnett commented Dec 3, 2018

@wkliao I see that pnetcdf 1.9.0 has relaxed coord bound checking disabled by default.

Now that relaxed coord bound checking is enabled by netCDF-4 by default (and soon will be required in all cases) I assume you are going to change the pnetcdf default as well.

@wkliao
Copy link
Contributor

wkliao commented Dec 3, 2018

In PnetCDF 1.10.0, the default has been changed to enabled. Please check its release note.

I still think completely removing this configure-time option is not necessary, because historically all netCDF documents clearly define the rule for error code NC_EINVALCOORDS. Make this change as NetCDF default behavior has already broken that rule established there for more than 10 years.

When I created the patch, I need to modify nc_test as it is expecting NC_EINVALCOORDS in a few places. Below is one of changes.

#ifdef RELAX_COORD_BOUND
IF (err != NC_NOERR) /* allowed when edge[j]==0 */
EXPECT_ERR(NC_NOERR, err)
#else
IF (err != NC_EINVALCOORDS) /* not allowed even when edge[j]==0 */
EXPECT_ERR(NC_EINVALCOORDS, err)
#endif

It is possible there are existing applications that use the same practice. I agree with @WardF we should be cautious to avoid breaking existing workflows.

@edhartnett
Copy link
Contributor Author

edhartnett commented Jan 22, 2019

As agreed I have modified master so that being able to turn off relaxed coordinate bounds checking is deprecated and the user is told not to do it. I will remove the ability to disable relaxed coord bounds after the next release.

Turning off relaxed bounds checking is wrong and a bug. It should not be allowed under any circumstances. As already discussed at length this cannot break existing code, except code that was written to expect an error, which cannot work and effect changes on the file. (That is, it is incorrect netCDF API code).

The nc_test/test_put code should not be used as an example of what a user might do. Nor should it be used as a reason to make or not make changes. The test_put code can, should, and does get changed as we fix bugs and make changes. It also does a lot of things that no user will do.

No user code will exist which probes netCDF error conditions, like nc_test does. I want to know that such code exists in the real world before we start making decisions to protect such code.

Actually I don't get why there is so much argument here. It is clearly a bug to allow two different interpretations of how 0 count is treated. Allowing two interpretations allows the user to create code that works on one HPC installation, but mysteriously fails on another HPC installation. What a pain.

@WardF
Copy link
Member

WardF commented Jan 22, 2019

Thanks for making the changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants