-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RELAX_COORD_BOUND should be always used and should be removed as an option #1010
Comments
One clue that the current approach is incorrect is the fact that the build does not work if I use --enable-zero-length-coord-bound. If an option is added, then tests must be modified in addition to library code, so that tests still pass with the configure option. (OK. I take that back. The build is only failing because of the new vars functions. Without them, the build does pass with --enable-zero-length-coord-bound and I see that the tests have been modified to accommodate this option.) |
@wkliao and @gsjaardema does pnetcdf offer this option only to remain compatible with netCDF? That is, would pnetcdf prefer to always use relaxed coordinate bounds checking? Or is there some use-case you know of where the stricter checking is necessary? |
This feature was developed per Greg's request. It was added to both PnetCDF and NetCDF around the same time. In #243, Greg mentioned a use case. |
The decision to make it a configure time option is the back compatibility issue. |
Let me propose a set of assertions about this issue.
|
Not sure the section of the specification you cite is relevant. Relaxing the coord bound does not change what kind of file can be created. A dimension size of 0 is still not allowed for non-record dimensions. (It's interesting that cdf5 allows this. However, seems like you could not create a CDF5 file with with netcdf with a non-record dimension of 0 because nc_def_dim() should reject it.) With relaxed coord bounds, the dimension does not get a length of 0. What is 0 is the count being read/written in an nc_get/put_vara_* call. It changes whether a get/put operation will return error when asked to read/write 0 elements of data, or just return NC_NOERR and do no read/write. In terms of parallel I/O it is reasonable for some processes to use counts of 0. For collective I/O, it is required to do so, if there is not data to read/write from all processes. The processes that are not reading/writing have to do something (or the whole app will hang), so what they do is a read/write with a count of 0. This causes the netcdf-c code to create an empty HDF5 dataspace on that process, and do the read/write. (It is a perfectly valid operation in HDF5 to do a read/write for 0 elements.) Other processes in the read/write have non-zero counts, and create non-empty dataspaces for thier reads/writes. It's extremely unlikely to the point of not worth worrying about that a sequential netCDF user would write code that depends on counts of 0 being rejected instead of ignored. What would be the reason for such code? Furthermore, the fact that the counts of 0 are no longer rejected will not cause any change in the netCDF data operations for existing code. Whether the get/put function returns error or NC_NOERR, no data are read/written, because one of the counts is 0. Therefore this cannot break any operational user code. If they are currently getting an error (and ignoring it), then after this change they will not get an error, but their file input/output will remain the same. If they are catching the error and handling it, then they now will not have an error to handle, but the file will still remain as they expect. Coord bounds are going to have to be relaxed for all parallel I/O operations, not just pnetcdf. What we are discussing is whether they should also be relaxed for sequential reads/writes. Does maintaining two methods of checking coord bounds make sense? Seems like we could all switch to relaxed coord bound checking everywhere. Although the travis runs catch some things, and my CI system catches other things, currently no one is testing with relaxed coord bounds. That's a not ideal, considering that this code is at the heart of all read/write operations. A configure option doubles the amount of testing that should be done. Relaxing coord bounds is something that should be done always, to help accommodate HPC needs, and minimize testing, development, and maintenance efforts. Let's eliminate this unneeded option instead of working around it for the next decade. It's doing no good. |
I just wanted to be clear about the zero dimension issue. |
Correct about the zero dimension size, but I can see no other way in which a zero
|
To be clear, we can do this without technically breaking our backward compatibility |
Here is how a zero count can occur. 4 tasks are doing I/O, and they are in collective mode (the default). They all execute code like this:
So a dim of length 100 has been defined, and a var that uses it. Since metadata is always a collective operation, all 4 tasks must execute the above code. Now task 0 wants to fill the whole variable's data. Task 0 can do this:
But the whole application will hang if only task 0 does the nc_put_var_int(), because these are collective operations, which means that all tasks on the communicator must participate. So tasks 1, 2, and 3 also have to do a nc_put_vara_int() call, to the same variable. But they don't have any data to write, so they use a count of 0.
Now every task has called nc_put_vara_int(), as required for a collective operation, and all get NC_NOERR. In practice this will act as an MPI_Barrier() in the code, as all tasks will wait until all other tasks get to the nc_put_vara_int(), and execute it together. Task 0 will do all the work, and the others will just wait. When the collective operation is complete, all tasks will proceed with execution. So that is why we need counts of 0. For advanced parallel I/O, it's not unusual for only some of the tasks to have data to read/write. |
In terms of backward compatibility, we have indeed promised API compatibility as well as binary format compatibility. By relaxing the error code checking we do not break API backwards compatibility. Every program that used to work is guaranteed to still work. If we were making the error checking more strict, we would be breaking backward compatibility because working programs would no longer work. The current state of affairs seems like it's flirting with API incompatibility. A HPC programmer on a system with RELAX_COORD_BOUNDS set may develop some parallel I/O code. Moving to another system (where RELAX_COORD_BOUNDS is not set) will result in getting errors unexpectedly. (And quite confusingly too.) |
This is incorrect. CDF-5 behaves just like CDF-1 and 2. @edhartnett, please check the PnetCDF 1.8.0 release note. Both NetCDF and PnetCDF return the error code NC_EINVALCOORDS when the value of start is larger than or equal to the defined dimension size, no matter what count value is. When --enable-relax-coord-bound is used, start can be equal to dimension size only if count is zero. It is not whenever count is zero. I try to make PnetCDF consistent with NetCDF, including error codes. If NetCDF folks decide to change the default behavior, I can make PnetCDF do the same. To ease the backward compatibility issue, I plan to create a new PnetCDF hint that can be set at the run time to enable/disable this feature. So users do not have to modify their programs. |
I stand corrected. |
OK I am not hearing any dissent in terms of making the RELAX_COORD_BOUNDS be always in effect. After we do so, pnetcdf can do the same. |
I'd like to get input from @czender as well; we've inadvertently broken NCO-based workflows in the past. Assuming no problem there, we can change this behavior. |
Thanks for checking. I do not foresee that this will have any noticeable affect on NCO workflows. I would like to check with NCO "make test" once this patch lands in master. Please ping me then. |
Excellent. Thanks Charlie! ;-) This change is needed for Dennis' vars performance enhancement to work well with parallel I/O. I will package this with the vars changes. |
Actually there is a little complexity here, so I am going to start by just changing the default so that RELAX_COORD_BOUND is enabled. We will take it one step at a time... |
Talking with @DennisHeimbigner, we are good with relaxing coordinate bounds by default, but we'd like to retain the option to turn the functionality off, at least in the near term, for debugging purposes if nothing else; hedging our bets against corner cases, so to speak. |
Yes, I agree. We will take it slow. ;-) |
Another aspect of the current configuration that is not ideal is that the setting is changed based on the setting of pnetcdf. So the user may do --enable-zero-length-coord-bound and then have it disabled during configure, based on the fact that pnetcdf does not have zero-length-coord-bound enabled. This will lead to confusing errors for the user, who has every reason to believe that the build worked, and gave what was requested, which was a build with zero-length-coord-bound enabled. I am not going to fix this problem in this case, because we are taking this option out in the next release, so changing this behavior would only cause a lot of user confusion. Both we and the pnetcdf are changing the default value of this option so for a while netcdf users will see pnetcdfs which do not have this option enabled, and then, eventually, it will be enabled everywhere. When asked to do something, the configure script should either do it, or error out. It should not try to be smarter than the user by ignoring or changing enable options that the user has requested. |
I believe the current setting will print a warning message when a conflict is detected, but error out is a better solution. |
OK we have had relaxed coord bounds as the default in master for a few months and no problems. I suggest that after the 4.6.2 release, I remove this option, as discussed above, and relaxed coord bounds will always be in place, with no option to turn them off. |
Is completely removing this option necessary? |
Is there a reason to keep it? I think we discussed it pretty thoroughly and decided the coord bounds checking should always be relaxed. So what need for an option? |
Existing applications using older netcdf versions may still check this error. |
Removing this error cannot affect existing applications. See the discussion above. |
App may be designed to follow the old rules and spew NC_EINVALCOORDS as a stop condition. |
No harm is done leaving this option in imo, in the absence of a pressing reason to remove it. In the event we decide to remove it, it would be marked deprecated, for a few releases at least, so that our community (corner cases or no) aren’t blindsided. |
@wkliao that's not possible. Error conditions are being relaxed, not tightened, so the spew of NC_EINVALCOORDS would have occurred without relaxed coord bound checking, not with it. So loosening the checking is not going to cause new errors, and no existing code could be writing data with such calls, because such code would be returning errors. @WardF the ability to turn off relaxed coord bounds checking is a bug. It needs to be removed. The reason to remove this option is that it breaks netcdf compatibility across installations. I can write code that works fine on my HPC, but if I move it to a machine where someone has changed this at configure, then that code will break. So if some poor NOAA programmer is trying to run a model on a workstation, which should work just fine, the model will crash with puzzling IO errors. But there is no reason - the code that works on the HPC will work just fine on the workstation as well, if we don't let hapless installers inadvertently break it. (Recall that most science users will not compile netCDF, or read the compile documentation. They just use what has been installed. Not only that, they also don't write the I/O code, they just use what is in the model. So the end user will have no clue what is happening.) NetCDF guarantees API compatibility, and non-relaxed coordinate bounds break that. So relaxed coord bound checking is necessary everywhere. We should not allow the user to turn it off, ever. |
@wkliao I see that pnetcdf 1.9.0 has relaxed coord bound checking disabled by default. Now that relaxed coord bound checking is enabled by netCDF-4 by default (and soon will be required in all cases) I assume you are going to change the pnetcdf default as well. |
In PnetCDF 1.10.0, the default has been changed to enabled. Please check its release note. I still think completely removing this configure-time option is not necessary, because historically all netCDF documents clearly define the rule for error code NC_EINVALCOORDS. Make this change as NetCDF default behavior has already broken that rule established there for more than 10 years. When I created the patch, I need to modify nc_test as it is expecting NC_EINVALCOORDS in a few places. Below is one of changes. Lines 781 to 787 in 5d8f900
It is possible there are existing applications that use the same practice. I agree with @WardF we should be cautious to avoid breaking existing workflows. |
As agreed I have modified master so that being able to turn off relaxed coordinate bounds checking is deprecated and the user is told not to do it. I will remove the ability to disable relaxed coord bounds after the next release. Turning off relaxed bounds checking is wrong and a bug. It should not be allowed under any circumstances. As already discussed at length this cannot break existing code, except code that was written to expect an error, which cannot work and effect changes on the file. (That is, it is incorrect netCDF API code). The nc_test/test_put code should not be used as an example of what a user might do. Nor should it be used as a reason to make or not make changes. The test_put code can, should, and does get changed as we fix bugs and make changes. It also does a lot of things that no user will do. No user code will exist which probes netCDF error conditions, like nc_test does. I want to know that such code exists in the real world before we start making decisions to protect such code. Actually I don't get why there is so much argument here. It is clearly a bug to allow two different interpretations of how 0 count is treated. Allowing two interpretations allows the user to create code that works on one HPC installation, but mysteriously fails on another HPC installation. What a pain. |
Thanks for making the changes. |
As I have been working on this vars code I have had a chance to look into RELAX_COORD_BOUND.
From configure.ac:
To summarize the history, the pnetcdf team and other parallel I/O programmers requested a relaxation of coordinate boundary checking to accommodate counts of 0. They submitted a PR that was made an optional change for netCDF at build time. If RELAX_COORD_BOUND is set, then coordinate boundary checking is relaxed enough to accommodate 0 counts.
The problem is, now a user can write code that works on one machine with RELAX_COORD_BOUND, but breaks on another, which does not have RELAX_COORD_BOUND. And that will be very mysterious. It breaks netCDF API compatibility for HPC users.
The reason this was made optional was to maintain compatibility with existing tests, particularly the difficult nc_test program. And I certainly understand the reluctance to attempt changes to that test code. But this is a cure worse than the disease.
The netcdf rules for coordinate boundary checking were written before parallel I/O. With parallel I/O it is quite reasonable to allow some tasks to have counts of 0. The relaxed coordinate boundary rules of pnetcdf are quite reasonable, and fine for sequential netCDF as well.
Since surely no one is depending on getting an error code, we should make RELAX_COORD_BOUND permanent, and modify our too-strict tests.
If no one objects, then I will make the changes as part of my vars work.
Some history:
https://github.com/Unidata/netcdf-c/pull/319/files
#243
The text was updated successfully, but these errors were encountered: