Unwarranted warning #11
Successfully wrote a whole year of eORCA12 1-day T-grid mean fields to a bucket (s3://npd12-j001-t1d-1976). Each month was successfully appended in turn, but the whole sequence took over 2 hours to complete. This isn't going to be feasible for all the output (the remaining U- and V-grid 1-day means, all 5-day means, all monthly means and annual means) unless the parallelisation via dask allows a significant speed-up and is reliable. One observed oddity is that the time_counter dataset has a chunk size of 31 (presumably fixed by the length of January?). This is despite the explicit request of:
in each call. Maybe this is related to the unwarranted warning? Chunk sizes have been respected elsewhere; just not for time_counter itself. E.g.:
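As a rough illustration of the kind of chunk request and append sequence being described, assuming a plain xarray → zarr workflow (the dataset, variable names, sizes, and store path below are hypothetical, and the actual msm_os call signature may differ):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for one month of 1-day T-grid means
# (names and sizes are illustrative, not the real eORCA12 output).
ds = xr.Dataset(
    {"votemper": (("time_counter", "y", "x"), np.zeros((31, 10, 10), dtype="f4"))},
    coords={"time_counter": np.arange(31)},
)

# Request a chunk size of 1 along time_counter for every variable (requires dask).
ds = ds.chunk({"time_counter": 1, "y": 10, "x": 10})

# First month creates the store; each subsequent month is appended along time.
ds.to_zarr("example_store.zarr", mode="w")
ds.to_zarr("example_store.zarr", mode="a", append_dim="time_counter")
```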
Hi @accowa, I included that warning to inform users that once new data is added to the object store, rechunking is not possible. However, I'll adjust the warning so that it only appears when users attempt to modify the chunking strategy. Regarding the time_counter: in my opinion, given that it's a 1D array of time data, we may not need to chunk it. If you prefer, I can include it in the chunking strategy. As for the execution time, Dask does speed up data uploads by parallelizing the transfer of each variable, but I am not sure how much the performance will increase. However, there are two main bottlenecks right now:
Please let me know if you're happy with these changes.
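A minimal sketch of the gating logic described above, assuming the requested chunking arrives as a plain dict and the existing chunks can be read back from the zarr store (the function, logger name, and attribute handling are hypothetical, not the msm_os implementation):

```python
import logging

import zarr

logger = logging.getLogger("msm_os_sketch")  # hypothetical logger name


def warn_if_rechunk_requested(store_path: str, requested_chunks: dict) -> None:
    """Warn only when the requested chunking differs from what is already stored."""
    group = zarr.open_group(store_path, mode="r")
    for name, array in group.arrays():
        dims = array.attrs.get("_ARRAY_DIMENSIONS", [])
        existing = dict(zip(dims, array.chunks))
        for dim, size in requested_chunks.items():
            if dim in existing and existing[dim] != size:
                logger.warning(
                    "%s already uses chunk size %s along %s; it cannot be "
                    "rechunked to %s on append.",
                    name, existing[dim], dim, size,
                )
                return
```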
The time_counter variable itself doesn't really need to be chunked, but a chunk size of 1 does need to be applied to the time_counter dimension, because most accesses of 2D and 3D slices/volumes will be done a time-slice at a time. All variables except time_counter have been chunked correctly, so it is odd that time_counter has been treated differently. We should get to the bottom of that. Yes, I think a flag to switch off the integrity check would be useful. The checks are useful when establishing new workflows, but we may need to take risks to achieve a workable solution. At least such a flag will identify whether the checks are indeed a major bottleneck. If it helps, I see the priority order as:
1. flag to suppress integrity checks
2. Dask robustness
3. time_counter chunking
4. suppress unwarranted warning
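To illustrate the access pattern behind the chunk-size-1 request (a sketch reusing the hypothetical store from the earlier example), pulling out a single time slice then touches one chunk per horizontal tile instead of reading a whole month of data to extract one day:

```python
import xarray as xr

# Open the store lazily and load one day of a 3D field.
ds = xr.open_zarr("example_store.zarr")
day_snapshot = ds["votemper"].isel(time_counter=0).load()
```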
Hi @accowa, I've implemented the following changes:
The time_counter non-chunking is likely an xarray bug. This looks relevant: pydata/xarray#6204
Thanks for pointing me to this link. Considering this, I think the only way to solve this problem now is either to rename the coordinate (which I don't recommend) or to use the zarr library (rather than xarray) the first time I upload the data. In that case, right after uploading the data, I would open the time_counter variable and rechunk it to the chosen strategy. The other data appended later will automatically follow the chunk strategy defined in the first upload. I will add this information to the new issue #14.
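A rough sketch of that workaround, assuming the zarr v2 API and a store already written by xarray (the store path is illustrative; copying the attributes preserves the _ARRAY_DIMENSIONS metadata that xarray needs to decode the coordinate):

```python
import zarr

# Open the group created by the first upload (the real store in this thread
# is an s3:// bucket; a local path is used here for illustration).
group = zarr.open_group("example_store.zarr", mode="a")

# Keep the existing time values and attributes before rewriting the array.
time_values = group["time_counter"][:]
time_attrs = dict(group["time_counter"].attrs)

# Recreate time_counter with a chunk size of 1; later appends then follow
# the chunking defined by this first write.
del group["time_counter"]
rechunked = group.create_dataset("time_counter", data=time_values, chunks=(1,))
rechunked.attrs.update(time_attrs)
```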
Version 0.1.1 seems to be working much better in initial tests. I'm writing a series of monthly 1-day means to the same bucket using successive calls with identical arguments (except for the input filename). I'm getting this warning on the second and subsequent calls:
☁ msm_os ☁ | WARNING | 2024-09-13 15:49:49 | You already have data in the object store and you can't rechunk it
which is unwarranted since the chunk settings haven't changed?