Some iris netcdf saves are fetching all the data. #5753
Comments
Further notes: I already investigated this in some depth, and it looks like the problem is the delayed-write mechanism introduced with #5191. So I made some experiments to replicate the problem, using synthetic data (similar to the reproduction example) and a variety of storage mechanisms, like this:
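(The original experiment script is not included here. The following is a minimal sketch of what such "store_op" variants could look like; the function names, option labels, array sizes, and the memmap stand-in for a netCDF variable are illustrative assumptions, not the actual test code.)

```python
# Sketch only: synthetic lazy data stored via several different mechanisms.
import dask
import dask.array as da
import numpy as np


def make_data():
    # ~100 chunks of ~8Mb each (sizes are illustrative).
    return da.random.random((100, 1000, 1000), chunks=(1, 1000, 1000))


def run_store(store_op, path="tmp_store.npy"):
    data = make_data()
    # A simple on-disk __setitem__ target, standing in for a netCDF variable.
    target = np.lib.format.open_memmap(
        path, mode="w+", dtype=data.dtype, shape=data.shape
    )

    if store_op == "direct":
        # Plain da.store: streams chunk-by-chunk, memory stays ~chunk-sized.
        da.store(data, target)

    elif store_op == "lazy-via-delayed":
        # Wrap the lazy store in an outer delayed, with an extra *non-lazy*
        # "data"-like argument passed to the delayed function.
        lazy_store = da.store(data, target, compute=False)

        @dask.delayed
        def complete(store_result, extra_info):
            return extra_info

        complete(lazy_store, "non-lazy-extra").compute()

    elif store_op == "lazy-combined":
        # Also pass a delayed calculation on the *same* source data
        # (analogous to a fill-value check) into the outer delayed.
        lazy_store = da.store(data, target, compute=False)
        fill_check = (data == 1e20).any()

        @dask.delayed
        def complete(store_result, fill_check_result):
            return fill_check_result

        complete(lazy_store, fill_check).compute()
```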
Results: with similar data to the original case, the various "store_op" options above showed these memory usages:
.. which means:
- wrapping the "lazy store operation" in a delayed is not in itself a problem (the 'lazy-via-delayed' option), even with the additional "data" argument to the delayed function (in this case not a lazy object)
- BUT passing an extra delayed argument which is a calculation on the same data does cause the problem (the 'lazy-combined' case)
- HOWEVER, the 'lazy-scanandstore' option, I think, shows a possible way out (see the sketch of that idea below).
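(For illustration only, here is a minimal sketch of the one-pass "scan and store" idea that 'lazy-scanandstore' points at: wrap the storage target so that each chunk is checked for fill-value collisions as it is written, so a single da.store pass both checks and stores. This is not the iris implementation; the class and variable names are made up.)

```python
# Hypothetical sketch: check each chunk for fill-value hits as it is stored.
import dask.array as da
import numpy as np


class ScanAndStoreTarget:
    """Wrap a __setitem__ target, recording fill-value collisions per chunk."""

    def __init__(self, target, fill_value):
        self.target = target
        self.fill_value = fill_value
        self.fill_value_seen = False

    def __setitem__(self, keys, chunk):
        # The check happens chunk-by-chunk, inside the single store pass,
        # so the source data is only generated once.
        self.fill_value_seen |= bool(np.any(chunk == self.fill_value))
        self.target[keys] = chunk


# Usage: a single da.store over the wrapped target.
data = da.random.random((20, 1000, 1000), chunks=(1, 1000, 1000))
plain_target = np.lib.format.open_memmap(
    "tmp_scanstore.npy", mode="w+", dtype=data.dtype, shape=data.shape
)
wrapped = ScanAndStoreTarget(plain_target, fill_value=1e20)
da.store(data, wrapped)
print("fill value seen:", wrapped.fill_value_seen)
```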
@bouweandela @fnattino have you seen any examples of this?
Have you seen dask/dask#8380 @pp-mo? I think it may be relevant here, but I'm not entirely sure yet.
I suspect this issue is caused by some variation of the issue described here. Adding

    if arraylib is da:
        from dask.graph_manipulation import clone
        data = clone(data, assume_layers=True)

before this code (iris/lib/iris/fileformats/netcdf/saver.py, line 314 at c6151e8) seems to avoid the high memory use, by decoupling the store graph from the fill-value check graph (at the cost of generating the source data twice).
Well, interesting. But TBH I don't really understand what this is actually doing. FWIW I think we are very keen not to fetch the data twice if it can at all be avoided -- and if we accepted that, we could simply do fill-value checking and storage in separate steps. Do you think this approach could possibly be modified to deliver the desired one-pass data store-and-check, @bouweandela?
Relevant issue raised on the Dask repo.
From a support issue: loading a large dataset from a GRIB file and saving to netcdf.
The example data was about 16Gb, and the save was running out of memory.
It seems that at least some netcdf saves cannot be performed correctly in a chunk-by-chunk, streamed manner.
Simple example to reproduce
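(The original reproduction script and its output are not included here. A minimal sketch of the same kind of save, assuming synthetic lazy data, illustrative sizes, and a simple peak-memory readout, would be:)

```python
# Sketch only: sizes, names and the memory measurement are illustrative.
import resource

import dask.array as da
import iris
from iris.cube import Cube

# ~800Mb of lazy data, in ~8Mb chunks.
data = da.zeros((100, 1000, 1000), chunks=(1, 1000, 1000), dtype="f8")
cube = Cube(data, var_name="x")

iris.save(cube, "tmp_iris.nc")

# On Linux, ru_maxrss is in Kb: report the peak resident memory of the run.
peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"peak memory use ~ {peak_mb:.0f} Mb")
```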
from which, a sample output:
For comparison, it is OK in xarray (!! sorry 😬 !!):
(using the same data array)
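(Again, the original xarray script is not shown; a sketch of the equivalent save, assuming the same lazy `data` array as above and made-up dimension names, might be:)

```python
# Sketch only: xarray streams the same lazy dask array to netCDF.
import xarray as xr

xr_ds = xr.Dataset({"x": (("t", "y", "z"), data)})
xr_ds.to_netcdf("tmp_xarray.nc")
```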
from which..
Expected result
The total memory required is expected to be only around N, or maybe N+1, chunk sizes,
where N is the number of Dask workers
-- which here was 4 (threads or "cpus").
So in this case, at approx 8Mb per chunk, we expect ~32-40 Mb, as seen in the xarray case.