Xclim Standardised Precipitation Index always throws error #1270
Comments
Hi @GrahamReveley, thanks for the issue! Indeed, the default xarray behaviour is to return one chunk per resample period, which is what triggers this error. However! I believe that the xarray plugin flox takes care of that! It implements better groupby and resampling algorithms in the context of dask and xarray. If my memory serves, once this package is installed in your environment, xarray's resample on a single time chunk will NOT generate multiple chunks; it will rather preserve the single-chunk structure. So, if my intuition is correct, I think we could improve the documentation by mentioning flox, and maybe not raise this warning in the case where it is unavoidable (the non-flox case).
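As a rough illustration (assuming flox is installed alongside xarray and dask, and a hypothetical daily precipitation file `pr.nc`), one can check whether the resampled result keeps a single time chunk:

```python
import xarray as xr

# Hypothetical daily precipitation array with a single chunk along time.
pr = xr.open_dataarray("pr.nc", chunks={"time": -1, "lat": 50, "lon": 50})

# With flox installed, xarray uses it automatically for groupby/resample
# reductions (this can be toggled with xr.set_options(use_flox=True)).
monthly = pr.resample(time="MS").sum()

# Without flox, this typically shows one chunk per month along time;
# with flox, the single time chunk should be preserved.
print(monthly.chunksizes["time"])
```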
Hi @aulemahal, thanks for the quick response! It would be good to get an intuition as to where in the code the xarray warning is coming from, as I think this is causing my dask scheduler to crash with too many tasks (>200,000). Do you have any suggestions as to how to speed up using SPI, or xclim in general? I'm currently working on a fairly large AWS instance with a dask local cluster and struggling to get the speeds I expected.
Clearly, having flox installed should help here. Outputting to zarr datasets has also proven more efficient to me in many cases, as parallel writing is better supported AFAIK. Sadly, in a few of my largest workflows, the only solution I found to make them run reasonably well was often to compute one indicator at a time (instead of merging all of them in a dataset and writing it in one go), and even to divide long time series into sections: write a few years at a time and merge those together afterward. This last solution was mostly needed for indicators that perform rolling operations. All that said, maybe @coxipi has more insight on SPI-specific optimizations?
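A rough sketch of that "few years at a time" workaround, under assumed names (`pr` is the dask-backed precipitation array from the discussion and `compute_indicator` is a placeholder for any xclim indicator call), might look like:

```python
import xarray as xr

# Hypothetical time blocks; each block is computed and written separately
# to keep the dask graph small, then everything is merged at the end.
periods = [("2010", "2019"), ("2020", "2029"), ("2030", "2039")]

for start, end in periods:
    sub = pr.sel(time=slice(start, end))
    out = compute_indicator(sub)  # placeholder for the xclim call
    out.to_dataset(name="spi").to_zarr(f"spi_{start}_{end}.zarr", mode="w")

# Merge the pieces afterward.
parts = [xr.open_zarr(f"spi_{start}_{end}.zarr") for start, end in periods]
merged = xr.concat(parts, dim="time")
```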
@aulemahal Thanks a lot for the comments. I get that going down the rabbit hole of making the code fully dask-optimised could cause some issues in the future. I'll have a look into splitting my data into time chunks prior to performing the SPI calculation and stitching them back together again at the end, if that works faster.
@aulemahal I was about to comment the same thing (flox). I didn't try the function myself with large datasets, but there was a study by an intern this summer. @RondeauG, did Élise have crashing kernel problems in her study? Was she using flox?
For reference, the datasets that I'm using are around 32000 (time) x 400 (lat) x 600 (lon), with 32000 being 2010-2100 daily data. This obviously gets reduced in the resampling to months. However, even after using only 10 years' worth of data (3650 x 400 x 600), I'm still seeing over 380,000 tasks in my dask graph, which really is an awful lot of tasks to get through, even on a big machine. Would the internal use of flox reduce the number of tasks in this graph? I think the strain on the centralised dask scheduler is one of the reasons it's running so slowly.
I believe it would. Flox would at least reduce the number of tasks in the resampling step.
@coxipi I was going to write an issue on this very topic soon! Élise didn't have an issue, but she was working with a smaller dataset. However, I've been continuing her work recently, and for larger domains I noticed the issue mentioned above. What helped was pre-computing the monthly data and saving it with no chunks. However, I'm still not 100% sure what worked in those steps. I'll also try to see if flox helps. Edit: Forgot to mention, but I am computing SPEI1, 3, 6, 9, and 12.
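A hedged sketch of that kind of workflow (the file name, calibration years, and distribution/method choices below are assumptions, not taken from the comment):

```python
import xarray as xr
from xclim.indices import standardized_precipitation_evapotranspiration_index

# Monthly water budget pre-computed and saved beforehand (hypothetical file),
# re-opened with a single chunk along time and spatial chunks only.
wb = xr.open_dataarray("wb_monthly.nc", chunks={"time": -1, "lat": 50, "lon": 50})
wb_cal = wb.sel(time=slice("1991", "2020"))  # assumed calibration period

# One SPEI per accumulation window (SPEI1, 3, 6, 9, 12).
spei = {
    w: standardized_precipitation_evapotranspiration_index(
        wb, wb_cal=wb_cal, freq="MS", window=w, dist="gamma", method="APP"
    )
    for w in (1, 3, 6, 9, 12)
}
```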
EDIT: Following the discussion on large datasets. I only discovered xclim this week and am working through the readthedocs (fair point, the package is new, but the docstrings could be improved). The reason I am commenting here is the SPI calculation run time and GrahamReveley's comments about the huge number of tasks: are there any benchmarks on run times? I am running the following dataset locally, on average hardware: xarray.DataArray 'pr', time: 7670, y: 125, x: 185 (daily data between 1985 and 2005). I don't have pr_cal data, so I am feeding the same pr as pr_cal (the documentation is not clear on this point), and I am now wondering if I should perhaps break this up into smaller arrays, say 25x25, but any optimisation strategy is welcome, as I would like to run future predictions as well (time: 35000). I have been looking for a package like this for some time and it looks great, so I am happy to contribute (by testing) and collaborate where possible.
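If spatial blocks are the route taken, a minimal sketch (the array name and chunk sizes here are assumptions) would keep time as a single chunk and split only the spatial dimensions:

```python
# Keep the full time series in one chunk (required by the fitting step)
# and split the domain into 25x25 spatial blocks.
pr_blocked = pr.chunk({"time": -1, "y": 25, "x": 25})
```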
I too am facing issues with the aforementioned rechunking warnings.
@RondeauG I am interested in how exactly you achieved that. My input data is already monthly, so I do not actually need any resampling to be done. How did you modify the SPI function in xclim? By modifying the source code of the installed package?
Exactly. Since I frequently mess around in xclim, I installed it in my conda environment as an editable install, so I can modify the source code directly. As for the changes, I commented out the section related to resampling and rechunking in the standardized-index code. It has sped up my code, but do be aware that you'll still get a performance warning because of the rolling operation.
Thanks for the feedback @jamaa! What size is your data? I will look closer at this soon. I'll add something like an option to skip the resampling step when the input is already at the target frequency. In the meantime, you can use a subset of pr as pr_cal.
With this, I reduced the number of tasks down to 33% of the original value on a test dataset.
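For concreteness, a hedged sketch of the "subset of pr as pr_cal" suggestion (the calibration years and the distribution/method choices are assumptions):

```python
from xclim.indices import standardized_precipitation_index

# Use a slice of the same array as the calibration data (years are assumed).
spi3 = standardized_precipitation_index(
    pr=pr,
    pr_cal=pr.sel(time=slice("1985", "2005")),
    freq="MS",
    window=3,
    dist="gamma",
    method="APP",
)
```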
My data is of size time: 385, y: 155, x: 188 (monthly data). I am very much in favour of adding an option to skip resampling. I think your other idea of using a subset of pr to determine pr_cal could actually worsen performance in my use case, because I want to calculate SPI for different forecast datasets while always using the same pr_cal. So maybe it could be an option?
In my use case, another possible performance improvement could be achieved by having the ability to pre-compute the fitting parameters and then pass them to the SPI function, similar to what is possible in the climate_indices package here. |
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes / features)
  - This PR fixes #1270, fixes #1416 and fixes #1474
- [x] Tests for the changes have been added (for bug fixes / features)
- [x] (If applicable) Documentation has been added / updated (for bug fixes / features)
- [x] HISTORY.rst has been updated (with summary of main changes)
  - [x] Link to issue (:issue:`number`) and pull request (:pull:`number`) has been added

### What kind of change does this PR introduce?
* Make SPI/SPEI faster.
* Fit params are now modular and can be computed before computing SPI/SPEI. This allows more options to segment computations and makes it possible to obtain the fitting params when troubleshooting is needed.
* Time indexing is now possible.
* `dist_method` now avoids `vectorize=True` in its `xr.apply_ufunc`. This is the main improvement in SPI/SPEI.
* Better document the limits of usage of standardized indices. Standardized indices are now capped at extreme values of ±8.21; this bound is a limit resulting from the use of float64.

### Does this PR introduce a breaking change?
Yes. `pr_cal` and `wb_cal` will not be input options in the future:

> Inputting `pr_cal` will be deprecated in xclim==0.46.0. If `pr_cal` is a subset of `pr`, then instead of `standardized_precipitation_index(pr=pr, pr_cal=pr.sel(time=slice(t0, t1)), ...)`, one can call `standardized_precipitation_index(pr=pr, cal_range=(t0, t1), ...)`. If for some reason `pr_cal` is not a subset of `pr`, then the following approach will still be possible: `params = standardized_index_fit_params(da=pr_cal, freq=freq, window=window, dist=dist, method=method)`, then `spi = standardized_precipitation_index(pr=pr, params=params)`. This approach can be used in both scenarios to break up the computation in two steps, i.e. get the params, then compute the standardized indices.

I could revert this breaking change if we prefer. This was a first attempt to make the computation faster, but the improvements are now independent of this change. We could also keep the modular structure for params but revert to `pr_cal` instead of `cal_range`. It's a bit less efficient when `pr_cal` is simply a subset of `pr`, because you end up doing the resampling/rolling twice on the calibration range for nothing. When first computing `params` and then obtaining `spi` in two steps, it makes no difference.

### Other information:
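To illustrate the two usage patterns described above (the variable names, calibration bounds, and the import location of `standardized_index_fit_params` are assumptions; only the call shapes come from the PR text):

```python
from xclim.indices import standardized_precipitation_index
from xclim.indices.stats import standardized_index_fit_params  # location may differ

t0, t1 = "1991", "2020"  # assumed calibration bounds

# Option 1: calibration range taken directly from `pr` (new `cal_range` argument).
spi = standardized_precipitation_index(pr=pr, cal_range=(t0, t1), freq="MS", window=1)

# Option 2: pre-compute the fitting parameters, then reuse them. This also
# works when the calibration data is not a subset of `pr`.
params = standardized_index_fit_params(
    da=pr_cal, freq="MS", window=1, dist="gamma", method="APP"
)
spi = standardized_precipitation_index(pr=pr, params=params)
```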
Setup Information
Description
I'm attempting to calculate the Standardised Precipitation Index using xclim with two data arrays (a reference period and a projection). Even when specifying the following chunks:
```python
import xarray as xr

hist = xr.open_dataarray("./historic.nc", chunks={"time": -1, "lat": 50, "lon": 50})
```
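For completeness, the call that then fails might look roughly like this (the projection file name and the indicator parameters below are assumptions, not taken from the report):

```python
from xclim.indices import standardized_precipitation_index

proj = xr.open_dataarray("./projection.nc", chunks={"time": -1, "lat": 50, "lon": 50})

# Reference period used for calibration, projection as the data to standardize.
spi = standardized_precipitation_index(
    pr=proj, pr_cal=hist, freq="MS", window=1, dist="gamma", method="APP"
)
```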
The SPI indicator function always throws the following error:
From a quick look at the source code, this is due to the resample function in xarray not preserving the time-based chunking when resampling, i.e. the result is chunked time: 1, lat: XX, lon: XX after resampling. Therefore the following code (from the source code) always throws an error, which is extremely confusing.
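A quick way to see this behaviour (using the `hist` array opened above) is to inspect the chunks before and after resampling:

```python
# A single chunk along time before resampling...
print(hist.chunksizes["time"])

# ...but one chunk per month afterwards (without flox), which is what
# trips the single-time-chunk check inside xclim.
monthly = hist.resample(time="MS").mean()
print(monthly.chunksizes["time"])
```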
This also leads to the following Xarray warning here:
/home/ubuntu/.local/lib/python3.8/site-packages/xarray/core/indexing.py:1374: PerformanceWarning: Slicing with an out-of-order index is generating 95 times more chunks
  return self.array[key]
This could also explain the increasing number of dask tasks generated by the SPI indicator function, as I think multiple rounds of chunking (one done by the user and one internally) can drastically increase the number of tasks and the movement of data around in RAM.
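One way to gauge this (a rough check, not an xclim feature) is to count the tasks in the dask graph of the lazy result before computing:

```python
# `spi` is the lazy result from the call above; the number of keys in its
# dask graph gives a rough idea of the scheduler load.
n_tasks = len(spi.data.__dask_graph__())
print(f"{n_tasks} tasks in the graph")
```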
Something to look out for.
Let me know if there's something wrong with what I'm saying here, happy to discuss.
Thanks!
Steps To Reproduce
No response
Additional context
No response