using chunks='auto' inside open_mfdataset call with a cftime coordinate #6332
-
I am trying to process some massively large climate timeseries (~7 TB) and would like to use `chunks="auto"` in my `open_mfdataset` call. My problem is that the present dataset has a cftime time coordinate, and the call fails when I open it this way.
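For reference, a minimal sketch of the kind of call being described (the file pattern is hypothetical; this is not code from the original post):

```python
import xarray as xr

# Hypothetical file pattern standing in for the ~7 TB multi-file dataset.
# The files use a non-standard calendar, so times decode to cftime objects.
ds = xr.open_mfdataset("climate_output_*.nc", chunks="auto")
```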
Is there any way out?
-
I think this suggests you have at least one data variable that contains `cftime.datetime` objects in your Dataset (likely in addition to a time coordinate, which will automatically be loaded into memory and not have this issue).
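As a minimal illustration of why this breaks (my own sketch, not from the original reply): dask's `"auto"` chunking needs a fixed itemsize to target a chunk byte size, and object-dtype arrays of `cftime.datetime` values don't have one.

```python
import cftime
import dask.array as da
import numpy as np

# An object-dtype array of cftime datetimes, similar to a decoded time-like
# data variable on a non-standard calendar.
times = np.array(
    [cftime.DatetimeNoLeap(2000, 1, day) for day in range(1, 11)], dtype=object
)

# dask cannot estimate the size in bytes of Python objects, so asking for
# "auto" chunks on an object-dtype array raises an error
# (NotImplementedError in current dask releases).
arr = da.from_array(times, chunks="auto")
```

The options below avoid creating such object-dtype dask arrays in the first place.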
A couple options would be:

1. Drop the time-like data variables through the `drop_variables` argument of the `open_mfdataset` call (if you don't need them and you know their names):

   ```python
   ds = xr.open_mfdataset(..., chunks="auto", drop_variables=["time_data_var1", "time_data_var2"])
   ```

2. Open the files with `decode_cf=False` and decode them after the fact:

   ```python
   ds = xr.open_mfdataset(..., chunks="auto", decode_cf=False)
   ds = xr.decode_cf(ds)
   ```

This gets a little more complicated if the times in your multi-file dataset are encoded with different units. In that case you need to provide a `preprocess` function that rebases all time-like variables onto a common set of units before decoding, for example:
```python
import cftime
import xarray as xr


def rebase_times_numpy(arr, units, calendar, target_units):
    # Decode the raw numbers to cftime dates, then re-encode them with the
    # common target units so every file ends up with the same encoding.
    times = cftime.num2date(arr, units, calendar)
    return cftime.date2num(times, target_units, calendar)


def rebase_times(da, units, calendar, target_units="microseconds since 1970-01-01"):
    kwargs = {"units": units, "calendar": calendar, "target_units": target_units}
    result = xr.apply_ufunc(rebase_times_numpy, da, dask="parallelized", kwargs=kwargs)
    return result.assign_attrs(units=target_units, calendar=calendar)


def _harmonize_time_like_dataarrays(d):
    # Collect every variable whose "units" attribute looks like a CF time
    # encoding (e.g. "days since ...") and rebase it to the target units.
    time_like_dataarrays = {}
    for name, da in d.items():
        if "units" in da.attrs and "since" in da.attrs["units"]:
            units = da.attrs["units"]
            calendar = da.attrs.get("calendar", "standard")
            time_like_dataarrays[name] = rebase_times(da, units, calendar)
    return time_like_dataarrays


def preprocess(ds):
    time_like_data_vars = _harmonize_time_like_dataarrays(ds.data_vars)
    time_like_coords = _harmonize_time_like_dataarrays(ds.coords)
    return ds.assign(time_like_data_vars).assign_coords(time_like_coords)


ds = xr.open_mfdataset(..., chunks="auto", decode_cf=False, preprocess=preprocess)
ds = xr.decode_cf(ds)
```

I'm just guessing here, though -- it might be helpful to see the output of …
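One practical follow-up (an editorial sketch, not part of the original reply): if you don't know which data variables are time-like, you can open a single file with decoding turned off and look for variables whose `units` attribute contains `"since"`; those are the candidates for `drop_variables` or for the rebasing above. The filename here is hypothetical.

```python
import xarray as xr

# Open one sample file without CF decoding, so the raw "units" and
# "calendar" attributes are still attached to each variable.
sample = xr.open_dataset("climate_output_0001.nc", decode_cf=False)

time_like = [
    name
    for name, var in sample.variables.items()
    if "since" in str(var.attrs.get("units", ""))
]
print(time_like)  # names you could pass to drop_variables=... in open_mfdataset
```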