-
Hi everyone, I am working with netCDF4 files. To my surprise, it takes over 5 minutes to open a 1 GB file with xarray. I also tried the netCDF4 library, which to my understanding is a dependency of xarray; there it only takes a couple of seconds. I already did some research on open_dataset performance issues. Most of them seemed to be related to time decoding, so I tried disabling it along with all the other decoding options I found, but I didn't see an improvement. Code example below and file can be downloaded here:
Also, I am working on a Mac with the M3 chip. I appreciate any ideas on how to improve performance, because the current state is unusable for me. Best regards :)
Replies: 4 comments 11 replies
-
@harzer99 I'd think the large number of variables (~8500) is the bottleneck here. They have to be aligned with respect to their dimensions, and these checks probably take quite some time. As all variables have the same structure (time, lat, lon) and even the same attributes, they could be merged into one very large variable (e.g. time, vname, lat, lon) with an additional coordinate vname holding the variable names. I'd need to look up whether any of the CLI tools (e.g. cdo) are capable of doing that. If not, you could do this once for your data using xarray. Maybe others have an easier solution to this problem.
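A minimal sketch of the merge idea, assuming xarray's `Dataset.to_array` is an acceptable way to do it. The tiny synthetic dataset, the `species_*` variable names, and the `prob` array name are made up for illustration and are not taken from the real file:

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for the real file: a few variables that all share
# the same (time, lat, lon) dimensions. Names and sizes are hypothetical.
coords = {"time": [0], "lat": [10.0, 20.0], "lon": [0.0, 1.0, 2.0]}
ds = xr.Dataset(
    {
        f"species_{i}": (("time", "lat", "lon"), np.full((1, 2, 3), float(i)))
        for i in range(4)
    },
    coords=coords,
)

# Stack every data variable along a new "vname" dimension; the variable
# names become the values of the "vname" coordinate.
merged = ds.to_array(dim="vname", name="prob")

# Reorder to the (time, vname, lat, lon) layout suggested above.
merged = merged.transpose("time", "vname", "lat", "lon")

# A single former variable is now selected by label instead of by name.
one_species = merged.sel(vname="species_2")
print(merged.dims, one_species.shape)
```

The merged array could then be written out once (for example with `to_netcdf`), so that later reads only have to handle a single variable. Note that the first conversion still has to open the original many-variable file.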
-
Why are you specifying |
-
I tried out many different libraries for opening the .nc4 file. I ended up building a function that loads the file with the h5py library, copies all of the data into a dictionary, and then creates an xr.Dataset from that dictionary. This only takes 23s. The function xr.load_dataset, which provides similar functionality, takes 515s.

import time

import h5py
import xarray as xr
from tqdm import tqdm


def custom_load(file_path):
    """Read every 3-D dataset in the file into a dict for xr.Dataset.from_dict."""
    with h5py.File(file_path, "r") as hdf_file:
        datasets = {}
        for ds_name, ds in tqdm(hdf_file.items()):
            # Only the 3-D datasets are data variables; skip everything else.
            if len(ds.shape) == 3:
                metadata = {attr_name: ds.attrs[attr_name] for attr_name in ds.attrs}
                ds_dict = {
                    "attrs": metadata,
                    "data": ds[:],  # reads the full array into memory
                    "dims": ["lat", "long", "prob"],
                }
                datasets[ds_name] = ds_dict
    return datasets


t0 = time.time()
file_path = "biodiversity/historical/bioscen15-sdm-gam_ewembi_nobc_hist_nosoc_co2_birdprob_global_30year-mean_1995_1995.nc4"
ds_dict = custom_load(file_path)
# runtime: 23s, Size in memory: 18GB
ds = xr.Dataset.from_dict(ds_dict)
print(f"custom load runtime {time.time()-t0}")
print(f"custom loaded dataset {ds}")
del ds

# runtime: 515s, Size in memory: 18GB
t0 = time.time()
ds = xr.load_dataset(file_path, decode_times=False)
print(f"xarray loaded dataset {ds}")
print(f"xr load runtime {time.time()-t0}")

Could someone verify this behaviour on an x86 machine? There is a chance that one of the compiled functions xarray calls is not compiled for Apple silicon. Afaik the machine then falls back to an emulator, which might be the performance issue.
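One way to test the emulation hypothesis locally is to ask the running Python interpreter whether it is being translated. The sketch below uses only the standard library. It assumes the macOS `sysctl.proc_translated` key, which reports 1 for a process running under Rosetta and 0 for a native one; the helper name `rosetta_status` is made up here:

```python
import platform
import subprocess


def rosetta_status():
    """Return 'native', 'translated', or 'unknown' for this Python process."""
    if platform.system() != "Darwin":
        return "unknown"  # the check below only makes sense on macOS
    try:
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # The key is missing on Intel Macs, so sysctl exits with an error.
        return "unknown"
    return {"0": "native", "1": "translated"}.get(out, "unknown")


# On Apple silicon, a native interpreter reports "arm64" here, while an
# x86 build running under Rosetta reports "x86_64".
print(platform.machine(), rosetta_status())
```

If this prints `arm64 native`, the interpreter itself is not emulated, which would make emulation an unlikely explanation for the slowdown.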
-
Apparently netCDF4._netCDF4.Variable.shape is quite slow: 20ms * 8500 vars = 170 seconds. This gets run twice so at least 340 seconds :) . But we can avoid that quite easily: 50% speedup here: https://github.com/pydata/xarray/pull/9067/files
Here's a profile for open_store_variable, I made a small edit to time NetCDF4ArrayWrapper separately.
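The arithmetic and the general "read the expensive attribute once" idea can be sketched as follows. This is an illustration only and not the code from the linked PR; `SlowVariable`, `describe_twice`, and `describe_once` are hypothetical stand-ins:

```python
import time

# Back-of-the-envelope cost from the numbers above, in integer milliseconds.
per_call_ms = 20
n_vars = 8500
one_pass_s = per_call_ms * n_vars / 1000   # one .shape lookup per variable
two_pass_s = 2 * one_pass_s                # the lookup currently runs twice


class SlowVariable:
    """Hypothetical variable whose .shape lookup is expensive."""

    def __init__(self, shape, delay=0.0):
        self._shape = shape
        self._delay = delay
        self.shape_calls = 0  # counts how often .shape was read

    @property
    def shape(self):
        self.shape_calls += 1
        time.sleep(self._delay)  # simulate the slow lookup
        return self._shape


def describe_twice(var):
    """Reads var.shape two times, paying the lookup cost twice."""
    ndim = len(var.shape)
    size = 1
    for n in var.shape:
        size *= n
    return ndim, size


def describe_once(var):
    """Reads var.shape one time and reuses the local copy."""
    shape = var.shape
    size = 1
    for n in shape:
        size *= n
    return len(shape), size


print(one_pass_s, two_pass_s)
```

Both helpers return the same result, but `describe_once` halves the number of slow lookups, which matches the roughly 50% speedup mentioned above.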