
CDO-like convenience methods to select times #557

Open
markelg opened this issue Sep 2, 2015 · 9 comments

@markelg (Contributor) commented Sep 2, 2015

I feel the time-selection features of xray can be improved. Currently, some common operations are too involved or verbose, like selecting the data for a group of months that is not a standard season (e.g. the Indian monsoon season, JJAS), or for non-consecutive years (e.g. El Niño years). I think it would be great (and easy) to implement some methods inspired by the widely used Climate Data Operators https://code.zmaw.de/projects/cdo For example: selyear, selmon, selday and selhour.

Then we could easily do a composite of JJAS seasons in El Niño years like this:

pr_dataset = xray.open_dataset(my_precipitation_dataset)
elnino_years = [year list here...]
pr_dataset.selyear(elnino_years).selmon([6, 7, 8, 9]).mean('time')

This would make me very happy. The way to go would be to write methods that call groupby, then select the years/months, merge them, and return the corresponding dataset/dataarray, but I am not sure what the most efficient way to do this is.
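
A rough sketch of what such a groupby-based helper might look like (selmon here is a hypothetical name, not an existing xray method; this assumes iterating over a GroupBy object yields (label, group) pairs):

import xray  # the project has since been renamed to xarray

def selmon(ds, months):
    # Group along the time dimension by calendar month, keep only the
    # requested groups, and concatenate them back together along time.
    kept = [group for month, group in ds.groupby("time.month")
            if month in months]
    return xray.concat(kept, dim="time")

# JJAS composite: selmon(pr_dataset, [6, 7, 8, 9]).mean("time")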

@markelg markelg changed the title CDO like convenience methods to select times CDO-like convenience methods to select times Sep 2, 2015
@clarkfitzg (Member) commented:

Just curious: how would we currently do this?

@shoyer (Member) commented Sep 2, 2015

Currently, the way to do this is to create a boolean indexer, with something like the following:

pr_dataset.sel(time=np.in1d(pr_dataset['time.year'], elnino_years))
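
For a self-contained illustration of this boolean-indexer approach (the toy dataset below is made up for the example):

import numpy as np
import pandas as pd
import xray  # now called xarray

times = pd.date_range("1950-01-01", "2000-12-01", freq="MS")
pr_dataset = xray.Dataset({"pr": ("time", np.random.rand(times.size))},
                          coords={"time": times})

elnino_years = [1958, 1973, 1983, 1987, 1998]
# 'time.year' exposes the year component of the time coordinate, and the
# resulting boolean mask selects the matching time steps.
composite = pr_dataset.sel(time=np.in1d(pr_dataset["time.year"], elnino_years))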

I agree that this is overly verbose and we can come up with something better. I'm not quite happy with selyear, though:

  • It's ambiguous which datetime variable the year refers to (there can be more than one time variable on some datasets).
  • I'm also not a big fan of adding a bunch of new API methods that are datetime-specific -- it creates a lot of noise (pandas has an issue with this).

Something like pr_dataset.sel(year=elnino_years) would be the ideal fix for this second concern (we've discussed this in another issue, can't remember which one now), but it's still ambiguous which time variable it refers to.

So, some other possible ways to spell this:

  1. pr_dataset.sel('time', year=elnino_years)
  2. pr_dataset.sel('time.year', elnino_years)
  3. pr_dataset.sel.time.year(elnino_years)

@jhamman (Member) commented Sep 2, 2015

I'd also like to see this come into being.

My two cents on the syntax is that your second idea is best:

pr_dataset.sel('time.year', elnino_years)

I don't care much for option 3 since the variable name is being used as an attribute (what if my time variable is called "Times"?).

@markelg (Contributor, Author) commented Sep 3, 2015

I agree with your arguments against creating too many datetime-specific methods. However, I am not convinced about using .sel with 'time.year'; it looks a bit hackish to me. Maybe all the "CDO methods" could be integrated into a single "seltimes" method. Please look at the function below. The time_coord argument lets you choose the time coordinate that you want to filter, and by using numpy.logical_and it is possible to do the whole operation in a single function call. It could be turned into a method. What do you think?

%matplotlib inline
import xray
import numpy as np
from matplotlib import pyplot as plt

ifile = "HadCRUT.4.3.0.0.median.nc"
ixd = xray.open_dataset(ifile)

You can download the example file from http://www.metoffice.gov.uk/hadobs/hadcrut4/data/4.3.0.0/gridded_fields/HadCRUT.4.3.0.0.median_netcdf.zip. Now comes the function.

def seltime(ixd, time_coord, **kwargs):
    """
    Select time steps by groups of years, months, etc.

    Parameters
    ----------
    ixd : xray.Dataset or xray.DataArray
        Input dataset. Will be "self" if the function is turned into a method.
    time_coord : str
        Name of the time coordinate to use. Needed to avoid ambiguities.
    **kwargs : scalar or iterable
        Currently year, month, day and hour are supported.

    Returns
    -------
    xray.Dataset or xray.DataArray filtered by the constraints defined
    by the kwargs.
    """
    ntimes = len(ixd.coords[time_coord])
    time_mask = np.repeat(True, ntimes)

    for time_unit, time_values in kwargs.items():
        if time_unit not in ("year", "month", "day", "hour"):
            raise KeyError(time_unit)
        time_str = "{}.{}".format(time_coord, time_unit)
        time_mask_temp = np.in1d(ixd[time_str], time_values)
        time_mask = np.logical_and(time_mask, time_mask_temp)
    oxd = ixd.sel(**{time_coord : time_mask})
    return oxd

It works, and it looks fast enough.

elnino_years = (1958, 1973, 1983, 1987, 1998)
months = (1, 2, 3, 4)
%timeit elnino_xd = seltime(ixd, "time", year=elnino_years, month=months)
elnino_xd.time
    1000 loops, best of 3: 1.05 ms per loop

    <xray.DataArray 'time' (time: 20)>
    array(['1958-01-16T12:00:00.000000000Z', '1958-02-15T00:00:00.000000000Z',
           '1958-03-16T12:00:00.000000000Z', '1958-04-16T00:00:00.000000000Z',
           '1973-01-16T13:00:00.000000000+0100',
           '1973-02-15T01:00:00.000000000+0100',
           '1973-03-16T13:00:00.000000000+0100',
           '1973-04-16T01:00:00.000000000+0100',
           '1983-01-16T13:00:00.000000000+0100',
           '1983-02-15T01:00:00.000000000+0100',
           '1983-03-16T13:00:00.000000000+0100',
           '1983-04-16T02:00:00.000000000+0200',
           '1987-01-16T13:00:00.000000000+0100',
           '1987-02-15T01:00:00.000000000+0100',
           '1987-03-16T13:00:00.000000000+0100',
           '1987-04-16T02:00:00.000000000+0200',
           '1998-01-16T13:00:00.000000000+0100',
           '1998-02-15T01:00:00.000000000+0100',
           '1998-03-16T13:00:00.000000000+0100',
           '1998-04-16T02:00:00.000000000+0200'], dtype='datetime64[ns]')
    Coordinates:
      * time     (time) datetime64[ns] 1958-01-16T12:00:00 1958-02-15 ...
    Attributes:
        start_month: 1
        start_year: 1850
        end_month: 4
        end_year: 2015
        long_name: time
        standard_name: time
        axis: T

If we average in time we can see the warm tongue in the Pacific.

plt.figure(figsize=(12, 5))
elnino_xd.temperature_anomaly.mean("time").plot()
plt.gca().invert_yaxis()

[Figure: time-mean temperature anomaly for the selected El Niño months, showing the warm tongue in the Pacific]

@rabernat (Contributor) commented Oct 4, 2018

Is there a way to do this easily now with vectorized indexing?

@shoyer (Member) commented Oct 5, 2018

I don't think there's an easy way to do this with vectorized indexing, but if we supported multidimensional indexing with boolean keys as proposed in #1887 (equivalent to where with drop=True), we could write pr_dataset[pr_dataset['time.year'].isin(elnino_years)]
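
A sketch of the where-based equivalent mentioned above, which should already work on recent versions (assuming a datetime64 time coordinate):

# Keep only the time steps whose year is in elnino_years; drop=True removes
# the masked-out steps instead of filling them with NaN.
pr_dataset.where(pr_dataset["time.year"].isin(elnino_years), drop=True)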

@markelg (Contributor, Author) commented Oct 5, 2018

I thought this issue was long forgotten ;) The fact is that today I still call this seltime function very often in my code, so I am glad to see it back. Thank you.

I like the .isin(elnino_years) syntax, and I see that it is consistent with pandas. Something like dataset.time.year.isin(elnino_years) would be very nice too; right now it is just a to_index() away, as this works: dataset.time.to_index().year.isin(elnino_years). However, I am not sure how this relates to multidimensional indexing, as time is only one dimension.

@shoyer (Member) commented Oct 5, 2018

dataset.time.dt.year.isin(elnino_years) should already work.
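
Putting the pieces together, the JJAS / El Niño composite from the original post could then be written as something like the following (a sketch, not verified against a specific xarray version):

# Combine a year mask and a month mask, then select along time and average.
mask = (pr_dataset.time.dt.year.isin(elnino_years)
        & pr_dataset.time.dt.month.isin([6, 7, 8, 9]))
jjas_elnino_mean = pr_dataset.sel(time=mask).mean("time")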

@stale (bot) commented Dec 25, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Dec 25, 2020
@dcherian dcherian added API design and removed stale labels Apr 18, 2022