Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF/CLN: Improve datetime-like index ops perf #10277

Merged
merged 1 commit into from
Dec 10, 2015

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Jun 4, 2015

Add _isnan to DatetimeIndexOpsMixin to cache NaT mask.

This leads to some perf improvement which is noticeable in larger data.

after fix

import pandas as pd
import numpy as np

np.random.seed(1)
idx = pd.DatetimeIndex(np.append(np.array([pd.tslib.iNaT]),
                       np.random.randint(500, 1000, size=100000000)))
%timeit idx.max()
1 loops, best of 3: 599 ms per loop

%timeit idx.min()
1 loops, best of 3: 608 ms per loop

before fix:

%timeit idx.max()
1 loops, best of 3: 940 ms per loop

%timeit idx.min()
1 loops, best of 3: 883 ms per loop

@jreback
Copy link
Contributor

jreback commented Jun 4, 2015

ohh nice!

do we have sufficient benches for this?

@sinhrks
Copy link
Member Author

sinhrks commented Jun 4, 2015

Updated top description. It is harder to notice in small size of Index. Shoud I add vbench for this?

@jreback
Copy link
Contributor

jreback commented Jun 4, 2015

yes I would add a vbench for this.

btw, I think its actually useful to use .hasnans to skip some of the masking operations

min_stamp = self[~self._isnan].asi8.min()
can be

if self._hasnans:
    self = self[~self._isnan]
min_stamp = self.asi8.min()

on a NO nan series that would be a big win (assume that this is not the first operation, e.g. it obviously needs caching).

here's another thing. I think you can actually do a lot of this not in the constructor but in certain operations
(this may require you to keep another attribute)

e.g.

if you construct from a date_range you KNOW that you don't have any NaT already.
On insert/append you can see if you have any nans (in the appended one and the existing), so you don't actually have to do the check.

So it can still be lazy but in some cases you can set it a-priori w/o actually doing the check.

@jreback jreback added Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Jun 4, 2015
@sinhrks
Copy link
Member Author

sinhrks commented Jun 5, 2015

@jreback Exactly! As a first step, added hasnans conditions before mask (_isnan)

I think you can actually do a lot of this not in the constructor but in certain operations
(this may require you to keep another attribute)

I assume it means adding new property to _attributes to show whether data has NaT or not. From my understanding, implementation should be:

  • Adding a flag which shows whether the data MAY contains NaT, like _maybe_hasnans.
  • Update the value to False when we can confirm the value DOESN'T contains NaT
  • When _maybe_hasnans is False, hasnans returns False without checking _isnan.
  • When _maybe_hasnans is True, hasnans checks _isnan and return the result, cache itself and change _maybe_hasnans.

@jreback
Copy link
Contributor

jreback commented Jun 5, 2015

@sinhrks yep, exactly. For instance, Int64Index, will always be False for this.

@jreback jreback added this to the 0.17.0 milestone Jun 5, 2015
@jreback
Copy link
Contributor

jreback commented Jul 28, 2015

status?

@jreback
Copy link
Contributor

jreback commented Aug 18, 2015

@sinhrks status?

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 26, 2015
@sinhrks
Copy link
Member Author

sinhrks commented Aug 31, 2015

I couldn't undertake to add an attribute which indicates `NaN`` possibility.

Because this is PR improve a perf somewhat, can I do it separately?

@jreback
Copy link
Contributor

jreback commented Oct 18, 2015

I think this needs to wait for Index.fillna yes?

@sinhrks
Copy link
Member Author

sinhrks commented Oct 19, 2015

Correct, will finish #11343 first.

@jreback
Copy link
Contributor

jreback commented Nov 20, 2015

@sinhrks I belive we can put this in now? move whatsnew note to 0.18.0

@sinhrks sinhrks force-pushed the dti_isnan branch 3 times, most recently from 66a378c to 7b7652f Compare November 24, 2015 20:29
@jreback
Copy link
Contributor

jreback commented Dec 6, 2015

@sinhrks status?

@sinhrks
Copy link
Member Author

sinhrks commented Dec 10, 2015

@jreback Rebased, and ready for review.

I misunderstood about fillna. It can't be used because the target is np.ndarray.

@jreback
Copy link
Contributor

jreback commented Dec 10, 2015

thanks!

jreback added a commit that referenced this pull request Dec 10, 2015
PERF/CLN: Improve datetime-like index ops perf
@jreback jreback merged commit eec0bc6 into pandas-dev:master Dec 10, 2015
@sinhrks sinhrks deleted the dti_isnan branch December 10, 2015 12:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants