PERF: Add cache keyword to to_datetime (#11665) #17077
Conversation
Let me address 1 and 3 first: IIUC, this caching should only speed up performance. The results you get with or without caching should not be different. The fact that you are getting different results indicates something is going weird with the caching. In light of that reasoning, I would expect that caching should default to
For 2, I think that's worthy enough as its own PR. I'm not sure I fully agree with the patch (i.e. why do we need another special-case). After that gets merged, you can rebase this PR onto that one.
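The invariant described above can be checked directly: with or without caching, the converted result must be identical. A minimal sketch, using the keyword's final name `cache` (the PR initially called it `cache_datetime`):

```python
import pandas as pd

# The cache is purely a performance optimization, so results must be
# identical with and without it.
arg = ["2016-01-01", "2016-01-01", "2016-01-02"]
with_cache = pd.to_datetime(arg, cache=True)
without_cache = pd.to_datetime(arg, cache=False)
```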
need asv!
bench the cartesian product of several dtypes (int with unit, str, str with format, dti) and sizes (1000, 100000)
and all dupes, uniques
I suspect under a certain size cache doesn't matter/help, need to see where that is.
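A benchmark along the lines requested could look like the sketch below, parametrized over size and dupes vs. uniques (class and method names follow asv conventions but are illustrative, not the benchmark that was eventually added):

```python
import pandas as pd

class ToDatetimeCache:
    # cartesian product of input size and all-unique vs. all-duplicate
    params = ([1000, 100000], [True, False])
    param_names = ["size", "unique"]

    def setup(self, size, unique):
        if unique:
            rng = pd.date_range("2000-01-01", periods=size, freq="s")
            self.strings = rng.strftime("%Y-%m-%d %H:%M:%S").tolist()
        else:
            self.strings = ["2000-01-01 00:00:00"] * size

    def time_to_datetime_cached(self, size, unique):
        pd.to_datetime(self.strings, cache=True)
```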
pandas/_libs/lib.pyx
Outdated
@@ -375,6 +375,23 @@ cpdef ndarray[object] list_to_object_array(list obj):

+@cython.wraparound(False)
+@cython.boundscheck(False)
+cpdef ndarray[object] tuple_to_object_array(tuple obj):
don't do this, you are duplicating lots of code. I suspect you are hitting:
In [2]: pd.unique(('foo'),)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-bd2fadf86f43> in <module>()
----> 1 pd.unique(('foo'),)
/Users/jreback/pandas/pandas/core/algorithms.py in unique(values)
348 """
349
--> 350 values = _ensure_arraylike(values)
351
352 # categorical is a fast-path
/Users/jreback/pandas/pandas/core/algorithms.py in _ensure_arraylike(values)
171 inferred = lib.infer_dtype(values)
172 if inferred in ['mixed', 'string', 'unicode']:
--> 173 values = lib.list_to_object_array(values)
174 else:
175 values = np.asarray(values)
TypeError: Argument 'obj' has incorrect type (expected list, got str)
if so, pls do a separate PR for this (it won't involve any cython code, just a simple change).
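The "simple change" suggested above would coerce a tuple to a list before the list-only fast path, rather than duplicating Cython code in a new `tuple_to_object_array`. A minimal sketch (the standalone function name is illustrative; in pandas this lives inside `_ensure_arraylike`):

```python
import numpy as np

def ensure_arraylike(values):
    # tuples hit the same string/mixed inference path as lists, so
    # coerce them rather than adding a tuple-specific Cython routine
    if isinstance(values, tuple):
        values = list(values)
    return np.asarray(values, dtype=object)
```

With this change, `ensure_arraylike(("foo", "bar"))` returns an object array instead of raising the `TypeError` shown in the traceback.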
pandas/core/algorithms.py
Outdated
@@ -170,7 +170,10 @@ def _ensure_arraylike(values):
                                 ABCIndexClass, ABCSeries)):
         inferred = lib.infer_dtype(values)
         if inferred in ['mixed', 'string', 'unicode']:
             values = lib.list_to_object_array(values)
see above
pandas/core/tools/datetimes.py
Outdated
@@ -183,7 +184,8 @@ def _guess_datetime_format_for_array(arr, **kwargs):

 def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,
                 utc=None, box=True, format=None, exact=True,
-                unit=None, infer_datetime_format=False, origin='unix'):
+                unit=None, infer_datetime_format=False, origin='unix',
+                cache_datetime=False):
use_cache=True, change name and default to True
Why not call this cache=True?
pandas/core/tools/datetimes.py
Outdated
@@ -257,6 +259,10 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,

     .. versionadded: 0.20.0

+    cache_datetime : boolean, default False
+        If True, use a cache of unique, converted dates to apply the datetime
+        conversion. Produces significant speed-ups when parsing duplicate dates
see above, add versionadded tag
pandas/core/tools/datetimes.py
Outdated
@@ -340,6 +346,19 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,

     tz = 'utc' if utc else None

+    cache = None
I think this can be moved inside _convert_list_like (maybe).
pandas/core/tools/datetimes.py
Outdated
@@ -340,6 +346,19 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,

     tz = 'utc' if utc else None

+    cache = None
+    if (cache_datetime and is_list_like(arg) and
+            not isinstance(arg, DatetimeIndex)):
when you write an asv (see below), this needs a minimum length check on arg (maybe 1000)
pandas/core/tools/datetimes.py
Outdated
+        # No need to convert with a cache if the arg is already a DatetimeIndex
+        unique_dates = pd.unique(arg)
+        if len(unique_dates) != len(arg):
+            cache = {d: pd.to_datetime(d, errors=errors, dayfirst=dayfirst,
this is very inefficient, you are converting element by element. Simply:
Series(to_datetime(unique_dates.....), index=unique_dates)
which also avoids iterating over things
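The vectorized approach suggested here can be sketched as follows: convert the unique values once with a single `to_datetime` call, then map the original values through the resulting Series, instead of calling `to_datetime` per element:

```python
import pandas as pd

arg = pd.Series(["2016-01-01", "2016-01-02"] * 500)

unique_dates = pd.unique(arg)  # ndarray of the distinct input strings
if len(unique_dates) != len(arg):
    # one vectorized conversion; the Series doubles as the lookup table
    cache = pd.Series(pd.to_datetime(unique_dates), index=unique_dates)
    result = arg.map(cache)
else:
    # no duplicates, caching cannot help
    result = pd.to_datetime(arg)
```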
pandas/core/tools/datetimes.py
Outdated
             result = arg.map(cache)
         else:
             values = _convert_listlike(arg._values, False, format)
             result = pd.Series(values, index=arg.index, name=arg.name)
leave the imports alone
pandas/core/tools/datetimes.py
Outdated
@@ -1,5 +1,6 @@
 from datetime import datetime, timedelta, time
 import numpy as np
+import pandas as pd
use algorithms.unique
below
@@ -306,6 +306,45 @@ def test_to_datetime_tz_psycopg2(self):
                                     dtype='datetime64[ns, UTC]')
         tm.assert_index_equal(result, expected)

+    @pytest.mark.parametrize("box", [True, False])
this test is ok, but rather I would like to see a parametrize on pretty much every other test function in this file, testing both use_cache=True/False
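The requested parametrization could look like the sketch below: an existing-style test run under both cache settings, checking the results agree (the test body is illustrative, not one of the tests in the file):

```python
import pandas as pd
import pytest

@pytest.mark.parametrize("cache", [True, False])
def test_to_datetime_with_cache(cache):
    # duplicate-heavy input exercises the cached path when cache=True
    arg = ["20130101"] * 3 + ["20130102"] * 3
    result = pd.to_datetime(arg, cache=cache)
    expected = pd.DatetimeIndex(["2013-01-01"] * 3 + ["2013-01-02"] * 3)
    pd.testing.assert_index_equal(result, expected)
```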
@gfyoung In regards to my 3rd point, I agree that the results should be the same whether
This discussion may be more appropriate in a separate issue if it is one. I may not hit this though once I refactor this implementation.
I think it should.
there is another issue about utc=True so it's out of scope for this PR
can you rebase / update
Force-pushed from 130406e to 72e99da
Hello @mroeschke! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on November 11, 2017 at 18:39 UTC
Here are the results from
Additionally, I edited almost all the tests in
Codecov Report
@@ Coverage Diff @@
## master #17077 +/- ##
==========================================
- Coverage 91.25% 91.22% -0.04%
==========================================
Files 163 163
Lines 49810 49829 +19
==========================================
- Hits 45456 45455 -1
- Misses 4354 4374 +20
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #17077 +/- ##
==========================================
- Coverage 91.42% 91.39% -0.04%
==========================================
Files 163 163
Lines 50064 50091 +27
==========================================
+ Hits 45773 45779 +6
- Misses 4291 4312 +21
Continue to review full report at Codecov.
pandas/core/tools/datetimes.py
Outdated
    .. versionadded: 0.20.2
0.21.0
pandas/core/tools/datetimes.py
Outdated
+        if len(unique_dates) != len(arg):
+            from pandas import Series
+            cache_dates = _convert_listlike(unique_dates, False, format)
+            convert_cache = Series(cache_dates, index=unique_dates)
so it's better to actually make convert_cache a function, which can then take the unique_dates and return the converted data; this avoids lots of code duplication.
pandas/core/indexes/datetimes.py
Outdated
@@ -334,7 +334,7 @@ def __new__(cls, data=None,
             if not (is_datetime64_dtype(data) or is_datetimetz(data) or
                     is_integer_dtype(data)):
                 data = tools.to_datetime(data, dayfirst=dayfirst,
-                                         yearfirst=yearfirst)
+                                         yearfirst=yearfirst, cache=False)
reason for this change?
This is in order to prevent a RuntimeError due to recursion. The cache is built using _convert_listlike, which can return a DatetimeIndex; the DatetimeIndex constructor can then call to_datetime, which goes back to _convert_listlike...
Force-pushed from 0d5ac41 to 68c7e9f
Tests are passing now and here are the latest asv results:
@@ -111,7 +112,11 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,
         origin.

         .. versionadded: 0.20.0

+    cache : boolean, default False
default is True
pandas/core/tools/datetimes.py
Outdated
@@ -305,6 +310,29 @@ def _convert_listlike(arg, box, format, name=None, tz=tz):
         except (ValueError, TypeError):
             raise e

+    def _maybe_convert_cache(arg, cache, tz):
use a proper doc-string here
pandas/core/tools/datetimes.py
Outdated
        result = _maybe_convert_cache(arg, cache, tz)
        if result is None:
            result = _convert_listlike(arg, box, format, name=arg.name)
        else:
why can't you handle these cases (list-like/index) inside _maybe_convert_cache? (I am talking about the else/box part.)
what is causing the 2x slowdown on some of the existing tests?
It looks like the iso8601 path is so fast that the caching always hurts. I think that makes sense: that conversion shouldn't be much more expensive than hashing, but I'm not sure how to handle it.
The 2x slowdown occurred with iso8601 string dates without any duplicates. I've attached the profile below of the benchmark with the largest slowdown. It looks like it's expensive to put the strings (objects) through
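The trade-off being profiled can be illustrated roughly with `timeit` (timings are machine-dependent; this only shows the shape of the comparison, not the PR's asv numbers):

```python
import timeit
import pandas as pd

# caching wins on duplicate-heavy input; all-unique ISO 8601 strings can
# be slower with the cache, since the fast parse path costs little more
# than hashing the values
dupes = ["2016-01-01 01:02:03"] * 5000
t_cached = timeit.timeit(lambda: pd.to_datetime(dupes, cache=True), number=3)
t_plain = timeit.timeit(lambda: pd.to_datetime(dupes, cache=False), number=3)
```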
Given the potential slowdown, can't we add it as an optional keyword? (so False by default)
thanks @mroeschke nice PR! look forward to
pls open an issue for this.
git diff upstream/master -u -- "*.py" | flake8 --diff
Added a cache keyword to to_datetime to speed up parsing of duplicate dates.

Some notes:
1. I defaulted cache=False, i.e. don't use a cache to parse the dates. Should the default be True?
2. I used pd.unique() to identify unique dates, and the current implementation did not accept a tuple of strings (objects). I added tuple_to_object_array and patched _ensure_arraylike to fix this.
3. There is currently an included test that fails due to a case when using to_datetime(..., utc=True) with a Series. I am inclined to believe In[5] should have been the existing behavior. Thoughts?
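For reference, the keyword as merged is used like this (shown with the final name `cache`; the description above still uses the original default of False):

```python
import pandas as pd

# duplicate-heavy input benefits from the cache, and the converted
# result is identical either way
dates = ["3/11/2000", "3/12/2000", "3/13/2000"] * 1000
cached = pd.to_datetime(dates, cache=True)
uncached = pd.to_datetime(dates, cache=False)
```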