to_datetime 1000x slower for timezone-aware strings vs timezone-agnostic #9714

Closed
wetchler opened this issue Mar 24, 2015 · 3 comments
Labels: Enhancement, Performance, Timezones

@wetchler

When converting a string date column to datetime, if the string has a UTC offset suffix (e.g. "-0800"), parsing takes roughly 1000x longer:

dates = pd.Series(pd.date_range('1/1/2000', periods=2000))
string_dates = dates.apply(lambda s: str(s))
tz_string_dates = string_dates.apply(lambda dt: dt + ' -0800')

%timeit pd.to_datetime(string_dates)
> 1000 loops, best of 3: 579 µs per loop
%timeit pd.to_datetime(tz_string_dates)
> 1 loops, best of 3: 562 ms per loop

Note microseconds vs milliseconds. 3 orders of magnitude... seems unnecessary. This can make loading CSVs into correctly-typed dataframes very, very, very slow for large datasets.
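(For context, a minimal sketch of the CSV case this refers to; the column names and values here are made up for illustration:)

import io
import pandas as pd

# A tiny CSV whose timestamp column carries a UTC offset; parse_dates sends it
# through the same slow parsing path as to_datetime above.
csv = io.StringIO(
    "ts,value\n"
    "2000-01-01 00:00:00 -0800,1\n"
    "2000-01-02 00:00:00 -0800,2\n"
)
df = pd.read_csv(csv, parse_dates=['ts'])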

@shoyer
Member

shoyer commented Mar 24, 2015

Indeed, this is a known issue: pandas does not have a timezone-aware Block (the internal data structure we use for holding data), so we create dtype=object arrays of datetime objects instead. If I recall correctly, this is on @jreback's to-do list.
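(To make that concrete, a minimal sketch comparing the dtypes of the two results, assuming the behavior described above:)

import pandas as pd

dates = pd.Series(pd.date_range('1/1/2000', periods=5))
naive = dates.apply(str)      # '2000-01-01 00:00:00'
aware = naive + ' -0800'      # '2000-01-01 00:00:00 -0800'

# Naive strings parse to a datetime64[ns] column; tz-aware strings (at the
# time of this issue) fell back to a dtype=object column of datetime objects.
print(pd.to_datetime(naive).dtype)
print(pd.to_datetime(aware).dtype)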

@jreback
Contributor

jreback commented Mar 24, 2015

This is actually a different issue.

no TZ specified

In [11]: %timeit pd.to_datetime(string_dates)
1000 loops, best of 3: 435 us per loop

What you gave falls back to dateutil parsing, which is why it's so slow: it goes back and forth between Cython and Python.

In [12]: tz_string_dates = string_dates.apply(lambda dt: dt + ' -0800')

In [13]: %timeit pd.to_datetime(tz_string_dates)
1 loops, best of 3: 194 ms per loop

Try this:

In [15]: tz_string_dates = string_dates.apply(lambda dt: dt + '-0800')

In [16]: %timeit pd.to_datetime(tz_string_dates)
10 loops, best of 3: 23 ms per loop

In [17]: tz_string_dates[0]
Out[17]: '2000-01-01 00:00:00-0800'

The space before the fixed TZ designation throws this off. It's actually an easy fix if you want to take a look (see src/np_datetime_string.c). That said, I'm not 100% sure whether the space makes this non-ISO (but I agree it should probably be parsable in C).
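(For reference, a rough sketch of the workaround this implies: strip the space before the offset up front so parsing stays on the fast C path. The regex here is an illustration, not something from this thread.)

import re
import pandas as pd

dates = pd.Series(pd.date_range('1/1/2000', periods=2000))
tz_string_dates = dates.apply(str) + ' -0800'

# Drop the single space before a trailing +HHMM/-HHMM offset:
# '2000-01-01 00:00:00 -0800' -> '2000-01-01 00:00:00-0800'
normalized = tz_string_dates.map(lambda s: re.sub(r' (?=[+-]\d{4}$)', '', s))
parsed = pd.to_datetime(normalized)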

Furthermore, it is still slower than non-tz strings because the TZ has to be interpreted for each string. That could actually be cached, so that is a second possible speedup.
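(A hypothetical illustration of that caching idea, not pandas' actual implementation: interpret each distinct offset string once and reuse the resulting tzinfo.)

from functools import lru_cache
from datetime import timezone, timedelta

@lru_cache(maxsize=None)
def offset_to_tzinfo(offset_str):
    # e.g. '-0800' -> a fixed UTC-08:00 tzinfo, computed once per distinct offset
    sign = -1 if offset_str.startswith('-') else 1
    hours, minutes = int(offset_str[1:3]), int(offset_str[3:5])
    return timezone(sign * timedelta(hours=hours, minutes=minutes))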

@jreback added the Enhancement, Performance, and Timezones labels on Mar 24, 2015
@jreback added this to the Next Major Release milestone on Mar 24, 2015
@wetchler
Author

Interesting -- thanks for the tips. Removing the space works, though for now I'll just have to be vigilant about what format CSVs are automatically dumped in (in my case I believe the dataset is from a MySQL dump). Cheers.
