to_datetime 1000x slower for timezone-aware strings vs timezone-agnostic #9714

Closed
wetchler opened this issue Mar 24, 2015 · 3 comments
Labels: Enhancement, Performance, Timezones

@wetchler

When converting a string date column to datetime, if the string has a UTC offset suffix (e.g. "-0800"), parsing takes roughly 1000x longer:

dates = pd.Series(pd.date_range('1/1/2000', periods=2000))
string_dates = dates.apply(lambda s: str(s))
tz_string_dates = string_dates.apply(lambda dt: dt + ' -0800')

%timeit pd.to_datetime(string_dates)
> 1000 loops, best of 3: 579 µs per loop
%timeit pd.to_datetime(tz_string_dates)
> 1 loops, best of 3: 562 ms per loop

Note microseconds vs milliseconds. 3 orders of magnitude... seems unnecessary. This can make loading CSVs into correctly-typed dataframes very, very, very slow for large datasets.
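(For context, a minimal sketch of the CSV case this refers to; the column names and values here are made up for illustration:)

import io
import pandas as pd

# A tiny CSV whose timestamp column carries a UTC offset; parse_dates sends it
# through the same slow parsing path as to_datetime above.
csv = io.StringIO(
    "ts,value\n"
    "2000-01-01 00:00:00 -0800,1\n"
    "2000-01-02 00:00:00 -0800,2\n"
)
df = pd.read_csv(csv, parse_dates=['ts'])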

@shoyer
Member

shoyer commented Mar 24, 2015

Indeed, this is a known issue: pandas does not have a timezone-aware Block (the internal data structure we use for holding data), so we create dtype=object arrays of datetime objects instead. If I recall correctly, this is on @jreback's to-do list.
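(To make that concrete, a minimal sketch comparing the dtypes of the two results, assuming the behavior described above:)

import pandas as pd

dates = pd.Series(pd.date_range('1/1/2000', periods=5))
naive = dates.apply(str)      # '2000-01-01 00:00:00'
aware = naive + ' -0800'      # '2000-01-01 00:00:00 -0800'

# Naive strings parse to a datetime64[ns] column; tz-aware strings (at the
# time of this issue) fell back to a dtype=object column of datetime objects.
print(pd.to_datetime(naive).dtype)
print(pd.to_datetime(aware).dtype)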

@jreback
Contributor

jreback commented Mar 24, 2015

This is actually a different issue.

no TZ specified

In [11]: %timeit pd.to_datetime(string_dates)
1000 loops, best of 3: 435 us per loop

What you gave falls back to dateutil parsing, which is why it's so slow: it goes back and forth between Cython and Python.

In [12]: tz_string_dates = string_dates.apply(lambda dt: dt + ' -0800')

In [13]: %timeit pd.to_datetime(tz_string_dates)
1 loops, best of 3: 194 ms per loop

Try this:

In [15]: tz_string_dates = string_dates.apply(lambda dt: dt + '-0800')

In [16]: %timeit pd.to_datetime(tz_string_dates)
10 loops, best of 3: 23 ms per loop

In [17]: tz_string_dates[0]
Out[17]: '2000-01-01 00:00:00-0800'

The space before the fixed TZ designation throws this off. It's actually an easy fix if you want to take a look (see src/np_datetime_string.c). That said, I'm not 100% sure whether the space makes this non-ISO (but I agree it should probably be parsable in C).
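(For reference, a rough sketch of the workaround this implies: strip the space before the offset up front so parsing stays on the fast C path. The regex here is an illustration, not something from this thread.)

import re
import pandas as pd

dates = pd.Series(pd.date_range('1/1/2000', periods=2000))
tz_string_dates = dates.apply(str) + ' -0800'

# Drop the single space before a trailing +HHMM/-HHMM offset:
# '2000-01-01 00:00:00 -0800' -> '2000-01-01 00:00:00-0800'
normalized = tz_string_dates.map(lambda s: re.sub(r' (?=[+-]\d{4}$)', '', s))
parsed = pd.to_datetime(normalized)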

Furthermore, it is still slower than non-tz strings because the TZ has to be interpreted for each string. That could actually be cached, so that is a second possible speedup.
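(A hypothetical illustration of that caching idea, not pandas' actual implementation: interpret each distinct offset string once and reuse the resulting tzinfo.)

from functools import lru_cache
from datetime import timezone, timedelta

@lru_cache(maxsize=None)
def offset_to_tzinfo(offset_str):
    # e.g. '-0800' -> a fixed UTC-08:00 tzinfo, computed once per distinct offset
    sign = -1 if offset_str.startswith('-') else 1
    hours, minutes = int(offset_str[1:3]), int(offset_str[3:5])
    return timezone(sign * timedelta(hours=hours, minutes=minutes))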

@jreback added the Enhancement, Performance, and Timezones labels on Mar 24, 2015
@jreback added this to the Next Major Release milestone on Mar 24, 2015
@wetchler
Author

Interesting -- thanks for the tips. Removing the space works, though for now I'll just have to be vigilant about what format CSVs are automatically dumped in (in my case I believe the dataset is from a MySQL dump). Cheers.
