When converting a string date column to datetime, if the string has a GMT timezone suffix (e.g. "-0800"), it takes 1000x longer to parse:
dates = pd.Series(pd.date_range('1/1/2000', periods=2000))
string_dates = dates.apply(lambda s: str(s))
tz_string_dates = string_dates.apply(lambda dt: dt + ' -0800')
%timeit pd.to_datetime(string_dates)
> 1000 loops, best of 3: 579 µs per loop
%timeit pd.to_datetime(tz_string_dates)
> 1 loops, best of 3: 562 ms per loop
Note microseconds vs milliseconds. 3 orders of magnitude... seems unnecessary. This can make loading CSVs into correctly-typed dataframes very, very, very slow for large datasets.
Indeed, this is a known issue: pandas does not have a timezone aware Block (the internal data structure we use for holding data). Thus, we create dtype=object arrays of datetime objects. If I recall correctly, this is on @jreback's to-do list.
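For illustration, a minimal sketch of that dtype=object fallback, reusing the construction from the report above (the object dtype is what the pandas versions in this thread produce; recent releases return a tz-aware dtype for the offset case instead):

import pandas as pd

dates = pd.Series(pd.date_range('1/1/2000', periods=5))
string_dates = dates.apply(str)

# Plain ISO strings are parsed on the fast C path into a datetime64[ns] block.
print(pd.to_datetime(string_dates).dtype)

# Strings carrying an offset have no native block to land in, so the result
# here is a dtype=object array of timezone-aware datetime objects.
print(pd.to_datetime(string_dates + ' -0800').dtype)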
In [11]: %timeit pd.to_datetime(string_dates)
1000 loops, best of 3: 435 us per loop
What you gave falls back to dateutil parsing, which is why it's so slow: it goes back and forth between Cython and Python for every element.
In [12]: tz_string_dates = string_dates.apply(lambda dt: dt + ' -0800')
In [13]: %timeit pd.to_datetime(tz_string_dates)
1 loops, best of 3: 194 ms per loop
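For context, that slow path is roughly equivalent to calling dateutil once per element at the Python level (a sketch using the tz_string_dates Series from above; the real fallback goes through pandas' internal parsing, so this is only an approximation):

from dateutil import parser

# Roughly the work the fallback does: one Python-level dateutil call per
# string, crossing the Cython/Python boundary for every element.
parsed = tz_string_dates.apply(parser.parse)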
Try this
In [15]: tz_string_dates = string_dates.apply(lambda dt: dt + '-0800')
In [16]: %timeit pd.to_datetime(tz_string_dates)
10 loops, best of 3: 23 ms per loop
In [17]: tz_string_dates[0]
Out[17]: '2000-01-01 00:00:00-0800'
The space before the fixed TZ designation throws this off. It's actually an easy fix if you want to look (see src/np_datetime_string.c). That said, I am not 100% sure whether the space makes this non-ISO (but I agree it should probably be parsable in C).
Further, it is slower than non-tz strings because the TZ has to be interpreted for each string. This could actually be cached, which would be a second avenue for speedups.
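On newer pandas versions (well after this thread), another way to avoid the dateutil fallback without touching the strings is to pass an explicit format so the offset, including the space before it, is handled by the vectorised strptime path (a sketch using the tz_string_dates Series from above):

# Explicit format with a literal space before the %z offset; this avoids the
# element-by-element dateutil fallback on pandas versions whose strptime
# path understands %z.
parsed = pd.to_datetime(tz_string_dates, format='%Y-%m-%d %H:%M:%S %z')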
Interesting -- thanks for the tips. Removing the space works, though for now I'll have to stay vigilant about what format CSVs get automatically dumped in (in my case I believe the dataset comes from a MySQL dump). Cheers.
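For the CSV-loading case, a sketch of the space-stripping workaround at read time (the file name 'dump.csv' and the column name 'ts' are hypothetical, and the regex assumes a trailing fixed ±HHMM offset):

import pandas as pd

# Read the timestamp column as plain strings, drop the space before the
# offset, then convert in one vectorised pass.
df = pd.read_csv('dump.csv', dtype={'ts': str})
df['ts'] = pd.to_datetime(df['ts'].str.replace(r' ([+-]\d{4})$', r'\1', regex=True))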