DEPR deprecate mixed timezone offsets with utc=False? #50887
Comments
I remember addressing this years ago with the goal "maintain the maximum information possible" in datetime parsing, even though you get an |
i think of this as like a more performant |
Sure, but then wouldn't you typically use In [73]: ts = ['2021-03-27T23:59:59+01:00', '2021-03-28T23:59:59+02:00']
In [74]: pd.to_datetime(ts, utc=True).tz_convert('Europe/Vienna')
Out[74]: DatetimeIndex(['2021-03-27 23:59:59+01:00', '2021-03-28 23:59:59+02:00'], dtype='datetime64[ns, Europe/Vienna]', freq=None)
Sure, but why would you want that anyway if you end up with an EDIT: things to address:
|
To follow up from yesterday's discussion, here's an example that was mentioned: what if users just want to keep the local datetime? That would still be faster by calling In [49]: ts = pd.Series(pd.date_range('1900', '2000').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))
In [50]: %%timeit
...: res = ts.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z').replace(tzinfo=None))
...:
...:
232 ms ± 458 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [51]: %%timeit
...: res = to_datetime(ts).apply(lambda x: x.replace(tzinfo=None))
...:
...:
1.87 s ± 41.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (OK wow, that's a much bigger difference than I was expecting when I started putting together this example...) Furthermore, allowing for multiple timezones slows down the general case, see #50107. If
The issue with returning an cc @jorisvandenbossche and @axil as you also took part in the discussion |
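For anyone unsure what "keep the local datetime" means in the benchmark above, here is a minimal stdlib-only sketch. It is not pandas code, and the two sample strings are made up for illustration. Dropping `tzinfo` keeps the wall-clock reading and discards the offset, while converting to UTC first keeps the instant.

```python
import datetime as dt

FMT = "%Y-%m-%dT%H:%M:%S%z"
# Illustrative inputs (assumed, not from the thread's benchmark data):
# the same zone on either side of a DST change.
strings = ["2021-03-27T23:59:59+0100", "2021-03-28T23:59:59+0200"]

parsed = [dt.datetime.strptime(s, FMT) for s in strings]

# Keep the local wall-clock time and throw the offset away.
local_naive = [p.replace(tzinfo=None) for p in parsed]

# Keep the instant instead: convert to UTC, then drop tzinfo.
utc_naive = [
    p.astimezone(dt.timezone.utc).replace(tzinfo=None) for p in parsed
]

for s, loc, utc in zip(strings, local_naive, utc_naive):
    print(s, "->", loc, "(local) /", utc, "(UTC)")
```

The first form is what the `replace(tzinfo=None)` calls in the timings produce; once the offset is dropped, the original instant can no longer be recovered.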
@MarcoGorelli if I run your example, for me the second variant (using to_datetime) is actually faster:
|
🤔 how odd...no idea what that could be down to, any ideas? |
I tried on |
My timeit results on main roughly match Joris's. |
thanks - tried on main, and confirm it reproduces. I haven't pinned down the commit, but I guess some of the I'll keep exploring - there may indeed be a way to keep this around without slowing down the more standard cases |
Here are some more timings, with the latest RC (2.0.0rc1) in a fresh virtual environment:
In [1]: ts = pd.Series(pd.date_range('1900', '2000').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))
In [2]: import datetime as dt
Parsing with mixed offsets to
|
In the apply case what if the user has mixed format, mixed timezone offset data?
I have personally encountered dates with mixed UTC offsets because the data was collected at different locations with the UTC timestamps corresponding to the "timezone", so DST wasn't necessarily involved. |
In that case you'll end up with |
Probably yeah |
here's an example of this causing issues / unexpected behaviour: #43797 I'll open a tiny PDEP, expand on this a bit more, and call a vote then, gotta either move forwards or close at some point |
I’m coming around to being more enthusiastic about this |
thanks - I'll try to put together a POC PR showing what the implication would be, hopefully we can find agreement without having to call a vote for something this minor |
Here's a POC of how this could work once deprecated: https://github.com/MarcoGorelli/pandas/pull/7/files I've tried timing ts = pd.Series(pd.date_range('1900', '2000', freq='1h').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))
res = to_datetime(ts, utc=True) and I'm seeing:
But aside from the performance benefits, my main point is about correctness / usefulness / nudging users towards what they probably really want, which is a timezone-aware DatetimeIndex rather than an object Index. Finally, what we currently have is value-dependent behaviour: In [33]: to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00'])
Out[33]: Index([2020-01-01 00:00:00+01:00, 2020-01-02 00:00:00+02:00], dtype='object')
In [34]: to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+01:00'])
Out[34]: DatetimeIndex(['2020-01-01 00:00:00+01:00', '2020-01-02 00:00:00+01:00'], dtype='datetime64[ns, UTC+01:00]', freq=None) which isn't great. I'll go ahead and open a PR; if there are strong objections we can discuss there |
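The two calls in that session differ only in the offset of the second element. A minimal stdlib sketch of the condition that separates them, making no claim about how pandas checks it internally (`distinct_offsets` is a hypothetical helper, and seconds were added to the strings so `fromisoformat` accepts them on older Python versions):

```python
import datetime as dt


def distinct_offsets(strings):
    """Return the set of UTC offsets found in ISO 8601 strings."""
    return {dt.datetime.fromisoformat(s).utcoffset() for s in strings}


mixed = ["2020-01-01T00:00:00+01:00", "2020-01-02T00:00:00+02:00"]
uniform = ["2020-01-01T00:00:00+01:00", "2020-01-02T00:00:00+01:00"]

print(len(distinct_offsets(mixed)))    # 2
print(len(distinct_offsets(uniform)))  # 1
```

In the session above, one distinct offset gives a tz-aware `DatetimeIndex`, and more than one gives an object `Index`, so the return type depends on the values rather than on the call.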
If it's about nudging users towards the better behaviour, we could also consider only changing the default behaviour, while leaving the option to those users that have a use case for it. But that makes me wonder: what would be the behaviour after deprecation in case of mixed timezones? Raise an error that you have mixed timezones, which can't be parsed, unless you set utc=True? Or convert to UTC automatically? |
We probably also need to consider the impact on reading CSV files if you have a column with mixed timezones (and if deprecating it in to_datetime, there might need to be an equivalent deprecation in read_csv). |
read_csv keeps a column as yes, I'm suggesting to raise if you have mixed timezones and What would you suggest changing the default to? If you're suggesting In [24]: to_datetime(['2020-01-01T00:00'], utc=True)
Out[24]: DatetimeIndex(['2020-01-01 00:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None) The other issue is complexity: there's a fair bit of it to deal with the mixed-timezone case. If someone really wants mixed timezones in an object Series, then In [10]: %%timeit
...: ts.apply(lambda x: Timestamp(x))
...:
...:
3.15 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) isn't impossibly slow, and they'll have to then use |
@MarcoGorelli I took a quick look at a profile of the snippet above with and without your patch. And my understanding is that the difference is almost entirely due to |
And to be clear, I don't feel very strongly about this specific feature. I have never used it myself, but can see potential use cases, but it's also true that the workaround ( |
Currently it returns an object column of Timestamps I think, while after a deprecation it would either raise (not ideal for read_csv?) or return a column of strings (so at least some change in behaviour?)
Yeah, we of course don't want to change that if there are only naive timestamps to start with. A potential default could rather be something like "if we detect multiple timezones, return UTC values". |
Thanks for looking into it! I'd rather not introduce extra complexity to Note that mixed-timezone timestamps already raise >>> to_datetime([Timestamp('2020-01-01 00:00').tz_localize('Europe/Brussels'), Timestamp('2020-01-02 00:00').tz_localize('Asia/Kathmandu')])
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True, at position 1 so we'd just be making the string case consistent with that
You're right, thanks - so yes, this would be a behaviour change, as it would no longer parse that column, it would leave it as-is |
As another reference, pyarrow's read_csv will by default return UTC whenever you have timezone offsets (so for mixed offsets, but also if you have a constant offset). |
thanks for the reference - looks like they do that for In [1]: import pyarrow as pa
In [2]: import pyarrow.compute as pc
In [3]: dates = ['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00']
In [4]: table = pa.table({'arr': pa.array(dates, 'string')})
In [5]: pc.strptime(table.column('arr'), format='%Y-%m-%dT%H:%M%z', unit='us').type
Out[5]: TimestampType(timestamp[us])
In [6]: pc.strptime(table.column('arr'), format='%Y-%m-%dT%H:%M%z', unit='us')
Out[6]:
<pyarrow.lib.ChunkedArray object at 0x7f78f59e5ad0>
[
[
2019-12-31 23:00:00.000000,
2020-01-01 22:00:00.000000
]
] |
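As a cross-check on the pyarrow output above, the same two strings can be converted by hand with the standard library. This is only an arithmetic sketch with the same format string; it says nothing about how pyarrow is implemented.

```python
import datetime as dt

FMT = "%Y-%m-%dT%H:%M%z"
dates = ["2020-01-01T00:00+01:00", "2020-01-02T00:00+02:00"]

# Parse each string with its own offset, shift to UTC, then drop tzinfo
# to get a naive timestamp comparable to the values printed above.
utc_naive = [
    dt.datetime.strptime(s, FMT)
    .astimezone(dt.timezone.utc)
    .replace(tzinfo=None)
    for s in dates
]

for value in utc_naive:
    print(value)
# 2019-12-31 23:00:00
# 2020-01-01 22:00:00
```

The values match the `ChunkedArray` contents, which suggests the printed timestamps are UTC instants even though the reported type carries no timezone.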
Just to illustrate how inefficient the current implementation was for
diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py
index 74210a1ce5..778081c32c 100644
--- a/pandas/core/tools/datetimes.py
+++ b/pandas/core/tools/datetimes.py
@@ -333,17 +333,22 @@ def _return_parsed_timezone_results(
     -------
     tz_result : Index-like of parsed dates with timezone
     """
-    tz_results = np.empty(len(result), dtype=object)
+    if utc:
+        tz_results = np.empty(len(result), dtype="datetime64[ns]")
+    else:
+        tz_results = np.empty(len(result), dtype=object)
     for zone in unique(timezones):
         mask = timezones == zone
         dta = DatetimeArray(result[mask]).tz_localize(zone)
         if utc:
             if dta.tzinfo is None:
-                dta = dta.tz_localize("utc")
+                dta = dta.tz_localize("utc")._ndarray
             else:
-                dta = dta.tz_convert("utc")
+                dta = dta.tz_convert("utc")._ndarray
         tz_results[mask] = dta
+    if utc:
+        tz_results = DatetimeArray(tz_results).tz_localize("utc")
     return Index(tz_results, name=name)
and with that the timing of the benchmark from #50887 (comment) for me goes from 2.2s to 800ms.
I liked that workaround, and was thinking that we can mention this in the deprecation (or future error) message. However, one problem with it is that this doesn't give you a way to specify a format (or one of the other keywords of to_datetime). And the more general
There actually seems to be a bug there, in that it returns a plain Timestamp type without tz, while the input has timezones. I would expect it to at least return a tz=UTC timestamp type, just like their read_csv does. |
wow, nice! so pyarrow parses as that's beautiful, could we do that in pandas and get rid of the @jbrockmendel @mroeschke do you have thoughts on mirroring pyarrow here? |
I guess I won't die on the mixed tz offset case hill and would be okay with returning UTC. A It would be nice to remove the utc kwarg from |
I'd stay consistent with pyarrow here and just convert to utc anyway - simple and predictable |
From #53250, it seems that we can't really convert everything to UTC. So, going back to the initial suggestion of just disallowing mixed timezones, will see if I can elaborate
outcome of today's call: people are generally in favour of raising in such a case, and advising users to pass |
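To make the "raise on mixed offsets" outcome concrete, here is a rough stdlib sketch of such a policy. It is an assumption-laden illustration, not the change that was merged into pandas: `parse_offsets` and its error message are hypothetical, and every input string is assumed to carry an explicit offset.

```python
import datetime as dt


def parse_offsets(strings, utc=False):
    """Parse ISO 8601 strings; refuse mixed offsets unless utc=True."""
    parsed = [dt.datetime.fromisoformat(s) for s in strings]
    if utc:
        # Normalise every value to UTC, so a single timezone remains.
        return [p.astimezone(dt.timezone.utc) for p in parsed]
    if len({p.utcoffset() for p in parsed}) > 1:
        raise ValueError(
            "mixed UTC offsets found; pass utc=True to convert to UTC"
        )
    return parsed


mixed = ["2020-01-01T00:00:00+01:00", "2020-01-02T00:00:00+02:00"]
uniform = ["2020-01-01T00:00:00+01:00", "2020-01-02T00:00:00+01:00"]

print(parse_offsets(uniform))
print(parse_offsets(mixed, utc=True))
try:
    parse_offsets(mixed)
except ValueError as err:
    print("raised:", err)
```

With a policy of this shape the return type no longer depends on the values: the caller gets a single-timezone result or an error that points to the opt-in.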
I would like to work on this. |
I guess this can be closed since #54014 was merged. |
Yup thanks! |
Currently, there are two possible behaviours for mixed timezone offsets:
utc=True: converts to DatetimeIndex, everything is converted to UTC
utc=False: becomes Index of object dtype
What's the use-case for the latter? If it's just Index, then you can't use the .dt accessor or anything that you'd normally do with dates.
Should it be deprecated?