DEPR deprecate mixed timezone offsets with utc=False? #50887

Closed
MarcoGorelli opened this issue Jan 20, 2023 · 35 comments
Labels: Datetime (Datetime data dtype), Deprecate (Functionality to remove in pandas)

@MarcoGorelli
Member

Currently, there are two possible behaviours for mixed timezone offsets:

  • utc=True: converts to DatetimeIndex, everything is converted to UTC
  • utc=False: becomes Index of object dtype
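
For example (a minimal illustration - the exact reprs assume pandas 1.5-era behaviour):

In [1]: ts = ['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00']

In [2]: pd.to_datetime(ts, utc=True)
Out[2]: DatetimeIndex(['2019-12-31 23:00:00+00:00', '2020-01-01 22:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

In [3]: pd.to_datetime(ts)  # utc=False, the default
Out[3]: Index([2020-01-01 00:00:00+01:00, 2020-01-02 00:00:00+02:00], dtype='object')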

What's the use-case for the latter? If it's just an object-dtype Index, then you can't use the .dt accessor or do anything else you'd normally do with dates.

Should it be deprecated?

@MarcoGorelli added the Datetime (Datetime data dtype) and Deprecate (Functionality to remove in pandas) labels on Jan 20, 2023
@mroeschke
Member

I remember addressing this years ago with the goal "maintain the maximum information possible" in datetime parsing, even though you get an Index[object] in this case. With this in mind, a common way to encounter this is a user who has a time series with UTC offsets that crosses DST and doesn't want to convert to UTC.

@jbrockmendel
Member

I think of this as like a more performant obj.apply(pd.to_datetime). Definitely would like to get it out of array_to_datetime, though.

@MarcoGorelli
Member Author

MarcoGorelli commented Jan 26, 2023

a common way to encounter this is a user who has a time series with UTC offsets that crosses DST and doesn't want to convert to UTC.

Sure, but then wouldn't you typically use utc=True and then convert to your timezone? E.g.

In [73]: ts = ['2021-03-27T23:59:59+01:00', '2021-03-28T23:59:59+02:00']

In [74]: pd.to_datetime(ts, utc=True).tz_convert('Europe/Vienna')
Out[74]: DatetimeIndex(['2021-03-27 23:59:59+01:00', '2021-03-28 23:59:59+02:00'], dtype='datetime64[ns, Europe/Vienna]', freq=None)

I think of this as like a more performant obj.apply(pd.to_datetime).

Sure, but why would you want that if you end up with an object Series on which you can't use the .dt namespace anyway?


EDIT: things to address (sketch below):

  • extracting the local datetime
  • splitting the local datetime and offset into different columns
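
Both can be recovered via apply - a sketch, not necessarily the fastest approach, and the column names are just illustrative:

import datetime as dt
import pandas as pd

s = pd.Series(['2021-03-27T23:59:59+01:00', '2021-03-28T23:59:59+02:00'])
parsed = s.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z'))
out = pd.DataFrame(
    {
        # local wall time, with the offset dropped
        'local': parsed.apply(lambda x: x.replace(tzinfo=None)),
        # each element's UTC offset, as a timedelta
        'offset': parsed.apply(lambda x: x.utcoffset()),
    }
)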

@MarcoGorelli
Member Author

To follow up from yesterday's discussion, here's an example that was mentioned: what if users just want to keep the local datetime?

Doing that by calling apply directly would still be faster:

In [49]: ts = pd.Series(pd.date_range('1900', '2000').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))

In [50]: %%timeit
    ...: res = ts.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z').replace(tzinfo=None))
    ...:
    ...:
232 ms ± 458 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %%timeit
    ...: res = to_datetime(ts).apply(lambda x: x.replace(tzinfo=None))
    ...:
    ...:
1.87 s ± 41.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(OK wow, that's a much bigger difference than I was expecting when I started putting together this example...)

Furthermore, allowing for multiple timezones slows down the general case, see #50107.

If tz was only allowed to be a single time zone or None, then:

  • in the single-time-zone case, we'd get a speedup by not spending so much time in _return_parsed_timezone_results
  • in the mixed-time-zone case, users would need to use apply directly, and they'd get a speedup too

The issue with returning an object Series is that if you want to do anything useful with it, you're going to have to use apply anyway

cc @jorisvandenbossche and @axil as you also took part in the discussion

@jorisvandenbossche
Member

@MarcoGorelli if I run your example, for me the second variant (using to_datetime) is actually faster:

In [1]: ts = pd.Series(pd.date_range('1900', '2000').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))

In [2]: import datetime as dt

In [3]: %timeit ts.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z').replace(tzinfo=None))
372 ms ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit pd.to_datetime(ts).apply(lambda x: x.replace(tzinfo=None))
280 ms ± 9.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@MarcoGorelli
Member Author

🤔 How odd... no idea what that could be down to. Any ideas?

@MarcoGorelli
Member Author

I tried on someone else's machine (on Kaggle), and could reproduce the slow-down: https://www.kaggle.com/marcogorelli/to-datetime-example

@jbrockmendel
Member

My timeit results on main roughly match Joris's.

@MarcoGorelli
Member Author

Thanks - I tried on main, and can reproduce your result. I haven't pinned down the commit, but I guess some of the to_datetime work has really sped it up since 1.5.x.

I'll keep exploring - there may indeed be a way to keep this around without slowing down the more standard cases

@MarcoGorelli
Member Author

Here are some more timings, with the latest RC (2.0.0rc1) in a fresh virtual environment:

In [1]: ts = pd.Series(pd.date_range('1900', '2000').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))

In [2]: import datetime as dt

Parsing with mixed offsets to object, then using apply:

In [3]: %timeit pd.to_datetime(ts).apply(lambda x: x.replace(tzinfo=None))
182 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Parsing with mixed offsets directly using apply

In [5]: %timeit ts.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z').replace(tzinfo=None))
244 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slightly slower, but similar order of magnitude.

But what this hides is that the reason the user had mixed offsets was almost certainly due to DST, and so they should probably be doing

pd.to_datetime(ts, utc=True).dt.tz_convert('Europe/Brussels')

Using this to do the same operation as above (i.e. parse, and then remove the time zone):

In [7]: %timeit pd.to_datetime(ts, utc=True).dt.tz_convert('Europe/Brussels').dt.tz_localize(None)
64.6 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

we get the most performant solution (3x - 4x faster than the above)


So, the cases to consider are:

  • multiple offsets due to DST:
    • the user would be best off doing pd.to_datetime(ts, utc=True).dt.tz_convert(time_zone)
  • multiple offsets due to different time zones:
    • the user will probably need to use apply anyway, so we might as well nudge them towards that right away

I'm fairly confident that this would also reduce complexity substantially, and further improve performance in the "regular" case. Perhaps I'll put up a draft PR to demonstrate, if that helps.

@mroeschke
Member

In the apply case, what if the user has mixed-format, mixed-timezone-offset data?

In [3]: pd.to_datetime(["2020-01-01", "2020-01-01T00:00:00+04:00", "2020-01-01T00:00:00-06:00"], format="mixed")
Out[3]:
Index([2020-01-01 00:00:00, 2020-01-01 00:00:00+04:00,
       2020-01-01 00:00:00-06:00],
      dtype='object')

But what this hides is that the reason the user had mixed offsets was almost certainly due to DST, and so they should probably be doing

I have personally encountered dates with mixed UTC offsets because the data was collected at different locations, with the UTC offset standing in for each location's "timezone", so DST wasn't necessarily involved.

@MarcoGorelli
Member Author

In the apply case, what if the user has mixed-format, mixed-timezone-offset data?

In that case you'll end up with object dtype, and will have to use apply anyway - so my argument is that you'd be better off using apply directly
I presume you had to use apply operations in the scenario you're describing?

@mroeschke
Member

Probably yeah

@MarcoGorelli
Member Author

here's an example of this causing issues / unexpected behaviour: #43797

I'll open a tiny PDEP, expand on this a bit more, and then call a vote - we've got to either move forwards or close this at some point.

@jbrockmendel
Member

I'm coming around to being more enthusiastic about this.

@MarcoGorelli
Member Author

MarcoGorelli commented May 4, 2023

Thanks - I'll try to put together a POC PR showing what the implications would be; hopefully we can find agreement without having to call a vote for something this minor.

@MarcoGorelli
Member Author

MarcoGorelli commented May 5, 2023

Here's a POC of how this could work once deprecated: https://github.com/MarcoGorelli/pandas/pull/7/files

I've tried timing

ts = pd.Series(pd.date_range('1900', '2000', freq='1h').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))
res = to_datetime(ts, utc=True)

and I'm seeing:

  • in that branch: 563 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • upstream/main: 1.71 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But aside from the performance benefits, my main point is about correctness / usefulness / nudging users towards what they probably really want, which is a timezone-aware DatetimeIndex rather than an object Index.


Finally, what we currently have is value-dependent behaviour:

In [33]: to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00'])
Out[33]: Index([2020-01-01 00:00:00+01:00, 2020-01-02 00:00:00+02:00], dtype='object')

In [34]: to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+01:00'])
Out[34]: DatetimeIndex(['2020-01-01 00:00:00+01:00', '2020-01-02 00:00:00+01:00'], dtype='datetime64[ns, UTC+01:00]', freq=None)

which isn't great


I'll go ahead and open a PR; if there are strong objections we can discuss them there.

@jorisvandenbossche
Member

But aside from the performance benefits, my main point is about correctness / usefulness / nudging users towards what they probably really want, which is a timezone-aware DatetimeIndex rather than an object Index.

If it's about nudging users towards the better behaviour, we could also consider only changing the default behaviour, while leaving the option to those users that have a use case for it.
(I know it's not only about this, though, also performance)

But that makes me wonder: what would be the behaviour after deprecation in case of mixed timezones? Raise an error that you have mixed timezones, which can't be parsed, unless you set utc=True? Or convert to UTC automatically?

@jorisvandenbossche
Member

We probably also need to consider the impact on reading CSV files if you have a column with mixed timezones (and if deprecating it in to_datetime, there might need to be an equivalent deprecation in read_csv).
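
For illustration, the current behaviour in that case looks roughly like this (a sketch - the parsed column comes back as object dtype holding Timestamps with differing offsets):

from io import StringIO

import pandas as pd

data = "ts\n2020-01-01T00:00:00+01:00\n2020-01-02T00:00:00+02:00\n"
df = pd.read_csv(StringIO(data), parse_dates=["ts"])
df["ts"].dtype  # object: mixed-offset Timestamps, not datetime64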

@MarcoGorelli
Member Author

read_csv keeps a column as object if it can't be parsed, so I don't think there'd be much of an impact there? Other than that the elements would now be ymd-hms-formatted strings.

yes, I'm suggesting to raise if you have mixed timezones and utc=False

What would you suggest changing the default to? If you're suggesting utc default to True, then my reservation is that timezone-naive dates would become UTC-aware ones:

In [24]: to_datetime(['2020-01-01T00:00'], utc=True)
Out[24]: DatetimeIndex(['2020-01-01 00:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

The other issue is complexity - there's a fair bit of it needed to deal with the mixed-timezone case.

If someone really wants mixed timezones in an object Series, then

In [10]: %%timeit
    ...: ts.apply(lambda x: Timestamp(x))
    ...: 
    ...: 
3.15 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

isn't impossibly slow, and they'd have to use apply to do anything with their new object Series anyway, so this level of performance would presumably be acceptable to them. Note that this is on nearly 1 million rows.

@jorisvandenbossche
Member

@MarcoGorelli I took a quick look at a profile of the snippet above, with and without your patch. My understanding is that the difference is almost entirely due to _return_parsed_timezone_results, which in the current code seems to be a bottleneck (it masks out parts, converts them, assigns them back into an (object dtype!) array, and at the end converts the objects back to a datetime64 index). That seems quite inefficient, and I would assume it could be optimized regardless of the discussion here. I think it should be possible to avoid this costly conversion, especially in the typical case of strings with utc=True that you profiled above.
(One idea: if utc=True, apply the offset already within the loop in array_strptime and localize the result to UTC afterwards (a metadata-change-only tz_localize("UTC")), and only keep track of the array of timezones if utc=False. And even for that case, I think it should be possible to rewrite _return_parsed_timezone_results to make it less inefficient.)

@jorisvandenbossche
Member

jorisvandenbossche commented May 5, 2023

And to be clear, I don't feel very strongly about this specific feature. I have never used it myself, and while I can see potential use cases, it's also true that the workaround (ts.apply(lambda x: Timestamp(x))) is OK (just slower, but then you're using object dtype and thus in slower code paths anyway). And the complexity argument is certainly valid, so if the benefits are only minor, the trade-off seems to gravitate towards not keeping this.
My post above is mostly because I was just curious where the performance difference was coming from :)

@jorisvandenbossche
Member

read_csv keeps a column as object if it can't be parsed, so I don't think there'd be much of an impact there?

Currently it returns an object column of Timestamps I think, while after a deprecation it would either raise (not ideal for read_csv?) or return a column of strings (so at least some change in behaviour?)

yes, I'm suggesting to raise if you have mixed timezones and utc=False

What would you suggest changing the default to? If you're suggesting utc default to True, then my reservation is that timezone-naive dates would become UTC-aware ones

Yeah, we of course don't want to change that if there are only naive timestamps to start with. A potential default could rather be something like "if we detect multiple timezones, return UTC values".

@MarcoGorelli
Member Author

Thanks for looking into it!

I'd rather not introduce extra complexity to array_strptime, it's already a complex-ish part of pandas with not many people taking a look at it

Note that mixed-timezone Timestamps already raise:

>>> to_datetime([Timestamp('2020-01-01 00:00').tz_localize('Europe/Brussels'), Timestamp('2020-01-02 00:00').tz_localize('Asia/Kathmandu')])
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True, at position 1

so we'd just be making the string case consistent with that

Currently it returns an object column of Timestamps

You're right, thanks - so yes, this would be a behaviour change: it would no longer parse that column, but leave it as-is.

@jorisvandenbossche
Member

As other reference, pyarrow's read_csv will by default return UTC whenever you have timezone offsets (so for mixed offsets, but also if you have a constant offset).

@MarcoGorelli
Member Author

thanks for the reference - looks like they do that for strptime as well, so they don't return a non-TimestampType column either

In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: dates = ['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00']

In [4]: table = pa.table({'arr': pa.array(dates, 'string')})

In [5]: pc.strptime(table.column('arr'), format='%Y-%m-%dT%H:%M%z', unit='us').type
Out[5]: TimestampType(timestamp[us])

In [6]: pc.strptime(table.column('arr'), format='%Y-%m-%dT%H:%M%z', unit='us')
Out[6]:
<pyarrow.lib.ChunkedArray object at 0x7f78f59e5ad0>
[
  [
    2019-12-31 23:00:00.000000,
    2020-01-01 22:00:00.000000
  ]
]

@jorisvandenbossche
Member

Just to illustrate how inefficient the current implementation was for utc=True, the following small patch avoids a roundtrip through object dtype for that case:

diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py
index 74210a1ce5..778081c32c 100644
--- a/pandas/core/tools/datetimes.py
+++ b/pandas/core/tools/datetimes.py
@@ -333,17 +333,22 @@ def _return_parsed_timezone_results(
     -------
     tz_result : Index-like of parsed dates with timezone
     """
-    tz_results = np.empty(len(result), dtype=object)
+    if utc:
+        tz_results = np.empty(len(result), dtype="datetime64[ns]")
+    else:
+        tz_results = np.empty(len(result), dtype=object)
     for zone in unique(timezones):
         mask = timezones == zone
         dta = DatetimeArray(result[mask]).tz_localize(zone)
         if utc:
             if dta.tzinfo is None:
-                dta = dta.tz_localize("utc")
+                dta = dta.tz_localize("utc")._ndarray
             else:
-                dta = dta.tz_convert("utc")
+                dta = dta.tz_convert("utc")._ndarray
         tz_results[mask] = dta
 
+    if utc:
+        tz_results = DatetimeArray(tz_results).tz_localize("utc")
     return Index(tz_results, name=name)

and with that the timing of the benchmark from #50887 (comment) for me goes from 2.2s to 800ms.

If someone really wants mixed timezones in an object Series, then ... ts.apply(lambda x: Timestamp(x))

I liked that workaround, and was thinking that we could mention it in the deprecation (or future error) message. However, one problem is that it doesn't give you a way to specify a format (or any of the other keywords of to_datetime).

And the more general ts.apply(lambda x: pd.to_datetime(x)) is unfortunately horribly slow... (10 min vs 4 s with Timestamp on the benchmark case)
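
A format can still be honoured in the object-dtype workaround, though, by going through strptime - a sketch, mirroring the apply timings shown earlier in this thread:

import datetime as dt
ts.apply(lambda x: pd.Timestamp(dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z')))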

looks like they [pyarrow] do that for strptime as well, so they don't return a non-TimestampType column either

There actually seems to be a bug there, in that it returns a plain Timestamp type without tz, while the input has timezones. I would expect it to at least return a tz=UTC timestamp type, just like their read_csv does.
Thanks for showing that example ;) Opened apache/arrow#35448 (and have a PR to fix it)

@MarcoGorelli
Member Author

wow, nice! so pyarrow parses as TimestampType(timestamp[us]) if elements are tz-naive, and TimestampType(timestamp[us, tz=UTC]) if they're tz-aware?

that's beautiful - could we do that in pandas and get rid of the utc kwarg? Ending up with a fixed-offset timezone like 'datetime64[ns, UTC+01:00]' is mostly useless; what you really want as a user is an 'Area/Location' timezone anyway.

@jbrockmendel @mroeschke do you have thoughts on mirroring pyarrow here?

@mroeschke
Member

I guess I won't die on the mixed-tz-offset-case hill and would be okay with returning UTC. A UserWarning noting that multiple different offsets were detected would be nice. What about if the offsets were all the same?

It would be nice to remove the utc kwarg from to_datetime too.

@MarcoGorelli
Member Author

What about if the offsets were all the same?

I'd stay consistent with pyarrow here and just convert to UTC anyway - simple and predictable.

@MarcoGorelli
Member Author

From #53250, it seems that we can't really convert everything to UTC.

So, going back to the initial suggestion of just disallowing mixed timezones - I'll see if I can elaborate.

@MarcoGorelli
Member Author

outcome of today's call: people are generally in favour of raising in such a case, and advising users to pass utc=True
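
Roughly, the agreed direction (a sketch - the exact exception type and message are up to the implementing PR):

# mixed offsets with utc=False (the default): raise
pd.to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00'])
# the error message would advise opting in to UTC conversion instead:
pd.to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00'], utc=True)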

@natmokval
Contributor

I would like to work on this.

@lafrech

lafrech commented Sep 4, 2023

I guess this can be closed since #54014 was merged.

@MarcoGorelli
Member Author

Yup thanks!
