DEPR deprecate mixed timezone offsets with utc=False? #50887

Closed
MarcoGorelli opened this issue Jan 20, 2023 · 35 comments
Labels: Datetime (Datetime data dtype), Deprecate (Functionality to remove in pandas)

@MarcoGorelli
Member

Currently, there are two possible behaviours for mixed timezone offsets:

  • utc=True: converts to DatetimeIndex, everything is converted to UTC
  • utc=False: becomes Index of object dtype
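
For example (a minimal illustration - the exact reprs assume pandas 1.5-era behaviour):

In [1]: ts = ['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00']

In [2]: pd.to_datetime(ts, utc=True)
Out[2]: DatetimeIndex(['2019-12-31 23:00:00+00:00', '2020-01-01 22:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

In [3]: pd.to_datetime(ts)  # utc=False, the default
Out[3]: Index([2020-01-01 00:00:00+01:00, 2020-01-02 00:00:00+02:00], dtype='object')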

What's the use-case for the latter? If it's just an object-dtype Index, then you can't use the .dt accessor or do anything else you'd normally do with dates.

Should it be deprecated?

@MarcoGorelli added the Datetime (Datetime data dtype) and Deprecate (Functionality to remove in pandas) labels on Jan 20, 2023
@mroeschke
Member

I remember addressing this years ago with the goal "maintain the maximum information possible" in datetime parsing, even though you get an Index[object] in this case. With this in mind, a common way to encounter this is a user who has a time series with UTC offsets that crosses DST and doesn't want to convert to UTC.

@jbrockmendel
Member

I think of this as like a more performant obj.apply(pd.to_datetime). Definitely would like to get it out of array_to_datetime, though.

@MarcoGorelli
Member Author

MarcoGorelli commented Jan 26, 2023

a common way to encounter this is a user who has a time series with UTC offsets that crosses DST and doesn't want to convert to UTC.

Sure, but then wouldn't you typically use utc=True and then convert to your timezone? E.g.

In [73]: ts = ['2021-03-27T23:59:59+01:00', '2021-03-28T23:59:59+02:00']

In [74]: pd.to_datetime(ts, utc=True).tz_convert('Europe/Vienna')
Out[74]: DatetimeIndex(['2021-03-27 23:59:59+01:00', '2021-03-28 23:59:59+02:00'], dtype='datetime64[ns, Europe/Vienna]', freq=None)

I think of this as like a more performant obj.apply(pd.to_datetime).

Sure, but why would you want that if you end up with an object Series on which you can't use the .dt namespace anyway?


EDIT: things to address (sketch below):

  • extracting the local datetime
  • splitting the local datetime and offset into different columns
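
Both can be recovered via apply - a sketch, not necessarily the fastest approach, and the column names are just illustrative:

import datetime as dt
import pandas as pd

s = pd.Series(['2021-03-27T23:59:59+01:00', '2021-03-28T23:59:59+02:00'])
parsed = s.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z'))
out = pd.DataFrame(
    {
        # local wall time, with the offset dropped
        'local': parsed.apply(lambda x: x.replace(tzinfo=None)),
        # each element's UTC offset, as a timedelta
        'offset': parsed.apply(lambda x: x.utcoffset()),
    }
)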

@MarcoGorelli
Member Author

To follow up from yesterday's discussion, here's an example that was mentioned: what if users just want to keep the local datetime?

Doing that by calling apply directly would still be faster:

In [49]: ts = pd.Series(pd.date_range('1900', '2000').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))

In [50]: %%timeit
    ...: res = ts.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z').replace(tzinfo=None))
    ...:
    ...:
232 ms ± 458 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %%timeit
    ...: res = to_datetime(ts).apply(lambda x: x.replace(tzinfo=None))
    ...:
    ...:
1.87 s ± 41.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(OK wow, that's a much bigger difference than I was expecting when I started putting together this example...)

Furthermore, allowing for multiple timezones slows down the general case, see #50107.

If tz was only allowed to be a single time zone or None, then:

  • in the single-time-zone case, we'd get a speedup by not spending so much time in _return_parsed_timezone_results
  • in the mixed-time-zone case, users would need to use apply directly, and they'd get a speedup too

The issue with returning an object Series is that if you want to do anything useful with it, you're going to have to use apply anyway

cc @jorisvandenbossche and @axil as you also took part in the discussion

@jorisvandenbossche
Member

@MarcoGorelli if I run your example, for me the second variant (using to_datetime) is actually faster:

In [1]: ts = pd.Series(pd.date_range('1900', '2000').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))

In [2]: import datetime as dt

In [3]: %timeit ts.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z').replace(tzinfo=None))
372 ms ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit pd.to_datetime(ts).apply(lambda x: x.replace(tzinfo=None))
280 ms ± 9.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@MarcoGorelli
Member Author

🤔 How odd... no idea what that could be down to. Any ideas?

@MarcoGorelli
Member Author

I tried on someone else's machine (on Kaggle), and could reproduce the slow-down: https://www.kaggle.com/marcogorelli/to-datetime-example

@jbrockmendel
Member

My timeit results on main roughly match Joris's.

@MarcoGorelli
Member Author

Thanks - I tried on main, and can reproduce your result. I haven't pinned down the commit, but I guess some of the to_datetime work has really sped it up since 1.5.x.

I'll keep exploring - there may indeed be a way to keep this around without slowing down the more standard cases

@MarcoGorelli
Member Author

Here are some more timings, with the latest RC (2.0.0rc1) in a fresh virtual environment:

In [1]: ts = pd.Series(pd.date_range('1900', '2000').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))

In [2]: import datetime as dt

Parsing with mixed offsets to object, then using apply:

In [3]: %timeit pd.to_datetime(ts).apply(lambda x: x.replace(tzinfo=None))
182 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Parsing with mixed offsets directly using apply

In [5]: %timeit ts.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z').replace(tzinfo=None))
244 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slightly slower, but similar order of magnitude.

But what this hides is that the reason the user had mixed offsets was almost certainly due to DST, and so they should probably be doing

pd.to_datetime(ts, utc=True).dt.tz_convert('Europe/Brussels')

Using this to do the same operation as above (i.e. parse, and then remove the time zone):

In [7]: %timeit pd.to_datetime(ts, utc=True).dt.tz_convert('Europe/Brussels').dt.tz_localize(None)
64.6 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

we get the most performant solution (3x - 4x faster than the above)


So, the cases to consider are:

  • multiple offsets due to DST:
    • the user would be best off doing pd.to_datetime(ts, utc=True).dt.tz_convert(time_zone)
  • multiple offsets due to different time zones:
    • the user will probably need to use apply anyway, so we might as well nudge them towards that right away

I'm fairly confident that this would also reduce complexity substantially, and further improve performance in the "regular" case. Perhaps I'll put up a draft PR to demonstrate, if that helps.

@mroeschke
Member

In the apply case, what if the user has mixed-format, mixed-timezone-offset data?

In [3]: pd.to_datetime(["2020-01-01", "2020-01-01T00:00:00+04:00", "2020-01-01T00:00:00-06:00"], format="mixed")
Out[3]:
Index([2020-01-01 00:00:00, 2020-01-01 00:00:00+04:00,
       2020-01-01 00:00:00-06:00],
      dtype='object')

But what this hides is that the reason the user had mixed offsets was almost certainly due to DST, and so they should probably be doing

I have personally encountered dates with mixed UTC offsets because the data was collected at different locations, with the UTC offset standing in for each location's "timezone", so DST wasn't necessarily involved.

@MarcoGorelli
Member Author

In the apply case, what if the user has mixed-format, mixed-timezone-offset data?

In that case you'll end up with object dtype, and will have to use apply anyway - so my argument is that you'd be better off using apply directly
I presume you had to use apply operations in the scenario you're describing?

@mroeschke
Member

Probably yeah

@MarcoGorelli
Member Author

here's an example of this causing issues / unexpected behaviour: #43797

I'll open a tiny PDEP, expand on this a bit more, and then call a vote - we've got to either move forwards or close this at some point.

@jbrockmendel
Member

I'm coming around to being more enthusiastic about this.

@MarcoGorelli
Member Author

MarcoGorelli commented May 4, 2023

Thanks - I'll try to put together a POC PR showing what the implications would be; hopefully we can find agreement without having to call a vote for something this minor.

@MarcoGorelli
Member Author

MarcoGorelli commented May 5, 2023

Here's a POC of how this could work once deprecated: https://github.com/MarcoGorelli/pandas/pull/7/files

I've tried timing

ts = pd.Series(pd.date_range('1900', '2000', freq='1h').tz_localize('UTC').tz_convert('Europe/Brussels').strftime('%Y-%m-%dT%H:%M:%S%z'))
res = to_datetime(ts, utc=True)

and I'm seeing:

  • in that branch: 563 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • upstream/main: 1.71 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But aside from the performance benefits, my main point is about correctness / usefulness / nudging users towards what they probably really want, which is a timezone-aware DatetimeIndex rather than an object Index.


Finally, what we currently have is value-dependent behaviour:

In [33]: to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00'])
Out[33]: Index([2020-01-01 00:00:00+01:00, 2020-01-02 00:00:00+02:00], dtype='object')

In [34]: to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+01:00'])
Out[34]: DatetimeIndex(['2020-01-01 00:00:00+01:00', '2020-01-02 00:00:00+01:00'], dtype='datetime64[ns, UTC+01:00]', freq=None)

which isn't great


I'll go ahead and open a PR; if there are strong objections we can discuss them there.

@jorisvandenbossche
Member

But aside from the performance benefits, my main point is about correctness / usefulness / nudging users towards what they probably really want, which is a timezone-aware DatetimeIndex rather than an object Index.

If it's about nudging users towards the better behaviour, we could also consider only changing the default behaviour, while leaving the option to those users that have a use case for it.
(I know it's not only about this, though, also performance)

But that makes me wonder: what would be the behaviour after deprecation in case of mixed timezones? Raise an error that you have mixed timezones, which can't be parsed, unless you set utc=True? Or convert to UTC automatically?

@jorisvandenbossche
Member

We probably also need to consider the impact on reading CSV files if you have a column with mixed timezones (and if deprecating it in to_datetime, there might need to be an equivalent deprecation in read_csv).
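
For illustration, the current behaviour in that case looks roughly like this (a sketch - the parsed column comes back as object dtype holding Timestamps with differing offsets):

from io import StringIO

import pandas as pd

data = "ts\n2020-01-01T00:00:00+01:00\n2020-01-02T00:00:00+02:00\n"
df = pd.read_csv(StringIO(data), parse_dates=["ts"])
df["ts"].dtype  # object: mixed-offset Timestamps, not datetime64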

@MarcoGorelli
Member Author

read_csv keeps a column as object if it can't be parsed, so I don't think there'd be much of an impact there? Other than that the elements would now be ymd-hms-formatted strings.

yes, I'm suggesting to raise if you have mixed timezones and utc=False

What would you suggest changing the default to? If you're suggesting utc default to True, then my reservation is that timezone-naive dates would become UTC-aware ones:

In [24]: to_datetime(['2020-01-01T00:00'], utc=True)
Out[24]: DatetimeIndex(['2020-01-01 00:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

The other issue is complexity - there's a fair bit of it needed to deal with the mixed-timezone case.

If someone really wants mixed timezones in an object Series, then

In [10]: %%timeit
    ...: ts.apply(lambda x: Timestamp(x))
    ...: 
    ...: 
3.15 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

isn't impossibly slow, and they'd have to use apply to do anything with their new object Series anyway, so this level of performance would presumably be acceptable to them. Note that this is on nearly 1 million rows.

@jorisvandenbossche
Member

@MarcoGorelli I took a quick look at a profile of the snippet above, with and without your patch. My understanding is that the difference is almost entirely due to _return_parsed_timezone_results, which in the current code seems to be a bottleneck (it masks out parts, converts them, assigns them back into an (object dtype!) array, and at the end converts the objects back to a datetime64 index). That seems quite inefficient, and I would assume it could be optimized regardless of the discussion here. I think it should be possible to avoid this costly conversion, especially in the typical case of strings with utc=True that you profiled above.
(One idea: if utc=True, apply the offset already within the loop in array_strptime and localize the result to UTC afterwards (a metadata-change-only tz_localize("UTC")), and only keep track of the array of timezones if utc=False. And even for that case, I think it should be possible to rewrite _return_parsed_timezone_results to make it less inefficient.)

@jorisvandenbossche
Member

jorisvandenbossche commented May 5, 2023

And to be clear, I don't feel very strongly about this specific feature. I have never used it myself, and while I can see potential use cases, it's also true that the workaround (ts.apply(lambda x: Timestamp(x))) is OK (just slower, but then you're using object dtype and thus in slower code paths anyway). And the complexity argument is certainly valid, so if the benefits are only minor, the trade-off seems to gravitate towards not keeping this.
My post above is mostly because I was just curious where the performance difference was coming from :)

@jorisvandenbossche
Member

read_csv keeps a column as object if it can't be parsed, so I don't think there'd be much of an impact there?

Currently it returns an object column of Timestamps I think, while after a deprecation it would either raise (not ideal for read_csv?) or return a column of strings (so at least some change in behaviour?)

yes, I'm suggesting to raise if you have mixed timezones and utc=False

What would you suggest changing the default to? If you're suggesting utc default to True, then my reservation is that timezone-naive dates would become UTC-aware ones

Yeah, we of course don't want to change that if there are only naive timestamps to start with. A potential default could rather be something like "if we detect multiple timezones, return UTC values".

@MarcoGorelli
Member Author

Thanks for looking into it!

I'd rather not introduce extra complexity to array_strptime, it's already a complex-ish part of pandas with not many people taking a look at it

Note that mixed-timezone Timestamps already raise:

>>> to_datetime([Timestamp('2020-01-01 00:00').tz_localize('Europe/Brussels'), Timestamp('2020-01-02 00:00').tz_localize('Asia/Kathmandu')])
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True, at position 1

so we'd just be making the string case consistent with that

Currently it returns an object column of Timestamps

You're right, thanks - so yes, this would be a behaviour change: it would no longer parse that column, but leave it as-is.

@jorisvandenbossche
Member

As other reference, pyarrow's read_csv will by default return UTC whenever you have timezone offsets (so for mixed offsets, but also if you have a constant offset).

@MarcoGorelli
Member Author

thanks for the reference - looks like they do that for strptime as well, so they don't return a non-TimestampType column either

In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: dates = ['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00']

In [4]: table = pa.table({'arr': pa.array(dates, 'string')})

In [5]: pc.strptime(table.column('arr'), format='%Y-%m-%dT%H:%M%z', unit='us').type
Out[5]: TimestampType(timestamp[us])

In [6]: pc.strptime(table.column('arr'), format='%Y-%m-%dT%H:%M%z', unit='us')
Out[6]:
<pyarrow.lib.ChunkedArray object at 0x7f78f59e5ad0>
[
  [
    2019-12-31 23:00:00.000000,
    2020-01-01 22:00:00.000000
  ]
]

@jorisvandenbossche
Member

Just to illustrate how inefficient the current implementation was for utc=True, the following small patch avoids a roundtrip through object dtype for that case:

diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py
index 74210a1ce5..778081c32c 100644
--- a/pandas/core/tools/datetimes.py
+++ b/pandas/core/tools/datetimes.py
@@ -333,17 +333,22 @@ def _return_parsed_timezone_results(
     -------
     tz_result : Index-like of parsed dates with timezone
     """
-    tz_results = np.empty(len(result), dtype=object)
+    if utc:
+        tz_results = np.empty(len(result), dtype="datetime64[ns]")
+    else:
+        tz_results = np.empty(len(result), dtype=object)
     for zone in unique(timezones):
         mask = timezones == zone
         dta = DatetimeArray(result[mask]).tz_localize(zone)
         if utc:
             if dta.tzinfo is None:
-                dta = dta.tz_localize("utc")
+                dta = dta.tz_localize("utc")._ndarray
             else:
-                dta = dta.tz_convert("utc")
+                dta = dta.tz_convert("utc")._ndarray
         tz_results[mask] = dta
 
+    if utc:
+        tz_results = DatetimeArray(tz_results).tz_localize("utc")
     return Index(tz_results, name=name)

and with that the timing of the benchmark from #50887 (comment) for me goes from 2.2s to 800ms.

If someone really wants mixed timezones in an object Series, then ... ts.apply(lambda x: Timestamp(x))

I liked that workaround, and was thinking that we could mention it in the deprecation (or future error) message. However, one problem is that it doesn't give you a way to specify a format (or any of the other keywords of to_datetime).

And the more general ts.apply(lambda x: pd.to_datetime(x)) is unfortunately horribly slow... (10 min vs 4 s with Timestamp on the benchmark case)
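
A format can still be honoured in the object-dtype workaround, though, by going through strptime - a sketch, mirroring the apply timings shown earlier in this thread:

import datetime as dt
ts.apply(lambda x: pd.Timestamp(dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z')))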

looks like they [pyarrow] do that for strptime as well, so they don't return a non-TimestampType column either

There actually seems to be a bug there, in that it returns a plain Timestamp type without tz, while the input has timezones. I would expect it to at least return a tz=UTC timestamp type, just like their read_csv does.
Thanks for showing that example ;) Opened apache/arrow#35448 (and have a PR to fix it)

@MarcoGorelli
Member Author

wow, nice! so pyarrow parses as TimestampType(timestamp[us]) if elements are tz-naive, and TimestampType(timestamp[us, tz=UTC]) if they're tz-aware?

that's beautiful - could we do that in pandas and get rid of the utc kwarg? Ending up with a fixed-offset timezone like 'datetime64[ns, UTC+01:00]' is mostly useless; what you really want as a user is an 'Area/Location' timezone anyway.

@jbrockmendel @mroeschke do you have thoughts on mirroring pyarrow here?

@mroeschke
Member

I guess I won't die on the mixed-tz-offset-case hill and would be okay with returning UTC. A UserWarning noting that multiple different offsets were detected would be nice. What about if the offsets were all the same?

It would be nice to remove the utc kwarg from to_datetime too.

@MarcoGorelli
Member Author

What about if the offsets were all the same?

I'd stay consistent with pyarrow here and just convert to UTC anyway - simple and predictable.

@MarcoGorelli
Member Author

From #53250, it seems that we can't really convert everything to UTC.

So, going back to the initial suggestion of just disallowing mixed timezones - I'll see if I can elaborate.

@MarcoGorelli
Member Author

outcome of today's call: people are generally in favour of raising in such a case, and advising users to pass utc=True
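
Roughly, the agreed direction (a sketch - the exact exception type and message are up to the implementing PR):

# mixed offsets with utc=False (the default): raise
pd.to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00'])
# the error message would advise opting in to UTC conversion instead:
pd.to_datetime(['2020-01-01T00:00+01:00', '2020-01-02T00:00+02:00'], utc=True)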

@natmokval
Contributor

I would like to work on this.

@lafrech

lafrech commented Sep 4, 2023

I guess this can be closed since #54014 was merged.

@MarcoGorelli
Member Author

Yup thanks!
