Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datetime parsing (PDEP-4): allow mixture of ISO formatted strings #50939

Merged
merged 22 commits into from
Feb 14, 2023

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Jan 23, 2023

@MarcoGorelli MarcoGorelli marked this pull request as ready for review January 23, 2023 13:19
@MarcoGorelli MarcoGorelli added the Datetime Datetime data dtype label Jan 23, 2023
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does format="ISO8601" interact with exact=True/False?

@MarcoGorelli
Copy link
Member Author

It's ignored - good point, I'll make sure that's in the docstring

@jorisvandenbossche
Copy link
Member

How does format="ISO8601" interact with exact=True/False?

It's ignored - good point, I'll make sure that's in the docstring

Could we raise an error instead? (since the default is OK, it's only when the user explicitly passes exact=False that this can occur, so an error if you pass both format="ISO8601", exact=False seems fine?)

@jorisvandenbossche
Copy link
Member

One question that is not directly clear to me from the doc update (without diving into the code): you can pass format="ISO8601" explicitly, but do we also "infer" this format? (i.e. if the inferred format is a subset of ISO8601, use the flexible one as inferred format)

I assume not (and we probably also wouldn't want to do that, I think? Although it would be more backwards compatible), but I was wondering if we can make this clearer for the cases that would benefit from this (like pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])) that work fine on pandas < 2, but error now.
Right now, that case gives a "... doesn't match format ..." error message. Ideally we could point people to this option. In theory we could check if the inferred format is ISO-like and infer the format from the non-matching string, and if that is also ISO like, append the error message. But that's maybe making it a bit too complicated not worth this corner case ;)

@@ -759,13 +766,15 @@ def to_datetime(
<https://docs.python.org/3/library/datetime.html
#strftime-and-strptime-behavior>`_ for more information on choices, though
note that :const:`"%f"` will parse all the way up to nanoseconds.
You can also pass "ISO8601" to parse any ISO8601 time string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a link that explains ISO8601? (eg https://en.wikipedia.org/wiki/ISO_8601)

Although looking at that link makes me wonder if we then should be explicit about that this is only the calendar date + time formatting (and not week dates or ordinal dates)

@MarcoGorelli
Copy link
Member Author

but do we also "infer" this format? (i.e. if the inferred format is a subset of ISO8601, use the flexible one as inferred format)
I assume not

I wasn't planning on doing that, as it would strike me as value-dependent behaviour

Good point on the error message though, probably worth putting extra effort in to making sure that's as clear and actionable as possible

@MarcoGorelli MarcoGorelli marked this pull request as draft January 24, 2023 18:21
@MarcoGorelli MarcoGorelli marked this pull request as ready for review January 25, 2023 10:45
@MarcoGorelli MarcoGorelli marked this pull request as draft January 25, 2023 19:11
@MarcoGorelli MarcoGorelli marked this pull request as ready for review January 25, 2023 21:06
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MarcoGorelli
Copy link
Member Author

LGTM

Thanks for your review!

@jorisvandenbossche any objections?

@mroeschke mroeschke added this to the 2.0 milestone Feb 8, 2023
@mroeschke
Copy link
Member

Does this close #50972 as well?

@MarcoGorelli
Copy link
Member Author

yup, thanks!

@@ -186,6 +186,7 @@ def array_strptime(
bint iso_format = format_is_iso(fmt)
NPY_DATETIMEUNIT out_bestunit
int out_local = 0, out_tzoffset = 0
bint string_to_dts_succeeded = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason this is changed from failed? doesn't really matter to me, just curious

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it just simplifies the logic

@@ -2086,7 +2108,7 @@ def test_dataframe_coerce(self, cache):

msg = (
r'^cannot assemble the datetimes: time data ".+" doesn\'t '
r'match format "%Y%m%d", at position 1$'
r'match format "%Y%m%d", at position 1.'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NBD but should the period be escaped here (and below)?

@jbrockmendel
Copy link
Member

Couple of small questions, none of them blockers. LGTM.

@mroeschke
Copy link
Member

Looks like a merge conflict then we can merge

@MarcoGorelli
Copy link
Member Author

Looks like a merge conflict then we can merge

resolved, thanks for your reviews!

If you still need to parse dates with inconsistent formats, you'll need to apply :func:`to_datetime`
to each element individually, e.g. ::
If you still need to parse dates with inconsistent formats, you can use
``format='mixed`` (preferably alongside ``dayfirst``) ::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this "preferably", it depends on your data?

Comment on lines 518 to 523
f"{str(ex)}, at position {i}. You might want to try:\n"
" - passing ``format='ISO8601'`` if your strings are "
"all ISO8601 but not necessarily in exactly the same format;\n"
" - passing ``format='mixed'``, and the format will be "
"inferred for each element individually. "
"You might want to use ``dayfirst`` alongside this.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might still want to mention the suggestion to pass a specific format first?

For example, if an error occurs because the format is ambiguous based on the first value (dayfirst/daylast), and was inferred incorrectly:

pd.to_datetime(["12/01/2012", "13/01/2012"])
...
ValueError: time data "13/01/2012" doesn't match format "%m/%d/%Y", at position 1. You might want to try:
    - passing ``format='ISO8601'`` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing ``format='mixed'``, and the format will be inferred for each element individually. You might want to use ``dayfirst`` alongside this.

Here, I think the best is still to provide a manual format="%d/%m/%Y" (or to specify dayfirst=True, then the inference would work for this data)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, a fomatting nit: since this is plain text (and not rst that will be rendered), no need for double backticks, I would use single backticks which is a bit less visually distracting)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, thanks!

- "ISO8601", to parse any `ISO8601 <https://en.wikipedia.org/wiki/ISO_8601>`_
time string (not necessarily in exactly the same format);
- "mixed", to infer the format for each element individually. This is risky,
and you should probably use it along with `dayfirst`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that in that case that is best, but the typicaly caveat of that this dayfirst=True/False is not strict still applies, right?

Copy link
Member Author

@MarcoGorelli MarcoGorelli Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, but it would solve the typical example at least

In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], format='mixed', dayfirst=True)
Out[2]: DatetimeIndex(['2000-01-12', '2000-01-13'], dtype='datetime64[ns]', freq=None)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was just wondering if it would be useful to still call it out here explicitly (but it's quite explicit in the dayfirst entry in the docstring)

@@ -64,6 +64,11 @@ Out[3]:
1 2000-01-13
dtype: datetime64[ns]
```
or, if their dates are all ISO8601,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can update the one above to use format="mixed" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this one - thanks!

@mroeschke mroeschke mentioned this pull request Feb 10, 2023
1 task
@mroeschke
Copy link
Member

@MarcoGorelli merged in main once more and then I think we are good to merge. If there are any outstanding issues they can be addressed during the rc period

@MarcoGorelli
Copy link
Member Author

sounds good, thanks!

@MarcoGorelli
Copy link
Member Author

merging then so this doesn't block the RC, but if there's further requests please do let me know and I'll address them promptly

@MarcoGorelli MarcoGorelli merged commit d1095bc into pandas-dev:main Feb 14, 2023
@jorisvandenbossche
Copy link
Member

Thanks @MarcoGorelli !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype
Projects
None yet
4 participants