-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datetime parsing (PDEP-4): allow mixture of ISO formatted strings #50939
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does format="ISO8601"
interact with exact=True/False
?
It's ignored - good point, I'll make sure that's in the docstring |
Could we raise an error instead? (since the default is OK, it's only when the user explicitly passes |
One question that is not directly clear to me from the doc update (without diving into the code): you can pass I assume not (and we probably also wouldn't want to do that, I think? Although it would be more backwards compatible), but I was wondering if we can make this clearer for the cases that would benefit from this (like |
pandas/core/tools/datetimes.py
Outdated
@@ -759,13 +766,15 @@ def to_datetime( | |||
<https://docs.python.org/3/library/datetime.html | |||
#strftime-and-strptime-behavior>`_ for more information on choices, though | |||
note that :const:`"%f"` will parse all the way up to nanoseconds. | |||
You can also pass "ISO8601" to parse any ISO8601 time string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a link that explains ISO8601? (eg https://en.wikipedia.org/wiki/ISO_8601)
Although looking at that link makes me wonder if we then should be explicit about that this is only the calendar date + time formatting (and not week dates or ordinal dates)
I wasn't planning on doing that, as it would strike me as value-dependent behaviour Good point on the error message though, probably worth putting extra effort in to making sure that's as clear and actionable as possible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for your review! @jorisvandenbossche any objections? |
Does this close #50972 as well? |
yup, thanks! |
@@ -186,6 +186,7 @@ def array_strptime( | |||
bint iso_format = format_is_iso(fmt) | |||
NPY_DATETIMEUNIT out_bestunit | |||
int out_local = 0, out_tzoffset = 0 | |||
bint string_to_dts_succeeded = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any particular reason this is changed from failed? doesn't really matter to me, just curious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it just simplifies the logic
@@ -2086,7 +2108,7 @@ def test_dataframe_coerce(self, cache): | |||
|
|||
msg = ( | |||
r'^cannot assemble the datetimes: time data ".+" doesn\'t ' | |||
r'match format "%Y%m%d", at position 1$' | |||
r'match format "%Y%m%d", at position 1.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NBD but should the period be escaped here (and below)?
Couple of small questions, none of them blockers. LGTM. |
Looks like a merge conflict then we can merge |
resolved, thanks for your reviews! |
doc/source/whatsnew/v2.0.0.rst
Outdated
If you still need to parse dates with inconsistent formats, you'll need to apply :func:`to_datetime` | ||
to each element individually, e.g. :: | ||
If you still need to parse dates with inconsistent formats, you can use | ||
``format='mixed`` (preferably alongside ``dayfirst``) :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this "preferably", it depends on your data?
pandas/_libs/tslibs/strptime.pyx
Outdated
f"{str(ex)}, at position {i}. You might want to try:\n" | ||
" - passing ``format='ISO8601'`` if your strings are " | ||
"all ISO8601 but not necessarily in exactly the same format;\n" | ||
" - passing ``format='mixed'``, and the format will be " | ||
"inferred for each element individually. " | ||
"You might want to use ``dayfirst`` alongside this.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might still want to mention the suggestion to pass a specific format
first?
For example, if an error occurs because the format is ambiguous based on the first value (dayfirst/daylast), and was inferred incorrectly:
pd.to_datetime(["12/01/2012", "13/01/2012"])
...
ValueError: time data "13/01/2012" doesn't match format "%m/%d/%Y", at position 1. You might want to try:
- passing ``format='ISO8601'`` if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing ``format='mixed'``, and the format will be inferred for each element individually. You might want to use ``dayfirst`` alongside this.
Here, I think the best is still to provide a manual format="%d/%m/%Y"
(or to specify dayfirst=True
, then the inference would work for this data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also, a fomatting nit: since this is plain text (and not rst that will be rendered), no need for double backticks, I would use single backticks which is a bit less visually distracting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, thanks!
- "ISO8601", to parse any `ISO8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ | ||
time string (not necessarily in exactly the same format); | ||
- "mixed", to infer the format for each element individually. This is risky, | ||
and you should probably use it along with `dayfirst`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true that in that case that is best, but the typicaly caveat of that this dayfirst=True/False
is not strict still applies, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, but it would solve the typical example at least
In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], format='mixed', dayfirst=True)
Out[2]: DatetimeIndex(['2000-01-12', '2000-01-13'], dtype='datetime64[ns]', freq=None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was just wondering if it would be useful to still call it out here explicitly (but it's quite explicit in the dayfirst
entry in the docstring)
@@ -64,6 +64,11 @@ Out[3]: | |||
1 2000-01-13 | |||
dtype: datetime64[ns] | |||
``` | |||
or, if their dates are all ISO8601, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can update the one above to use format="mixed"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this one - thanks!
@MarcoGorelli merged in main once more and then I think we are good to merge. If there are any outstanding issues they can be addressed during the rc period |
sounds good, thanks! |
merging then so this doesn't block the RC, but if there's further requests please do let me know and I'll address them promptly |
Thanks @MarcoGorelli ! |
format='mixed'
in to_datetime #50972doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.