Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datetime parsing (PDEP-4): allow mixture of ISO formatted strings #50939

Merged
merged 22 commits into from
Feb 14, 2023
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 12 additions & 3 deletions doc/source/user_guide/io.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1001,14 +1001,23 @@ way to parse dates is to explicitly set ``format=``.
)
df

In the case that you have mixed datetime formats within the same column, you'll need to
first read it in as an object dtype and then apply :func:`to_datetime` to each element.
In the case that you have mixed datetime formats within the same column, you can
pass ``format='mixed'``

.. ipython:: python

data = io.StringIO("date\n12 Jan 2000\n2000-01-13\n")
df = pd.read_csv(data)
df['date'] = df['date'].apply(pd.to_datetime)
df['date'] = pd.to_datetime(df['date'], format='mixed')
df

or, if your datetime formats are all ISO8601 (possibly not identically-formatted):

.. ipython:: python

data = io.StringIO("date\n2020-01-01\n2020-01-01 03:00\n")
df = pd.read_csv(data)
df['date'] = pd.to_datetime(df['date'], format='ISO8601')
df

.. ipython:: python
Expand Down
13 changes: 10 additions & 3 deletions doc/source/whatsnew/v2.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -191,6 +191,8 @@ Other enhancements
- Improved error message when trying to align :class:`DataFrame` objects (for example, in :func:`DataFrame.compare`) to clarify that "identically labelled" refers to both index and columns (:issue:`50083`)
- Added :meth:`DatetimeIndex.as_unit` and :meth:`TimedeltaIndex.as_unit` to convert to different resolutions; supported resolutions are "s", "ms", "us", and "ns" (:issue:`50616`)
- Added new argument ``dtype`` to :func:`read_sql` to be consistent with :func:`read_sql_query` (:issue:`50797`)
- :func:`to_datetime` now accepts ``"ISO8601"`` as an argument to ``format``, which will match any ISO8601 string (but possibly not identically-formatted) (:issue:`50411`)
- :func:`to_datetime` now accepts ``"mixed"`` as an argument to ``format``, which will infer the format for each element individually (:issue:`50972`)
-

.. ---------------------------------------------------------------------------
Expand Down Expand Up @@ -606,11 +608,16 @@ In the past, :func:`to_datetime` guessed the format for each element independent

Note that this affects :func:`read_csv` as well.

If you still need to parse dates with inconsistent formats, you'll need to apply :func:`to_datetime`
to each element individually, e.g. ::
If you still need to parse dates with inconsistent formats, you can use
``format='mixed`` (preferably alongside ``dayfirst``) ::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this "preferably", it depends on your data?


ser = pd.Series(['13-01-2000', '12 January 2000'])
ser.apply(pd.to_datetime)
pd.to_datetime(ser, format='mixed', dayfirst=True)

or, if your formats are all ISO8601 (but possibly not identically-formatted) ::

ser = pd.Series(['2020-01-01', '2020-01-01 03:00'])
pd.to_datetime(ser, format='ISO8601')

.. _whatsnew_200.api_breaking.other:

Expand Down
69 changes: 43 additions & 26 deletions pandas/_libs/tslibs/strptime.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -186,6 +186,7 @@ def array_strptime(
bint iso_format = format_is_iso(fmt)
NPY_DATETIMEUNIT out_bestunit
int out_local = 0, out_tzoffset = 0
bint string_to_dts_succeeded = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason this is changed from failed? doesn't really matter to me, just curious

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it just simplifies the logic


assert is_raise or is_ignore or is_coerce

Expand Down Expand Up @@ -306,53 +307,62 @@ def array_strptime(
else:
val = str(val)

if iso_format:
string_to_dts_failed = string_to_dts(
if fmt == "ISO8601":
string_to_dts_succeeded = not string_to_dts(
val, &dts, &out_bestunit, &out_local,
&out_tzoffset, False, None, False
)
elif iso_format:
string_to_dts_succeeded = not string_to_dts(
val, &dts, &out_bestunit, &out_local,
&out_tzoffset, False, fmt, exact
)
if not string_to_dts_failed:
# No error reported by string_to_dts, pick back up
# where we left off
value = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
if out_local == 1:
# Store the out_tzoffset in seconds
# since we store the total_seconds of
# dateutil.tz.tzoffset objects
tz = timezone(timedelta(minutes=out_tzoffset))
result_timezone[i] = tz
out_local = 0
out_tzoffset = 0
iresult[i] = value
check_dts_bounds(&dts)
continue
if string_to_dts_succeeded:
# No error reported by string_to_dts, pick back up
# where we left off
value = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
if out_local == 1:
# Store the out_tzoffset in seconds
# since we store the total_seconds of
# dateutil.tz.tzoffset objects
tz = timezone(timedelta(minutes=out_tzoffset))
result_timezone[i] = tz
out_local = 0
out_tzoffset = 0
iresult[i] = value
check_dts_bounds(&dts)
continue

if parse_today_now(val, &iresult[i], utc):
continue

# Some ISO formats can't be parsed by string_to_dts
# For example, 6-digit YYYYMD. So, if there's an error,
# try the string-matching code below.
# For example, 6-digit YYYYMD. So, if there's an error, and a format
# was specified, then try the string-matching code below. If the format
# specified was 'ISO8601', then we need to error, because
# only string_to_dts handles mixed ISO8601 formats.
if not string_to_dts_succeeded and fmt == "ISO8601":
raise ValueError(f"Time data {val} is not ISO8601 format")

# exact matching
if exact:
found = format_regex.match(val)
if not found:
raise ValueError(f"time data \"{val}\" doesn't "
f"match format \"{fmt}\"")
raise ValueError(
f"time data \"{val}\" doesn't match format \"{fmt}\""
)
if len(val) != found.end():
raise ValueError(
f"unconverted data remains: "
f'"{val[found.end():]}"'
"unconverted data remains when parsing with "
f"format \"{fmt}\": \"{val[found.end():]}\""
)

# search
else:
found = format_regex.search(val)
if not found:
raise ValueError(
f"time data \"{val}\" doesn't match "
f"format \"{fmt}\""
f"time data \"{val}\" doesn't match format \"{fmt}\""
)

iso_year = -1
Expand Down Expand Up @@ -504,7 +514,14 @@ def array_strptime(
result_timezone[i] = tz

except (ValueError, OutOfBoundsDatetime) as ex:
ex.args = (f"{str(ex)}, at position {i}",)
ex.args = (
f"{str(ex)}, at position {i}. You might want to try:\n"
" - passing ``format='ISO8601'`` if your strings are "
"all ISO8601 but not necessarily in exactly the same format;\n"
" - passing ``format='mixed'``, and the format will be "
"inferred for each element individually. "
"You might want to use ``dayfirst`` alongside this.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might still want to mention the suggestion to pass a specific format first?

For example, if an error occurs because the format is ambiguous based on the first value (dayfirst/daylast), and was inferred incorrectly:

pd.to_datetime(["12/01/2012", "13/01/2012"])
...
ValueError: time data "13/01/2012" doesn't match format "%m/%d/%Y", at position 1. You might want to try:
    - passing ``format='ISO8601'`` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing ``format='mixed'``, and the format will be inferred for each element individually. You might want to use ``dayfirst`` alongside this.

Here, I think the best is still to provide a manual format="%d/%m/%Y" (or to specify dayfirst=True, then the inference would work for this data)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, a fomatting nit: since this is plain text (and not rst that will be rendered), no need for double backticks, I would use single backticks which is a bit less visually distracting)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, thanks!

)
if is_coerce:
iresult[i] = NPY_NAT
continue
Expand Down
16 changes: 13 additions & 3 deletions pandas/core/tools/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -442,10 +442,11 @@ def _convert_listlike_datetimes(

arg = ensure_object(arg)

if format is None:
if format is None and format != "mixed":
format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)

if format is not None:
# `format` could not be inferred, or user asked for mixed-format parsing.
if format is not None and format != "mixed":
return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)

result, tz_parsed = objects_to_datetime64ns(
Expand Down Expand Up @@ -687,7 +688,7 @@ def to_datetime(
yearfirst: bool = False,
utc: bool = False,
format: str | None = None,
exact: bool = True,
exact: bool | lib.NoDefault = lib.no_default,
unit: str | None = None,
infer_datetime_format: lib.NoDefault | bool = lib.no_default,
origin: str = "unix",
Expand Down Expand Up @@ -759,13 +760,20 @@ def to_datetime(
<https://docs.python.org/3/library/datetime.html
#strftime-and-strptime-behavior>`_ for more information on choices, though
note that :const:`"%f"` will parse all the way up to nanoseconds.
You can also pass:

- "ISO8601", to parse any `ISO8601 <https://en.wikipedia.org/wiki/ISO_8601>`_
time string (not necessarily in exactly the same format);
- "mixed", to infer the format for each element individually. This is risky,
and you should probably use it along with `dayfirst`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that in that case that is best, but the typicaly caveat of that this dayfirst=True/False is not strict still applies, right?

Copy link
Member Author

@MarcoGorelli MarcoGorelli Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, but it would solve the typical example at least

In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], format='mixed', dayfirst=True)
Out[2]: DatetimeIndex(['2000-01-12', '2000-01-13'], dtype='datetime64[ns]', freq=None)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was just wondering if it would be useful to still call it out here explicitly (but it's quite explicit in the dayfirst entry in the docstring)

exact : bool, default True
Control how `format` is used:

- If :const:`True`, require an exact `format` match.
- If :const:`False`, allow the `format` to match anywhere in the target
string.

Cannot be used alongside ``format='ISO8601'`` or ``format='mixed'``.
unit : str, default 'ns'
The unit of the arg (D,s,ms,us,ns) denote the unit, which is an
integer or float number. This will be based off the origin.
Expand Down Expand Up @@ -997,6 +1005,8 @@ def to_datetime(
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2020-01-01 18:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
"""
if exact is not lib.no_default and format in {"mixed", "ISO8601"}:
raise ValueError("Cannot use 'exact' when 'format' is 'mixed' or 'ISO8601'")
if infer_datetime_format is not lib.no_default:
warnings.warn(
"The argument 'infer_datetime_format' is deprecated and will "
Expand Down
3 changes: 2 additions & 1 deletion pandas/tests/io/parser/test_parse_dates.py
Original file line number Diff line number Diff line change
Expand Up @@ -1724,7 +1724,8 @@ def test_parse_multiple_delimited_dates_with_swap_warnings():
with pytest.raises(
ValueError,
match=(
r'^time data "31/05/2000" doesn\'t match format "%m/%d/%Y", at position 1$'
r'^time data "31/05/2000" doesn\'t match format "%m/%d/%Y", '
r"at position 1. You might want to try:\n - passing ``format='ISO8601'``"
),
):
pd.to_datetime(["01/01/2000", "31/05/2000", "31/05/2001", "01/02/2000"])
Expand Down
Loading