BUG: replacing out of bound datetimes is not possible #36782

Closed
3 tasks done
krassowski opened this issue Oct 1, 2020 · 7 comments · Fixed by #45525

krassowski commented Oct 1, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

from datetime import datetime
from pandas import DataFrame

# typo in the data entry, should have been 2020!
df = DataFrame({'x': [datetime(2920, 10, 1)]})
# let's try to fix that in code:
df.x.replace({datetime(2920, 10, 1): datetime(2020, 10, 1)})
# or
df.replace({datetime(2920, 10, 1): datetime(2020, 10, 1)})

Raises:

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2920-10-01 00:00:00
Full traceback details
OutOfBoundsDatetime                       Traceback (most recent call last)
<ipython-input-1-a4bdb67c8945> in <module>
      5 df = DataFrame({'x': [datetime(2920, 10, 1)]})
      6 # let's try to fix that in code:
----> 7 df.x.replace({datetime(2920, 10, 1): datetime(2020, 10, 1)})

/site-packages/pandas/core/series.py in replace(self, to_replace, value, inplace, limit, regex, method)
   4561         method="pad",
   4562     ):
-> 4563         return super().replace(
   4564             to_replace=to_replace,
   4565             value=value,

/site-packages/pandas/core/generic.py in replace(self, to_replace, value, inplace, limit, regex, method)
   6495                 to_replace, value = keys, values
   6496 
-> 6497             return self.replace(
   6498                 to_replace, value, inplace=inplace, limit=limit, regex=regex
   6499             )

/site-packages/pandas/core/series.py in replace(self, to_replace, value, inplace, limit, regex, method)
   4561         method="pad",
   4562     ):
-> 4563         return super().replace(
   4564             to_replace=to_replace,
   4565             value=value,

/site-packages/pandas/core/generic.py in replace(self, to_replace, value, inplace, limit, regex, method)
   6538                         )
   6539                     self._consolidate_inplace()
-> 6540                     new_data = self._mgr.replace_list(
   6541                         src_list=to_replace,
   6542                         dest_list=value,

/site-packages/pandas/core/internals/managers.py in replace_list(self, src_list, dest_list, inplace, regex)
    640         mask = ~isna(values)
    641 
--> 642         masks = [comp(s, mask, regex) for s in src_list]
    643 
    644         result_blocks = []

/site-packages/pandas/core/internals/managers.py in <listcomp>(.0)
    640         mask = ~isna(values)
    641 
--> 642         masks = [comp(s, mask, regex) for s in src_list]
    643 
    644         result_blocks = []

/site-packages/pandas/core/internals/managers.py in comp(s, mask, regex)
    633                 return ~mask
    634 
--> 635             s = com.maybe_box_datetimelike(s)
    636             return _compare_or_regex_search(values, s, regex, mask)
    637 

/site-packages/pandas/core/common.py in maybe_box_datetimelike(value, dtype)
     88 
     89     if isinstance(value, (np.datetime64, datetime)):
---> 90         value = tslibs.Timestamp(value)
     91     elif isinstance(value, (np.timedelta64, timedelta)):
     92         value = tslibs.Timedelta(value)

pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_datetime_to_tsobject()

pandas/_libs/tslibs/np_datetime.pyx in pandas._libs.tslibs.np_datetime.check_dts_bounds()

Problem description

It is useful to be able to replace out-of-bounds dates precisely because pandas does not support them. However, this is currently difficult because replace() no longer works for them. This is a regression: the above code worked in pandas 1.0.4, but does not work in pandas 1.1.2 or on master.
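
For context, here is a minimal sketch of why the year 2920 is out of bounds. It assumes the nanosecond-resolution timestamps named in the error message are stored as a signed 64-bit count of nanoseconds since 1970-01-01, with the smallest int64 value reserved for NaT. The helper name `fits_in_ns_range` is hypothetical and uses only the standard library; it is not part of pandas.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX_NS = 2**63 - 1  # largest nanosecond offset a signed 64-bit integer can hold

# datetime only has microsecond resolution, so truncate the nanosecond limit
# to the last whole microsecond that still fits.
max_us = INT64_MAX_NS // 1000
upper = EPOCH + timedelta(microseconds=max_us)
lower = EPOCH - timedelta(microseconds=max_us)


def fits_in_ns_range(dt):
    """Hypothetical helper: True if a naive datetime fits the assumed ns range."""
    return lower <= dt <= upper


print(upper)                                    # 2262-04-11 23:47:16.854775
print(fits_in_ns_range(datetime(2020, 10, 1)))  # True
print(fits_in_ns_range(datetime(2920, 10, 1)))  # False
```

Under that assumption, any attempt to box datetime(2920, 10, 1) into a nanosecond Timestamp has to fail, which matches the last frames of the traceback above.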

Expected Output

df should be equal to DataFrame({'x': [datetime(2020, 10, 1)]})

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-48-generic
Version : #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.2
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.3.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.49.0

@krassowski krassowski added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 1, 2020
@krassowski
Author

Related to #34186.

krassowski commented Oct 1, 2020

Workaround:

df.applymap(lambda x: datetime(2020, 10, 1) if x == datetime(2920, 10, 1) else x)

I would expect replace() to work like the above workaround.
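
A per-column variant of the same idea, as a rough sketch: it compares values with plain Python equality, so the out-of-bounds datetime never has to be converted by the caller. The helper name `replace_exact` is hypothetical, and the resulting dtype is deliberately left unchecked here.

```python
from datetime import datetime

import pandas as pd


def replace_exact(series, old, new):
    """Hypothetical helper: swap `old` for `new` using element-wise equality."""
    return series.map(lambda value: new if value == old else value)


df = pd.DataFrame({"x": [datetime(2920, 10, 1)]})

# Series.map returns a new Series, so the result has to be assigned back.
fixed = df.assign(
    x=replace_exact(df["x"], datetime(2920, 10, 1), datetime(2020, 10, 1))
)
print(fixed)
```

Like applymap, this returns a new object rather than modifying df in place.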

torsive commented Oct 1, 2020

Hello, I'm interested in solving this bug. I would like to ask a couple of questions about the behaviour of the workaround code, and whether these behaviours are expected in the final fix for the bug.

Code:

from datetime import datetime
from pandas import DataFrame

# typo in the data entry, should have been 2020!
df = DataFrame({'x': [datetime(2920, 10, 1)]})

'''
Trying to fix the data entry using the workaround.
I also found a small typo in your code:
I believe it should be x == datetime(2920, 10, 1), not x == datetime(2092, 10, 1) as written in the workaround.
'''
df.applymap(lambda x: datetime(2020, 10, 1) if x == datetime(2920, 10, 1) else x)
df.x

Output:

0    2920-10-01 00:00:00
Name: x, dtype: object

After assigning the result of df.applymap(lambda x: datetime(2020, 10, 1) if x == datetime(2920, 10, 1) else x) back to df, I got the following output:

Code:

from datetime import datetime
from pandas import DataFrame

# typo in the data entry, should have been 2020!
df = DataFrame({'x': [datetime(2920, 10, 1)]})

# trying to assign value of df.applymap into df 
df = df.applymap(lambda x: datetime(2020, 10, 1) if x == datetime(2920, 10, 1) else x)
df.x

Output:

0   2020-10-01
Name: x, dtype: datetime64[ns]

I would like to know whether the change in the dtype of x from object to datetime64[ns] is part of the intended solution to this bug, or whether df.x is supposed to remain object and the conversion to datetime64 is an unintended side effect of the workaround.

I would also like to ask about the apparent loss of precision: initially df.x displayed the time component as well, 2920-10-01 00:00:00.

After the reassignment the time component is no longer shown: 2020-10-01. Is this an unintended behaviour of the workaround, or part of the intended solution?
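
On the display question, the following sketch suggests that nothing is lost and only the formatting differs. It assumes that cells of an object column are rendered with the datetime's own string form, and relies on the fact that a datetime64 column omits the time when every value falls exactly on midnight. The variable names are illustrative only.

```python
from datetime import datetime

import pandas as pd

# A plain datetime always prints its time component, even at midnight.
print(str(datetime(2920, 10, 1)))  # 2920-10-01 00:00:00

# datetime64 column where every value is at midnight: only the date is shown.
midnight = pd.Series(pd.to_datetime(["2020-10-01"]))
print(midnight)

# As soon as one value has a non-zero time, the time is shown for all rows.
mixed = pd.Series(pd.to_datetime(["2020-10-01 00:00:00", "2020-10-02 12:30:00"]))
print(mixed)

# The stored value is still exactly midnight on that date.
print(midnight.iloc[0] == datetime(2020, 10, 1, 0, 0, 0))  # True
```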

Output of pd.show_versions()

Installed Versions

commit           : 2a7d3326dee660824a8433ffd01065f8ac37f7d6
python           : 3.6.9.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.19.112+
Version          : #1 SMP Thu Jul 23 08:00:38 PDT 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.2
numpy            : 1.18.5
pytz             : 2018.9
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 50.3.0
Cython           : 0.29.21
pytest           : 3.6.4
hypothesis       : None
sphinx           : 1.8.5
blosc            : None
feather          : 0.4.1
xlsxwriter       : None
lxml.etree       : 4.2.6
html5lib         : 1.0.1
pymysql          : None
psycopg2         : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2           : 2.11.2
IPython          : 5.5.0
pandas_datareader: 0.9.0
bs4              : 4.6.3
bottleneck       : 1.3.2
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 2.5.9
pandas_gbq       : 0.13.2
pyarrow          : 0.14.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.19
tables           : 3.4.4
tabulate         : 0.8.7
xarray           : 0.15.1
xlrd             : 1.1.0
xlwt             : 1.3.0
numba            : 0.48.0

torsive commented Oct 1, 2020

take

@krassowski
Author

In pandas 1.0.4:

df = DataFrame({'x': [datetime(2920, 10, 1)]})
df.dtypes

Out:

x    object
dtype: object

df.replace({datetime(2920, 10, 1): datetime(2020, 10, 1)}).dtypes

Out:

x    datetime64[ns]
dtype: object

Thus, the workaround has the same side effect that replace() had in the previous version.
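
A possible way to picture that side effect, as a sketch only: once every value in an object column fits the supported range, pandas' object inference can turn the column into a datetime64 dtype. Whether replace() goes through the same inference internally is an assumption here and is not confirmed anywhere in this thread.

```python
from datetime import datetime

import pandas as pd

# An object column that now only holds an in-bounds datetime.
in_bounds = pd.Series([datetime(2020, 10, 1)], dtype=object)

# infer_objects() attempts a soft conversion of object columns to a better dtype.
inferred = in_bounds.infer_objects()
print(in_bounds.dtype, "->", inferred.dtype)

# If the object dtype is wanted after all, it can be requested explicitly.
kept = inferred.astype(object)
print(kept.dtype)
```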


torsive commented Oct 2, 2020

Cool, I'll look into it.

@mroeschke
Member

This looks to work on master. Could use a test

In [9]: from datetime import datetime
   ...: from pandas import DataFrame
   ...:
   ...: # typo in the data entry, should have been 2020!
   ...: df = DataFrame({'x': [datetime(2920, 10, 1)]})

In [10]: df
Out[10]:
                     x
0  2920-10-01 00:00:00

In [11]: df.x.replace({datetime(2920, 10, 1): datetime(2020, 10, 1)})
Out[11]:
0   2020-10-01
Name: x, dtype: datetime64[ns]

In [12]: df
Out[12]:
                     x
0  2920-10-01 00:00:00

In [13]: df.replace({datetime(2920, 10, 1): datetime(2020, 10, 1)})
Out[13]:
           x
0 2020-10-01
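
A regression test for this could look roughly like the sketch below. The test name is hypothetical and this is not the test that was eventually merged. It only checks values, since the dtype of the result is not something this thread settles.

```python
from datetime import datetime

import pandas as pd


def test_replace_out_of_bounds_datetime():
    # Hypothetical test name; the data mirrors the example in the issue description.
    old, new = datetime(2920, 10, 1), datetime(2020, 10, 1)
    df = pd.DataFrame({"x": [old]})

    frame_result = df.replace({old: new})
    series_result = df["x"].replace({old: new})

    # Values only; the resulting dtype is deliberately not asserted.
    assert frame_result["x"].tolist() == [new]
    assert series_result.tolist() == [new]
    # replace() returns a new object, so the original frame keeps the bad value.
    assert df["x"].tolist() == [old]
```

It can be run with pytest, or called directly as a plain function.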

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 13, 2021
@jreback jreback added this to the 1.5 milestone Jan 23, 2022