Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: arithmetic can break equality-hash invariant for Timestamp during DST transition #60084

Closed
2 of 3 tasks
Kxnr opened this issue Oct 22, 2024 · 15 comments · Fixed by #60346
Closed
2 of 3 tasks

BUG: arithmetic can break equality-hash invariant for Timestamp during DST transition #60084

Kxnr opened this issue Oct 22, 2024 · 15 comments · Fixed by #60346
Assignees
Labels
Bug good first issue hashing hash_pandas_object Needs Tests Unit test(s) needed to prevent regressions Timestamp pd.Timestamp and associated methods
Milestone

Comments

@Kxnr
Copy link

Kxnr commented Oct 22, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# this is immediately following a DST transition
dt_str = "2023-11-05 01:00-08:00"
tz_str = "America/Los_Angeles"

ts1 = pd.Timestamp(dt_str, tz=tz_str)

# This is the shortest way I know of to reproduce--more complex math or `pd.date_range` can also
# create this behavior.
ts2 = ts1 + pd.Timedelta(hours=0)
                                                                                                                                                                                                                               
assert ts1 == ts2  # True
assert hash(ts1) == hash(ts2)  # False

Issue Description

Datetime arithmetic on a Timestamp instance on a DST transition can create two Timestamps that compare equal but have different hash values. Inspecting the example, the cause seems to be that ts1.fold == 0 and ts2.fold == 1, despite otherwise being equal and representing the same instant in time. This breaks the invariant documented in object.__hash__:

object.__hash__(self)
... The only required property is that objects which compare equal have the same hash value; ...

Converting to UTC or using pytz's normalization as ts.tz.normalize(ts) both serve as workarounds for this issue.

Expected Behavior

I expected Timestamp instances that compare equal to have the same hash value.

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.6
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : C.UTF-8

pandas : 2.2.3
numpy : 2.1.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.1.2
Cython : None
sphinx : None
IPython : 8.28.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

@Kxnr Kxnr added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2024
@rhshadrach
Copy link
Member

Thanks for the report! Agreed this would be a bug, I cannot reproduce on main but can on 2.3.x. Need to run a git bisect to see which PR fixed this and if tests need to be added.

@rhshadrach rhshadrach added hashing hash_pandas_object Timestamp pd.Timestamp and associated methods labels Oct 22, 2024
@Kxnr
Copy link
Author

Kxnr commented Oct 23, 2024

Thanks for taking a look so quickly! I bisected the last 1000 commits and found 96d732e as the first fixed commit, corresponding to #59089. On both this commit and main, I can reproduce by installing with pip install ".[timezone]" but not with a default build

@rhshadrach
Copy link
Member

Thanks! I think this could use a test. Contributions welcome!

@rhshadrach rhshadrach added Needs Tests Unit test(s) needed to prevent regressions good first issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 24, 2024
@vgauraha62
Copy link

is this issue resolved?

@rhshadrach
Copy link
Member

@vgauraha62 - a test needs to be added in order for the issue to be resolved.

@Kxnr
Copy link
Author

Kxnr commented Oct 25, 2024

I'll take a look at adding a test this weekend; I'll need to get up to speed on how things work around here, so others are welcome to pick this up if they have the bandwidth.

@rhshadrach is there any additional action needed to correct this with the timezone extra?

@rhshadrach
Copy link
Member

is there any additional action needed to correct this with the timezone extra?

Not that I can see, but if you believe so, could you elaborate?

@vgauraha62
Copy link

Please assign this issue to me, I'd add a test in a couple of hours and get this resolved

@vgauraha62
Copy link

@rhshadrach sir are you suggesting that I should create a branch and submit a pull request?

@saldanhad
Copy link
Contributor

Hi @vgauraha62, you can comment take to have this issue self-assigned to you. Please follow the applicable requirements in the PR checklist and remember to reference this issue in your PR.

@vgauraha62
Copy link

take

@Kxnr
Copy link
Author

Kxnr commented Oct 28, 2024

is there any additional action needed to correct this with the timezone extra?

Not that I can see, but if you believe so, could you elaborate?

Apologies, I thought the initial test script replicated the issue but I have to explicitly use a pytz timezone on latest main. The test script below replicates the original issue on main when the timezone extra is installed. This isn't a "real" failure from my usecase, just dotting my i's.

import pandas as pd
import pytz

# this is immediately following a DST transition
dt_str = "2023-11-05 01:00-08:00"
tz_str = "America/Los_Angeles"
pytz_tz = pytz.timezone(tz_str)

ts1 = pd.Timestamp(dt_str, tz=pytz_tz)
ts2 = ts1 + pd.Timedelta(hours=0)
                                                                                                                                                                                                                               
assert ts1 == ts2
assert hash(ts1) == hash(ts2) 
print("hash-equality invariant upheld")

@eightyseven
Copy link
Contributor

take

are you still working on this issue?I'm also interested in this one .

@KevsterAmp
Copy link
Contributor

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug good first issue hashing hash_pandas_object Needs Tests Unit test(s) needed to prevent regressions Timestamp pd.Timestamp and associated methods
Projects
None yet
6 participants