Unintuitive behavior when hashing list[cat] columns #14829

Open
2 tasks done
shenker opened this issue Mar 3, 2024 · 2 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@shenker
Contributor

shenker commented Mar 3, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Polars panics when asked to hash a list[str] column, but will hash a list[cat] column without complaining. However, the resulting hash is a function of the physical representation, not the logical representation, which is surprising and unintuitive behavior (and likely not what anybody wants).

At the very least, this should be documented. Because the current behavior is unintuitive and rarely what someone might want, raising an error might be better. Best yet, hashing of list[cat] columns (or even a list of any type!) could be implemented correctly.

Related issues: #10747, #12636, #4175, #2869, #13950

With the string cache off, define two dataframes with identical data but different row order.

import polars as pl

df = pl.DataFrame({"a": [["x","y","z"],["x","y","z"],["z","z","z"]]}, schema=dict(a=pl.List(pl.Categorical)))
df2 = pl.DataFrame({"a": [["z","z","z"],["x","y","z"],["x","y","z"]]}, schema=dict(a=pl.List(pl.Categorical)))

We expect that

df.select(pl.col("a").hash()) == df2.select(pl.col("a").hash())[::-1]

should be all true, and yet it isn't.
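
A toy model in plain Python (not Polars internals) suggests why the comparison fails. It assumes that, with the string cache off, each column assigns integer codes to its categories in order of first appearance, and that the hash is computed over those codes rather than over the strings. `encode` and `row_hash` are hypothetical helpers for illustration only.

```python
import hashlib


def encode(rows):
    """Toy categorical encoding: codes assigned in order of first appearance (assumption)."""
    categories = {}
    coded = [[categories.setdefault(v, len(categories)) for v in row] for row in rows]
    return coded, list(categories)


def row_hash(values):
    """Deterministic stand-in for a row hash; not the hash function Polars uses."""
    return hashlib.sha256(repr(list(values)).encode()).hexdigest()


rows1 = [["x", "y", "z"], ["x", "y", "z"], ["z", "z", "z"]]
rows2 = [["z", "z", "z"], ["x", "y", "z"], ["x", "y", "z"]]

codes1, cats1 = encode(rows1)  # x->0, y->1, z->2
codes2, cats2 = encode(rows2)  # z->0, x->1, y->2

# Hash the physical codes, comparing frame 1 against reversed frame 2.
physical_match = [
    row_hash(a) == row_hash(b) for a, b in zip(codes1, codes2[::-1])
]
# Hash the logical string values in the same way.
logical_match = [
    row_hash(a) == row_hash(b) for a, b in zip(rows1, rows2[::-1])
]

print(physical_match)  # [False, False, False]
print(logical_match)   # [True, True, True]
```

In this sketch the same logical row `["x", "y", "z"]` is stored as `[0, 1, 2]` in one frame and `[1, 2, 0]` in the other, so any hash over the codes disagrees.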

df.select(pl.col("a").cast(pl.List(pl.String)).hash())

panics.

However,

expr = pl.col("a").cast(pl.List(pl.String)).list.eval(pl.element().hash()).hash()
df.select(expr) == df2.select(expr)[::-1]

is all true, as desired. Ideally, df.select(pl.col("a").hash()) should give the same result as df.select(pl.col("a").cast(pl.List(pl.String)).list.eval(pl.element().hash()).hash()). This was taken from a suggestion by @ion-elgreco (#13950 (comment)).

Note that casting from list[cat] to list[str] is necessary, as

expr_no_cast = pl.col("a").list.eval(pl.element().hash()).hash()
df.select(expr_no_cast) == df2.select(expr_no_cast)[::-1]

is all false.
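
The difference between the two variants can be sketched in plain Python, under the same toy assumption that codes are assigned per column in order of first appearance. Here `hash_rows` is a hypothetical helper that hashes each element and then hashes the list of element hashes; passing `categories` stands in for the cast to list[str]. This models the idea only and does not reproduce Polars' actual hash values.

```python
import hashlib


def h(obj):
    """Deterministic stand-in hash; not the hash function Polars uses."""
    return hashlib.sha256(repr(obj).encode()).hexdigest()


def hash_rows(coded_rows, categories=None):
    """Hash each element, then hash the list of element hashes.

    If `categories` is given, codes are decoded to strings first
    (the analogue of casting list[cat] to list[str]).
    """
    out = []
    for row in coded_rows:
        elems = [categories[c] for c in row] if categories is not None else row
        out.append(h([h(e) for e in elems]))
    return out


# Assumed physical layout of the two frames from the example.
cats1, codes1 = ["x", "y", "z"], [[0, 1, 2], [0, 1, 2], [2, 2, 2]]
cats2, codes2 = ["z", "x", "y"], [[0, 0, 0], [1, 2, 0], [1, 2, 0]]

with_cast = [
    a == b
    for a, b in zip(hash_rows(codes1, cats1), hash_rows(codes2, cats2)[::-1])
]
without_cast = [
    a == b for a, b in zip(hash_rows(codes1), hash_rows(codes2)[::-1])
]

print(with_cast)     # [True, True, True]
print(without_cast)  # [False, False, False]
```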

Installed versions

polars-u64-idx 0.20.13

@shenker added the bug, needs triage, and python labels on Mar 3, 2024
@ritchie46
Member

which is surprising and unintuitive behavior (and likely not what anybody wants).

It is. Because we want the Eq and Hash symmetry to hold. However, you should only compare hashes that come from the same underlying categorical rev map.

@mcrumiller
Contributor

Here is a simple example showing the confusion in a more obvious way:

>>> import polars as pl
>>> s1 = pl.Series(["a"], dtype=pl.Categorical)
>>> s2 = pl.Series(["b"], dtype=pl.Categorical)
>>> s1 == s2
False
>>> s1.hash() == s2.hash()
True

Here Eq and Hash symmetry does not hold. @ritchie46's "However" covers it, but it still may be a source of confusion, as the reason for the hash equality is that the underlying implementation of categorical happens to hash to the same value.
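
A plain-Python sketch of both points, assuming (as a toy model, not Polars internals) that each Series built without a shared mapping numbers its categories from 0, and that the hash is taken over those codes. `encode` and `h` are hypothetical helpers; the `shared` dict stands in for a common rev map.

```python
import hashlib


def h(obj):
    """Deterministic stand-in hash; not the hash function Polars uses."""
    return hashlib.sha256(repr(obj).encode()).hexdigest()


def encode(values, categories=None):
    """Assign codes in order of first appearance, optionally reusing a mapping."""
    categories = {} if categories is None else categories
    codes = [categories.setdefault(v, len(categories)) for v in values]
    return codes, categories


# Separate mappings: "a" and "b" both become code 0, so the hashes collide.
codes_a, _ = encode(["a"])
codes_b, _ = encode(["b"])
separate_equal = h(codes_a) == h(codes_b)

# Shared mapping: "a" -> 0 and "b" -> 1, so the hashes differ.
shared = {}
codes_a2, _ = encode(["a"], shared)
codes_b2, _ = encode(["b"], shared)
shared_equal = h(codes_a2) == h(codes_b2)

print(separate_equal)  # True
print(shared_equal)    # False
```

Under this model, hash equality only says something about value equality when both sides were encoded against the same mapping, which matches the caveat about comparing hashes from the same rev map.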
