Unintuitive behavior when hashing `list[cat]` columns #14829

shenker · 2024-03-03T19:45:49Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Polars panics when asked to hash a list[str] column, but will hash a list[cat] column without complaining. However, the resulting hash is a function of the physical representation, not the logical representation, which is surprising and unintuitive behavior (and likely not what anybody wants).

At the very least, this should be documented. Because the current behavior is unintuitive and rarely what someone might want, raising an error might be better. Best yet, hashing of list[cat] columns (or even a list of any type!) could be implemented correctly.

Related issues: #10747, #12636, #4175, #2869, #13950

With the string cache off, define two dataframes with identical data but different row order.

import polars as pl

df = pl.DataFrame({"a": [["x","y","z"],["x","y","z"],["z","z","z"]]}, schema=dict(a=pl.List(pl.Categorical)))
df2 = pl.DataFrame({"a": [["z","z","z"],["x","y","z"],["x","y","z"]]}, schema=dict(a=pl.List(pl.Categorical)))

We expect that

df.select(pl.col("a").hash()) == df2.select(pl.col("a").hash())[::-1]

should be all true, and yet it isn't.
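To make the physical-versus-logical distinction concrete, here is a minimal pure-Python sketch. It is a toy model, not Polars' implementation: the `encode` helper is hypothetical and simply assumes that each frame assigns integer codes to its strings in order of first appearance. Under that assumption, the same logical rows get different codes in the two frames, so anything computed from the codes (a hash included) sees different inputs.

```python
# Toy model only: `encode` is a hypothetical helper, not a Polars API.
# Assumption: each frame builds its own string -> code mapping,
# assigning codes in order of first appearance.

def encode(rows):
    """Dictionary-encode rows of strings with a mapping local to this call."""
    codes = {}
    encoded = [tuple(codes.setdefault(s, len(codes)) for s in row) for row in rows]
    return encoded, codes

# Same data as df and df2 above, as plain Python lists.
rows1 = [["x", "y", "z"], ["x", "y", "z"], ["z", "z", "z"]]
rows2 = [["z", "z", "z"], ["x", "y", "z"], ["x", "y", "z"]]

enc1, codes1 = encode(rows1)  # x=0, y=1, z=2
enc2, codes2 = encode(rows2)  # z=0, x=1, y=2

# Compare row i of the first frame with row i of the reversed second frame.
physical_match = [a == b for a, b in zip(enc1, enc2[::-1])]
logical_match = [a == b for a, b in zip(rows1, rows2[::-1])]

print(physical_match)  # [False, False, False]
print(logical_match)   # [True, True, True]
```

The logical rows line up once the second frame is reversed, but the per-frame codes do not, which is the mismatch described above.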

df.select(pl.col("a").cast(pl.List(pl.String)).hash())

panics.

However,

df.select(pl.col("a").cast(pl.List(pl.String)).list.eval(pl.element().hash()).hash()) == df2.select(pl.col("a").cast(pl.List(pl.String)).list.eval(pl.element().hash()).hash())[::-1]

is all true, as desired. Ideally, df.select(pl.col("a").hash()) should give the same result as df.select(pl.col("a").cast(pl.List(pl.String)).list.eval(pl.element().hash()).hash()). This was taken from a suggestion by @ion-elgreco (#13950 (comment)).

Note that casting from list[cat] to list[str] is necessary, as

df.select(pl.col("a").list.eval(pl.element().hash()).hash()) == df2.select(pl.col("a").list.eval(pl.element().hash()).hash())[::-1]

is all false.

Installed versions

polars-u64-idx 0.20.13

ritchie46 · 2024-03-04T06:53:38Z

> which is surprising and unintuitive behavior (and likely not what anybody wants).

It is. Because we want the Eq and Hash symmetry to hold. However, you should only compare hashes that come from the same underlying categorical rev map.

mcrumiller · 2024-03-04T21:59:47Z

Here is a simple example showing the confusion in a more obvious way:

>>> s1 = pl.Series(["a"], dtype=pl.Categorical)
>>> s2 = pl.Series(["b"], dtype=pl.Categorical)
>>> s1 == s2
False
>>> s1.hash() == s2.hash()
True

Here Eq and Hash symmetry does not hold. @ritchie46's "However" covers it, but it still may be a source of confusion, as the reason for the hash equality is that the underlying implementation of categorical happens to hash to the same value.
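The same effect can be sketched without Polars. This is a toy model under a stated assumption (codes are assigned in order of first appearance); `encode` is a hypothetical helper, not a Polars function. With separate mappings both single-value series end up with code 0, so a hash over the codes cannot tell them apart. With one shared mapping, which is roughly the situation @ritchie46 describes as "the same underlying categorical rev map", the codes differ.

```python
# Toy model only: `encode` is a hypothetical helper, not a Polars API.
# Assumption: codes are assigned in order of first appearance in a mapping.

def encode(values, mapping):
    """Encode strings as integer codes, extending `mapping` as needed."""
    return [mapping.setdefault(v, len(mapping)) for v in values]

# Separate mappings, one per series: "a" and "b" both become code 0.
codes_a = encode(["a"], {})
codes_b = encode(["b"], {})
print(codes_a == codes_b)  # True, although "a" != "b"

# One shared mapping for both series: the codes now differ.
shared = {}
shared_a = encode(["a"], shared)
shared_b = encode(["b"], shared)
print(shared_a == shared_b)  # False
```

In this model, comparing codes (or hashes of codes) is only meaningful when both sides were encoded against the same mapping.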

shenker added the labels bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) on Mar 3, 2024.
