Unintuitive behavior when hashing list[cat] columns
#14829
Labels: bug, needs triage, python
Checks
Reproducible example
Polars panics when asked to hash a list[str] column, but will hash a list[cat] column without complaining. However, the resulting hash is a function of the physical representation, not the logical representation, which is surprising and unintuitive (and likely not what anybody wants).
At the very least, this should be documented. Because the current behavior is unintuitive and rarely what someone wants, raising an error might be better. Better yet, hashing of list[cat] columns (or even lists of any type!) could be implemented correctly.
Related issues: #10747, #12636, #4175, #2869, #13950
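To illustrate why hashing the physical representation is surprising, here is a minimal pure-Python sketch. It is not Polars code and makes no claim about Polars internals: `encode` and `digest` are hypothetical helpers, and the sketch assumes only that a categorical stores integer codes assigned in order of first appearance.

```python
import hashlib


def encode(rows):
    """Dictionary-encode lists of strings (hypothetical helper).

    Codes are assigned in order of first appearance, so the same string
    can receive a different code when the rows are ordered differently.
    """
    codes = {}
    return [[codes.setdefault(s, len(codes)) for s in row] for row in rows]


def digest(value):
    """Stable digest of a value's repr; a stand-in for a real hash function."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()


rows_a = [["x", "y"], ["z"]]
rows_b = [["z"], ["x", "y"]]  # identical rows, different row order

phys_a = encode(rows_a)  # physical codes for the first ordering
phys_b = encode(rows_b)  # physical codes for the second ordering

# Compare the row ["x", "y"], which is row 0 in rows_a and row 1 in rows_b.
physical_match = digest(phys_a[0]) == digest(phys_b[1])  # hashes of the codes
logical_match = digest(rows_a[0]) == digest(rows_b[1])    # hashes of the strings

print(physical_match, logical_match)
```

In this sketch the same logical row hashes differently once it has been encoded, because its codes depend on which strings were seen first.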
With the string cache off, define two dataframes with identical data but different row order.
We expect that
should be all true, and yet it isn't.
panics.
However,
is all true, as desired. Ideally, df.select(pl.col("a").hash()) should give the same result as df.select(pl.col("a").cast(pl.List(pl.String)).list.eval(pl.element().hash()).hash()). This was taken from a suggestion by @ion-elgreco (#13950 (comment)).
Note that casting from list[cat] to list[str] is necessary, as
is all false.
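The shape of that workaround (decode to strings, hash each element, then hash the list of element hashes) can be sketched in plain Python. This is an illustration under stated assumptions, not the Polars implementation: `element_hash`, `list_hash`, and `decode` are hypothetical names, and SHA-256 stands in for whatever hash function Polars uses.

```python
import hashlib


def element_hash(s):
    """Hash one string to a 64-bit integer (SHA-256 is only a stand-in)."""
    return int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "little")


def list_hash(strings):
    """Hash a list by combining the per-element hashes in order."""
    h = hashlib.sha256()
    for s in strings:
        h.update(element_hash(s).to_bytes(8, "little"))
    return int.from_bytes(h.digest()[:8], "little")


def decode(codes, categories):
    """Map physical codes back to their logical string values."""
    return [categories[c] for c in codes]


# Two hypothetical encodings of the same logical row ["x", "y"].
cats_a, row_a = ["x", "y", "z"], [0, 1]
cats_b, row_b = ["z", "x", "y"], [1, 2]

hash_a = list_hash(decode(row_a, cats_a))
hash_b = list_hash(decode(row_b, cats_b))

print(hash_a == hash_b)
```

Because the codes are decoded before hashing, the result depends only on the logical strings and their order within the list, not on how each frame happened to number its categories.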
Installed versions
polars-u64-idx 0.20.13