Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in row encoding: bad output in same cases for lists #20151

Closed
2 tasks done
itamarst opened this issue Dec 4, 2024 · 4 comments · Fixed by #20161
Closed
2 tasks done

Regression in row encoding: bad output in same cases for lists #20151

itamarst opened this issue Dec 4, 2024 · 4 comments · Fixed by #20161
Assignees
Labels
accepted Ready for implementation bug Something isn't working rust Related to Rust Polars

Comments

@itamarst
Copy link
Contributor

itamarst commented Dec 4, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from polars.testing import assert_series_equal

values = [[1, 2], None, [4, 5], [None, 3]]
# A list series:
list_series = pl.Series(values, dtype=pl.List(pl.Int64()))
# An array series with the same values:
array_series = pl.Series(values, dtype=pl.Array(pl.Int64(), 2))
# Convert the array series into a list series:
list_from_array_series = array_series.cast(list_series.dtype)

# These two series are seemingly identical:
assert_series_equal(list_series, list_from_array_series)


# BUT! When we encode them, the encoded result is different:
def row_encode(series: pl.Series) -> pl.Series:
    df = pl.DataFrame({"a": series})
    return df._row_encode([(False, False, True)])


encoded1 = row_encode(list_series)
encoded2 = row_encode(list_from_array_series)
# Using encoded1.dtype or encoded1.to_list() panics in different ways (one
# of them says "entered unreachable code"!!!) However, the _way_ the
# list_from_array_series gets row-encoded as bytes is something I've seen
# with pure Rust that doesn't use DataFrame._row_encode but rather
# encode_rows_unordered() so I think it's pretty low-level issue.
print(encoded1)
print(encoded2)

# Looking at the encoded bytes output from encoded2... it just looks wrong.
# There's no 3 in there, for example.

Log output

We expect the two results to be the same, and yet when we run the above:

shape: (4,)
Series: 'row_enc' [binary[offset]]
[
        b"\xfe\x01\x80\x00\x00\x00\x00\x00\x00\x01\xfe\x01\x80\x00\x00\x00\x00\x00\x00\x02\x01"
        b"\x00"
        b"\xfe\x01\x80\x00\x00\x00\x00\x00\x00\x04\xfe\x01\x80\x00\x00\x00\x00\x00\x00\x05\x01"
        b"\xfe\x00\x00\x00\x00\x00\x00\x00\x00\x00\xfe\x01\x80\x00\x00\x00\x00\x00\x00\x03\x01"
]
shape: (4,)
Series: 'row_enc' [binary[offset]]
[
        b"\xfe\x01\x80\x00\x00\x00\x00\x00\x00\x01\xfe\x01\x80\x00\x00\x00\x00\x00\x00\x02\x01"
        b"\x00"
        b"\xfe\x00\x00\x00\x00\x00\x00\x00\x00\x00\xfe\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01"
        b"\xfe\x01\x80\x00\x00\x00\x00\x00\x00\x04\xfe\x01\x80\x00\x00\x00\x00\x00\x00\x05\x01"
]

Issue description

Two seemingly identical Series ought to have same row encoding, or at the very least semantically matching row encoding. However... they don't. I've seen this behavior with much lower API access, so I don't think it's due to DataFrame._row_encode(), I think it's the low level code, and from manual testing with Rust I suspect this happened in past few weeks.

Expected behavior

The output should identical in both cases.

Installed versions

This is latest git as of December 4, 2024, on Linux.

@itamarst itamarst added bug Something isn't working needs triage Awaiting prioritization by a maintainer rust Related to Rust Polars labels Dec 4, 2024
@coastalwhite coastalwhite added accepted Ready for implementation and removed needs triage Awaiting prioritization by a maintainer labels Dec 4, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Dec 4, 2024
@coastalwhite coastalwhite self-assigned this Dec 4, 2024
@coastalwhite
Copy link
Collaborator

Yeah, I see now that masked out values are still being used. That is a problem. I will take a look at it.

@itamarst
Copy link
Contributor Author

itamarst commented Dec 4, 2024

This also suggests a need for more thorough unit testing, although to be fair I'm not sure I would ever think to write a test that looks like the reproducer 😀

@coastalwhite
Copy link
Collaborator

I think adding these kinds of list arrays to the parametric tests will help.

The unit tests for the row encoding are quite extensive and both the sorting and round tripping have parametric tests.

@itamarst
Copy link
Contributor Author

itamarst commented Dec 5, 2024

If there's a way to create them directly, yeah, I was just surprised that array->list conversion somehow broke things.

coastalwhite added a commit to coastalwhite/polars that referenced this issue Dec 5, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Dec 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation bug Something isn't working rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants