Make Z-order work on strings with an identical prefix of length >= 14 #2844
Comments
This seems to be an issue with how we compute the z-order key. Here's the output of
Every row gets the same z-order key even though the rows have distinct values in the z-order column.
Update: This happens because we only look at the first 16 bytes of each column to compute the z-order key. See delta-rs/crates/core/src/operations/optimize.rs, lines 1426 to 1431 in de11e6b.
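To illustrate the effect described above, here is a minimal sketch in Python. It is not the delta-rs code: `truncated_key` is a hypothetical helper that keeps only the first `width` bytes of the raw UTF-8 value. It does not model the row encoding delta-rs uses, so its threshold is 16 bytes rather than the 14 characters reported in this issue.

```python
# Hypothetical model of the truncation, NOT the delta-rs implementation:
# only the first `width` bytes of a value contribute to the key.
def truncated_key(value: str, width: int = 16) -> bytes:
    raw = value.encode("utf-8")
    # Keep the leading bytes and zero-pad short values to a fixed width.
    return raw[:width].ljust(width, b"\x00")


# Zero-padded integers: 20 characters each, differing only in the last digit.
values = [f"{i:020d}" for i in range(5)]
keys = {truncated_key(v) for v in values}

print(len(set(values)), len(keys))  # 5 1
```

Five distinct values collapse to a single key, so sorting by the key cannot reorder the rows.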
I've updated the issue description ("with identical prefixes of at least 14 characters"). This limitation is mentioned in the PR for the original z-order implementation: #1429 (comment)
@wjones127 The z-order design document recommends an implementation that would avoid the issue of dropping bytes for long strings (decision 1, option 3). It seems we ended up going with option 1 instead, which does have that issue. Do you think option 3 is still a viable approach for us? Any pointers for how to implement this? Another, simpler option would be to make the number of significant bytes per z-order column configurable.
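For the simpler option, a rough sketch of what a configurable width could look like, written in Python for brevity. The function name `z_order_key` and the parameter `significant_bytes` are made up for illustration and do not exist in delta-rs. The sketch interleaves raw bytes directly and ignores nulls and the actual row encoding.

```python
# Illustrative sketch only: bit-interleave the first `significant_bytes`
# bytes of every column. Names and signature are hypothetical.
def z_order_key(columns: list[bytes], significant_bytes: int) -> bytes:
    # Truncate or zero-pad every column to the same fixed width.
    padded = [c[:significant_bytes].ljust(significant_bytes, b"\x00") for c in columns]
    n_bits = significant_bytes * 8
    key = 0
    # Walk bit positions from most to least significant, taking one bit
    # from each column in turn.
    for bit in range(n_bits):
        for col in padded:
            byte = col[bit // 8]
            key = (key << 1) | ((byte >> (7 - bit % 8)) & 1)
    return key.to_bytes(n_bits * len(columns) // 8, "big")


a = b"2024-01-01T12:00:00Z"
b = b"2024-01-01T12:00:01Z"
print(z_order_key([a], 16) == z_order_key([b], 16))  # True
print(z_order_key([a], 20) < z_order_key([b], 20))   # True
```

With 16 significant bytes the two timestamps share a key. With 20 the keys differ and keep the original ordering. The trade-off is key size, which grows with the width times the number of columns.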
@cjolowicz feel free to implement option 3. You could also check which implementation eventually got into delta-spark and work from there. Do you want me to assign this issue to you?
Environment
Delta-rs version:
0.19.1
Binding:
Python and Rust
Environment:
Bug
What happened:
Apply z-order to a Delta Table on a column that contains strings with identical prefixes of at least 14 characters. The records in the new Parquet files retain their original order.
I initially witnessed this when z-ordering a large partition on ISO 8601 timestamps using delta-rs in Rust. I've since reproduced this with Python bindings and a small data frame using strings containing zero-padded integers (see repro below).
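Both kinds of values easily share prefixes this long. A quick check with made-up sample values (not the data from the original table):

```python
import os.path


def shared_prefix_len(values: list[str]) -> int:
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings, not only on paths.
    return len(os.path.commonprefix(values))


# Hypothetical samples: timestamps within the same hour, and integers
# zero-padded to 20 digits.
timestamps = ["2024-05-01T12:00:00Z", "2024-05-01T12:30:15Z", "2024-05-01T12:59:59Z"]
padded = [f"{i:020d}" for i in range(100)]

print(shared_prefix_len(timestamps))  # 14
print(shared_prefix_len(padded))      # 18
```

ISO 8601 timestamps from the same hour already share 14 characters (date, `T`, hour and the following colon), which is the threshold reported in this issue.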
What you expected to happen:
The resulting Parquet files are ordered by the column specified for z-ordering.
How to reproduce it:
Run this with uv:
Output:
More details:
N/A