-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet: Ensure page statistics are written only when conifgured from the Arrow Writer #5181
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me thank you
)) | ||
} | ||
_ => None, | ||
let page_statistics = if let (Some(min), Some(max)) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code it would appear the way this was intended to work is for the encoder itself to not compute page statistics unless EnabledStatistics::Page. If EnabledStatistics::Chunk
the column stats are computed at a higher-level in write_batch_internal. What do you think about pushing this condition into ByteArrayEncoder
like it is for ColumnValueEncoderImpl
?
Edit: I'll have a play to see if I can't simplify this as a follow up, the current behaviour is rather confusing imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's this branch to calculate chunk statistics directly when EnabledStatistics::Chunk
. I think that just having it as the default (For both Page
and Chunk
) will probably simplify the code as you don't have to keep track of the chunk-level metadata when adding pages, but it might require a bit more work which is why I didn't end up going that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's what I was thinking also. I tried this out in #5183 perhaps you could take a look
Which issue does this PR close?
Closes #5162.
Rationale for this change
Currently, he ArrowWriter writes page-level statistics for byte-array-encoded types whether its configured to do so or not.
What changes are included in this PR?
Are there any user-facing changes?
No