Fix bug while writing parquet with empty lists of structs #1166
Fix a bug in the definition level calculation for fields nested within a struct that is itself inside a list. When a list is empty or null, parquet records a null entry for the nested field, whereas in arrow the value is simply absent from the child array. When serializing an immediate child of the list, the list offsets are used to calculate the correct definition levels for its children, but that information is not carried further to fields nested deeper (e.g., fields of a struct within a list). This (somewhat hacky) fix treats a struct within a list as if it were a list.
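To make the mismatch concrete, here is a minimal sketch in plain Rust (no arrow crate involved). The function names `level_slots` and `child_len` and the offsets in the usage note are hypothetical illustrations, not code from this PR. It only shows why the list offsets matter: parquet needs one level entry per element plus one for every empty or null list, while the arrow child array holds only the elements.

```rust
/// Number of (repetition, definition) level entries a parquet leaf needs
/// for a list column described by arrow-style offsets.
/// An empty or null list still occupies exactly one entry.
fn level_slots(offsets: &[i32]) -> usize {
    offsets
        .windows(2)
        .map(|w| ((w[1] - w[0]) as usize).max(1))
        .sum()
}

/// Number of values actually stored in the arrow child array.
fn child_len(offsets: &[i32]) -> usize {
    match (offsets.first(), offsets.last()) {
        (Some(first), Some(last)) => (last - first) as usize,
        _ => 0,
    }
}
```

For the assumed offsets `[0, 1, 1, 1, 3, 4, 5]` (six lists, two of them empty or null), `level_slots` yields 7 while `child_len` yields 5. A field nested deeper than the immediate child has to account for that difference too, which is the part that was being lost.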
Codecov Report
@@            Coverage Diff             @@
##           master    #1166      +/-   ##
==========================================
+ Coverage   82.55%   82.59%   +0.04%
==========================================
  Files         173      173
  Lines       50673    50753      +80
==========================================
+ Hits        41833    41921      +88
+ Misses       8840     8832       -8
I am not an expert in this area, so I would definitely appreciate a review from @nevi-me (or @tustvold, as I believe he has been ~~cursing~~ studying repetition and definition levels).
However, even if this case has some other subtlety we missed, I think the added coverage in this PR means the overall code is better than without it, so I approve.
Thank you @helgikrs -- very helpful to have someone else looking at this stuff. 🏆
let list_level =
    &batch_level.calculate_array_levels(rb.column(0), rb.schema().field(0))[0];

let expected_level = LevelInfo {
It would be great if someone who knows this better than I do could double-check this.
I'm not super confident in this either; it would be great if someone with knowledge about the details of this code could chime in.
I compared the definition and repetition levels with what the C++ parquet writer produces. I exported the above record batch and used the C++ parquet writer to generate a parquet file. I then ran parquet-dump on the resulting file, which produced the following:
value 1: R:0 D:4 V:1
value 2: R:0 D:1 V:<null>
value 3: R:0 D:0 V:<null>
value 4: R:0 D:2 V:<null>
value 5: R:1 D:2 V:<null>
value 6: R:0 D:3 V:<null>
value 7: R:0 D:4 V:2
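As a cross-check on the dump above, here is a small self-contained sketch in plain Rust that derives the same levels by hand. It assumes a schema inferred only from the dump (a nullable list of nullable structs with a single nullable integer field, hence a maximum definition level of 4); the `Row` alias, the `Level` struct, and both functions are hypothetical and are not part of the arrow or parquet crates.

```rust
/// Hypothetical row model: a nullable list of nullable structs,
/// each struct holding one nullable i32 field.
type Row = Option<Vec<Option<Option<i32>>>>;

#[derive(Debug, PartialEq)]
struct Level {
    rep: i16,
    def: i16,
    value: Option<i32>,
}

/// Compute (repetition, definition, value) entries for the leaf field.
/// Definition levels under the assumed schema:
///   0 = list is null, 1 = list is empty, 2 = struct is null,
///   3 = field is null, 4 = field has a value.
fn leaf_levels(rows: &[Row]) -> Vec<Level> {
    let mut out = Vec::new();
    for row in rows {
        match row {
            // A null or empty list still emits exactly one entry.
            None => out.push(Level { rep: 0, def: 0, value: None }),
            Some(items) if items.is_empty() => {
                out.push(Level { rep: 0, def: 1, value: None })
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    // The first element starts a new row; later ones repeat the list.
                    let rep = if i == 0 { 0 } else { 1 };
                    let (def, value) = match item {
                        None => (2, None),
                        Some(None) => (3, None),
                        Some(Some(v)) => (4, Some(*v)),
                    };
                    out.push(Level { rep, def, value });
                }
            }
        }
    }
    out
}

/// Rows chosen to match the seven entries in the dump (an assumption
/// about the test batch, reconstructed from the levels alone).
fn example_rows() -> Vec<Row> {
    vec![
        Some(vec![Some(Some(1))]), // value 1
        Some(vec![]),              // value 2: empty list
        None,                      // value 3: null list
        Some(vec![None, None]),    // values 4 and 5: two null structs
        Some(vec![Some(None)]),    // value 6: struct with null field
        Some(vec![Some(Some(2))]), // value 7
    ]
}
```

Calling `leaf_levels(&example_rows())` gives seven entries whose repetition and definition levels line up with the R and D columns listed above.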
Can confirm, these numbers check out 👍
    .append_value(2)
    .unwrap();
values.append(true).unwrap();
list_builder.append(true).unwrap();
I double-checked that the code matches the comments about what structure is intended to be created.
cc @mosyp and @chadbrewbaker
Thanks @helgikrs!
Which issue does this PR close?
Closes #703
Are there any user-facing changes?
No.