-
Notifications
You must be signed in to change notification settings - Fork 876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No longer write Parquet column metadata after column chunks *and* in the footer #6117
Changes from all commits
741bbf6
8f76248
bb5f12b
756b1fb
fe04e09
2e7f7ef
effccc1
649d09d
05e681d
e40b311
81c34ac
3bc9987
20e11ec
095130f
a6353d1
d122b1f
957499d
a033e43
571ce65
07f9a1d
444b14f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -682,7 +682,12 @@ impl ColumnChunkMetaData { | |
self.file_path.as_deref() | ||
} | ||
|
||
/// Byte offset in `file_path()`. | ||
/// Byte offset of `ColumnMetaData` in `file_path()`. | ||
/// | ||
/// Note that the meaning of this field has been inconsistent between implementations | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 |
||
/// so its use has since been deprecated in the Parquet specification. Modern implementations | ||
/// will set this to `0` to indicate that the `ColumnMetaData` is solely contained in the | ||
/// `ColumnChunk` struct. | ||
pub fn file_offset(&self) -> i64 { | ||
self.file_offset | ||
} | ||
|
@@ -1040,6 +1045,14 @@ impl ColumnChunkMetaDataBuilder { | |
} | ||
|
||
/// Sets file offset in bytes. | ||
etseidl marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/// | ||
/// This field was meant to provide an alternate to storing `ColumnMetadata` directly in | ||
/// the `ColumnChunkMetadata`. However, most Parquet readers assume the `ColumnMetadata` | ||
/// is stored inline and ignore this field. | ||
#[deprecated( | ||
since = "53.0.0", | ||
note = "The Parquet specification requires this field to be 0" | ||
)] | ||
pub fn set_file_offset(mut self, value: i64) -> Self { | ||
self.0.file_offset = value; | ||
self | ||
|
@@ -1453,7 +1466,6 @@ mod tests { | |
let col_metadata = ColumnChunkMetaData::builder(column_descr.clone()) | ||
.set_encodings(vec![Encoding::PLAIN, Encoding::RLE]) | ||
.set_file_path("file_path".to_owned()) | ||
.set_file_offset(100) | ||
.set_num_values(1000) | ||
.set_compression(Compression::SNAPPY) | ||
.set_total_compressed_size(2000) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -649,13 +649,10 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> { | |
)); | ||
} | ||
|
||
let file_offset = self.buf.bytes_written() as i64; | ||
|
||
let map_offset = |x| x - src_offset + write_offset as i64; | ||
let mut builder = ColumnChunkMetaData::builder(metadata.column_descr_ptr()) | ||
.set_compression(metadata.compression()) | ||
.set_encodings(metadata.encodings().clone()) | ||
.set_file_offset(file_offset) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Below in line 677 the metadata is still written to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, maybe this needs to be made clearer in the documentation. The spec still requires the field to be written so old readers don't break (i.e. it is still marked Oh, I misunderstood your comment...the write below is correct as this is writing the copy of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The code is hard to follow, but my understanding is that this method copies an existing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My understanding is that this code: close.metadata = builder.build()?; Sets the metadata that will be eventually written into the file footer. This code, perhaps what @jhorstmann is referring to (the line numbers appear to have changed): SerializedPageWriter::new(self.buf).write_metadata(&metadata)?; Also seems to write the contents of In this case, it does seem like it could be avoided too There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, ok, I had somehow convinced myself that this method was in the normal write path and that the write in question was necessary :( I think I need to add a check that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, that rippled a bit. 😅 Let me know if this is ok, or if There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think removing write_metadata looks good to me |
||
.set_total_compressed_size(metadata.compressed_size()) | ||
.set_total_uncompressed_size(metadata.uncompressed_size()) | ||
.set_num_values(metadata.num_values()) | ||
|
@@ -680,7 +677,6 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> { | |
} | ||
} | ||
|
||
SerializedPageWriter::new(self.buf).write_metadata(&metadata)?; | ||
let (_, on_close) = self.get_on_close(); | ||
on_close(close) | ||
} | ||
|
@@ -808,14 +804,6 @@ impl<'a, W: Write + Send> PageWriter for SerializedPageWriter<'a, W> { | |
Ok(spec) | ||
} | ||
|
||
fn write_metadata(&mut self, metadata: &ColumnChunkMetaData) -> Result<()> { | ||
let mut protocol = TCompactOutputProtocol::new(&mut self.sink); | ||
metadata | ||
.to_column_metadata_thrift() | ||
.write_to_out_protocol(&mut protocol)?; | ||
Ok(()) | ||
} | ||
|
||
fn close(&mut self) -> Result<()> { | ||
self.sink.flush()?; | ||
Ok(()) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the major functional change I think -- I expect it would result in slightly smaller parquet files as the per column metadata is no longer written twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac is here some impl using this metadata? Would a flag being better here? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this has been discussed in the ML and should be safe to remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://lists.apache.org/thread/r6r2cvzrdoorq6h6gqwh0b1hbfbhxv29
apache/parquet-format#440
Would you mind find the detail message for that? I'll linking it to C++ issues there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best evidence that no readers use the copy outside the footer is this https://lists.apache.org/thread/ob2szsb0cv6fpwmvknss9jot1oqnxzsp
If any readers were using the alternate metadata they would have submitted bug reports by now 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've sync with gang and I got the info that Parquet-Java doesn't write that and doesn't impl doesn't read that, let's rush forward