[copy_from] Support the Parquet format #31173
Open
Stacked on top of #31144
This PR adds a new implementation of a `OneshotFormat` that supports reading in Parquet files. The decoding is built on top of the `ArrowReader` implemented in #30958.
The strategy we use for reading and decoding Parquet files is as follows: the "split work" stage of a oneshot source reads the footer metadata from a Parquet file to determine the Row Group boundaries. The Row Groups are then distributed among timely workers for fetching and eventual decoding.
Note: Through experimentation I found that Row Groups are typically tens of MB in size, which makes them a pretty good unit of parallelization.
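To illustrate the distribution step, here is a minimal, standalone sketch of handing Row Groups out to workers. It is not the code in this PR: `RowGroupInfo` and `assign_row_groups` are hypothetical names, and the greedy "least-loaded worker" policy is an assumption chosen for illustration, since the description above does not say how Row Groups are assigned.

```rust
/// Hypothetical summary of one Row Group, as it might be derived from the
/// Parquet footer metadata. Illustrative only, not a type from this PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RowGroupInfo {
    /// Position of the Row Group within the file.
    pub index: usize,
    /// Compressed size of the Row Group in bytes.
    pub compressed_bytes: u64,
}

/// Assigns each Row Group to the worker with the fewest bytes assigned so far.
/// Ties go to the lowest-numbered worker. This policy is an assumption.
pub fn assign_row_groups(groups: &[RowGroupInfo], workers: usize) -> Vec<Vec<RowGroupInfo>> {
    assert!(workers > 0, "need at least one worker");

    let mut assignments = vec![Vec::new(); workers];
    let mut load = vec![0u64; workers];

    for group in groups {
        // `min_by_key` returns the first minimum, so ties pick the lowest index.
        let (target, _) = load
            .iter()
            .enumerate()
            .min_by_key(|(_, bytes)| **bytes)
            .expect("workers > 0");

        load[target] += group.compressed_bytes;
        assignments[target].push(group.clone());
    }

    assignments
}
```

With Row Groups in the tens of MB, balancing by bytes rather than by count keeps any one worker from being handed a disproportionate share of the fetch and decode work.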
Motivation
Fixes https://github.com/MaterializeInc/database-issues/issues/8853
Tips for reviewer
Review the final commit, the one titled "start, support Parquet for COPY FROM"
Checklist
`$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.