[copy_from] Support the Parquet format #31173

ParkMyCar · 2025-01-24T00:18:21Z

Stacked on top of #31144

This PR adds a new OneshotFormat implementation that reads Parquet files. Decoding is built on top of the ArrowReader implemented in #30958.

The strategy for reading and decoding Parquet files is as follows: the "split work" stage of a oneshot source reads the footer metadata from a Parquet file to determine the Row Group boundaries. The Row Groups are then distributed among timely workers for fetching and eventual decoding.
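To illustrate the "split work" idea, here is a minimal sketch and not the code in this PR. It relies on one fact about the Parquet layout: a file with a plaintext footer ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`. The function names `footer_range` and `assign_row_groups` are hypothetical, and the round-robin assignment is an assumption; the PR may distribute Row Groups among workers differently.

```rust
use std::ops::Range;

/// Magic bytes that terminate a Parquet file with a plaintext footer.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Given the total file length and the last 8 bytes of the file, return the
/// byte range that holds the footer metadata. Returns `None` if the magic
/// bytes are missing or the encoded length does not fit in the file.
/// (Hypothetical helper, for illustration only.)
fn footer_range(file_len: u64, tail: [u8; 8]) -> Option<Range<u64>> {
    if tail[4..] != PARQUET_MAGIC[..] {
        return None;
    }
    // The first 4 bytes of the tail are the metadata length, little-endian.
    let meta_len = u64::from(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]));
    // The metadata sits directly before the 8-byte tail.
    let end = file_len.checked_sub(8)?;
    let start = end.checked_sub(meta_len)?;
    Some(start..end)
}

/// Assign Row Group indexes to workers round-robin. This policy is an
/// assumption made for the sketch, not necessarily what the PR does.
fn assign_row_groups(num_row_groups: usize, num_workers: usize) -> Vec<Vec<usize>> {
    let mut assignments = vec![Vec::new(); num_workers];
    if num_workers == 0 {
        return assignments;
    }
    for row_group in 0..num_row_groups {
        assignments[row_group % num_workers].push(row_group);
    }
    assignments
}
```

In this sketch, the footer can be located by reading only the last 8 bytes of the file, so the "split work" stage never has to download the data pages themselves.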

Note: Through experimentation I found that Row Groups are typically tens of MB in size, which makes them a good unit of parallelization.
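Because each Row Group is a contiguous chunk of the file, a worker could fetch one with a single ranged request. The sketch below shows how a byte extent might map to an HTTP `Range` header value, whose end offset is inclusive. The `RowGroupExtent` type and `range_header` function are hypothetical names for illustration and are not taken from this PR.

```rust
/// Byte extent of one Row Group within a file.
/// (Hypothetical type, for illustration only.)
struct RowGroupExtent {
    /// Offset of the first byte of the Row Group.
    offset: u64,
    /// Number of bytes in the Row Group.
    len: u64,
}

/// Build the value of an HTTP `Range` header covering the extent.
/// HTTP byte ranges are inclusive on both ends, so the last byte is
/// `offset + len - 1`. Returns `None` for an empty extent or on overflow.
fn range_header(extent: &RowGroupExtent) -> Option<String> {
    if extent.len == 0 {
        return None;
    }
    let last = extent.offset.checked_add(extent.len - 1)?;
    Some(format!("bytes={}-{}", extent.offset, last))
}
```

For example, an extent starting at offset 4 with a length of 100 bytes yields `bytes=4-103`.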

Motivation

Fixes https://github.com/MaterializeInc/database-issues/issues/8853

Tips for reviewer

Review only the final commit, the one titled "start, support Parquet for COPY FROM"

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

* plumb the CSV params through to oneshot_source::CsvFormat to handle different delimiters and what not
* add support for compressed CSVs using async_compression
* various Bazel changes for the C dependencies from the compression algorithms

* add a new OneshotSource implementation
* support the FILES and PATTERN options for COPY FROM

* Add a new Parquet OneshotFormat
* Fix ranged requests for HTTP and AWS sources

ParkMyCar added 3 commits January 23, 2025 17:31

start, implementation of an S3 oneshot source

58f62b2

* add a new OneshotSource implementation
* support the FILES and PATTERN options for COPY FROM

start, support Parquet for COPY FROM

3d8c1bf

* Add a new Parquet OneshotFormat
* Fix ranged requests for HTTP and AWS sources

ParkMyCar requested review from a team as code owners January 24, 2025 00:18

ParkMyCar requested a review from jkosh44 January 24, 2025 00:18

ParkMyCar mentioned this pull request Jan 25, 2025

[dnr][tables] Move read-then-write plans into clusterd #31189

Draft

5 tasks
