Improve support for Parquet in abstract files sources (e.g. S3) #5760

Closed
Phlair opened this issue Aug 31, 2021 · 1 comment

Phlair (Contributor) commented Aug 31, 2021

Parquet support was added in this PR. Currently we're not actively supporting partitioned Parquet datasets, but this is quite a common use case.

The way the abstract files source works, iterating through files one by one, makes reading a partitioned Parquet dataset hard. See how PyArrow can handle this here and here.

At the moment I think the connector would sort of work, but it could have quite poor performance and, more importantly, miss the columns that the Parquet dataset is partitioned on (untested, but I think that would be the case).
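For illustration, here is a rough sketch of the difference using PyArrow's dataset API (the bucket name, prefix, and partition column are made up, not taken from the connector):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Reading a single file, as the abstract files source does today: the
# partition column encoded in the directory name (date=2021-08-31) is not
# stored inside the file itself, so it never shows up in the table.
one_file = pq.read_table("s3://my-bucket/events/date=2021-08-31/part-0.parquet")

# Pointing PyArrow at the dataset root instead lets it discover the
# hive-style layout, re-attach the partition column, and prune whole
# partitions when a filter is pushed down.
dataset = ds.dataset("s3://my-bucket/events/", format="parquet", partitioning="hive")
table = dataset.to_table(filter=ds.field("date") >= "2021-08-01")
```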

A couple of enhancements that would be good here:

  • Adjust the abstractions in the abstract files source so that a format parser can take in multiple file paths at once, allowing more optimised processing
  • Extend the Parquet reader using the above changes to handle partitioned Parquet datasets properly (see the links above to the PyArrow docs, and the sketch after this list)
  • This could even allow a custom cursor field where the Parquet dataset is partitioned on date-like updated_at values, further improving performance potential.
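As a rough sketch of the first two points: a parser that receives all matching file paths at once could hand them to PyArrow as a single dataset. The parse_records signature, filesystem handle, and partition_base_dir argument below are hypothetical, not the connector's actual interface; the assumption is that partition columns can be recovered from hive-style path segments.

```python
import pyarrow.dataset as ds
from pyarrow import fs


def parse_records(file_paths, filesystem, partition_base_dir):
    """Hypothetical multi-path parser hook: build one PyArrow dataset over
    every matched file so partition columns are recovered from the paths
    and scanning is batched instead of file-by-file."""
    dataset = ds.dataset(
        file_paths,                        # list of object keys, not one path
        format="parquet",
        filesystem=filesystem,
        partitioning="hive",               # parse key=value path segments
        partition_base_dir=partition_base_dir,
    )
    for batch in dataset.to_batches():
        yield from batch.to_pylist()       # emit plain dicts as records


# Example usage against S3 (bucket and prefix are made up):
s3 = fs.S3FileSystem(region="us-east-1")
records = parse_records(
    ["my-bucket/events/date=2021-08-31/part-0.parquet",
     "my-bucket/events/date=2021-09-01/part-0.parquet"],
    filesystem=s3,
    partition_base_dir="my-bucket/events",
)
```

A dataset built this way also exposes the partition column to filter expressions, which is what would make a date-like updated_at partition usable as a cursor field.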
artem1205 (Collaborator) commented
Done in #25937
