Improve support for Parquet in abstract files sources (e.g. S3) #5760
Labels: area/connectors, autoteam, connectors/source/s3, connectors/sources-files, team/connectors-python, type/enhancement
Parquet support was added in this PR. We're not currently actively supporting partitioned Parquet datasets, even though they are quite a common use case.
Because the abstract files source iterates through a dataset file by file, reading a partitioned Parquet dataset is hard. See how PyArrow handles this here and here.
At the moment I think the connector would sort of work, but it could perform quite poorly and, more importantly, miss the columns that the Parquet dataset is partitioned on (untested, but I think that would be the case).
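To illustrate why those columns would go missing: in a partitioned dataset the partition values typically live in the directory names rather than inside the individual Parquet files, so a reader that opens one file at a time never sees them. Below is a minimal sketch of how a file-by-file reader could recover them from the object key. It assumes hive-style `key=value` directory names, which this issue does not confirm, and `partition_values` is a hypothetical helper, not part of the connector.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote


def partition_values(key: str) -> dict[str, str]:
    """Return the hive-style partition columns encoded in an object key.

    Hypothetical helper for illustration; assumes `name=value` directories.
    """
    values = {}
    # Only directory segments can be partitions; skip the file name itself.
    for segment in PurePosixPath(key).parent.parts:
        name, sep, value = segment.partition("=")
        if sep and name:
            # Hive-style writers percent-encode special characters in values.
            values[name] = unquote(value)
    return values


# Example key (made up): two partition levels above the data file.
example = partition_values("sales/year=2021/month=08/part-0000.parquet")
print(example)
```

The recovered values are plain strings, so a real implementation would still need to decide on their types and merge them into each record's schema. PyArrow's dataset API (`pyarrow.dataset.dataset(..., partitioning="hive")`) handles both steps when it is given the dataset root instead of individual files.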
A couple of enhancements that would be good here: