Improve support for Parquet in abstract files sources (e.g. S3) #5760

Closed
Phlair opened this issue Aug 31, 2021 · 1 comment

Phlair (Contributor) commented Aug 31, 2021

Parquet support was added in this PR. Currently we're not actively supporting partitioned Parquet datasets, but this is quite a common use case.

The way the abstract files source works, iterating through files one by one, makes reading a partitioned Parquet dataset hard. See how PyArrow can handle this here and here.

At the moment I think the connector would sort of work, but it could have quite poor performance and, more importantly, miss the columns that the Parquet dataset is partitioned on (untested, but I think that would be the case).
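For illustration, here is a rough sketch of the difference using PyArrow's dataset API (the bucket name, prefix, and partition column are made up, not taken from the connector):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Reading a single file, as the abstract files source does today: the
# partition column encoded in the directory name (date=2021-08-31) is not
# stored inside the file itself, so it never shows up in the table.
one_file = pq.read_table("s3://my-bucket/events/date=2021-08-31/part-0.parquet")

# Pointing PyArrow at the dataset root instead lets it discover the
# hive-style layout, re-attach the partition column, and prune whole
# partitions when a filter is pushed down.
dataset = ds.dataset("s3://my-bucket/events/", format="parquet", partitioning="hive")
table = dataset.to_table(filter=ds.field("date") >= "2021-08-01")
```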

A couple of enhancements that would be good here:

  • Adjust the abstractions in the abstract files source so that a format parser can take in multiple file paths at once, allowing more optimised processing
  • Extend the Parquet reader using the above changes to handle partitioned Parquet datasets properly (see the links above to the PyArrow docs, and the sketch after this list)
  • This could even allow a custom cursor field where the Parquet dataset is partitioned on date-like updated_at values, further improving performance potential.
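As a rough sketch of the first two points: a parser that receives all matching file paths at once could hand them to PyArrow as a single dataset. The parse_records signature, filesystem handle, and partition_base_dir argument below are hypothetical, not the connector's actual interface; the assumption is that partition columns can be recovered from hive-style path segments.

```python
import pyarrow.dataset as ds
from pyarrow import fs


def parse_records(file_paths, filesystem, partition_base_dir):
    """Hypothetical multi-path parser hook: build one PyArrow dataset over
    every matched file so partition columns are recovered from the paths
    and scanning is batched instead of file-by-file."""
    dataset = ds.dataset(
        file_paths,                        # list of object keys, not one path
        format="parquet",
        filesystem=filesystem,
        partitioning="hive",               # parse key=value path segments
        partition_base_dir=partition_base_dir,
    )
    for batch in dataset.to_batches():
        yield from batch.to_pylist()       # emit plain dicts as records


# Example usage against S3 (bucket and prefix are made up):
s3 = fs.S3FileSystem(region="us-east-1")
records = parse_records(
    ["my-bucket/events/date=2021-08-31/part-0.parquet",
     "my-bucket/events/date=2021-09-01/part-0.parquet"],
    filesystem=s3,
    partition_base_dir="my-bucket/events",
)
```

A dataset built this way also exposes the partition column to filter expressions, which is what would make a date-like updated_at partition usable as a cursor field.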
artem1205 (Collaborator) commented
Done in #25937
