Add support for reading partitioned Parquet files #133
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11019

Add support for reading Parquet files that are partitioned by key, where the files live under a directory structure derived from the partition keys and values:

`/path/to/files/KEY1=value/KEY2=value/files`
Comments
Is there any reason to limit this to Parquet files? In Spark, this functionality is shared between CSV, JSON, ORC, and Parquet, so maybe the implementation could target the shared file listing instead. Considering #204 (adding partition pruning), it may be sensible to implement the partition pruning logic early, in the file listing procedure itself, as it could save on file listing operations, which tend to be expensive, particularly on cloud storage (EBS). I'd love to work on this, but I'd need a bit of guidance on the preferred approach.
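To make the "pruning during listing" idea concrete, here is a minimal Rust sketch; the function and predicate representation are hypothetical, not DataFusion's actual API. The point is that a `key=value` directory ruled out by an equality predicate never has to be listed at all:

```rust
use std::collections::HashMap;

/// Decide during traversal whether a `key=value` directory can be skipped,
/// given simple `column = value` predicates. Skipping early avoids issuing
/// a LIST call for the directory's contents entirely.
fn can_skip_dir(dir_name: &str, predicates: &HashMap<&str, &str>) -> bool {
    if let Some((key, value)) = dir_name.split_once('=') {
        if let Some(expected) = predicates.get(key) {
            // The directory encodes a value that contradicts the predicate.
            return value != *expected;
        }
    }
    // Not a partition directory, or no predicate on this key: must descend.
    false
}

fn main() {
    let mut predicates = HashMap::new();
    predicates.insert("year", "2021");
    assert!(can_skip_dir("year=2020", &predicates)); // pruned without listing
    assert!(!can_skip_dir("year=2021", &predicates)); // matches, descend
    assert!(!can_skip_dir("month=01", &predicates)); // unconstrained key, descend
}
```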
I do not think there is any reason to limit this to Parquet files. Parquet files are probably the most important use case initially, but the functionality would be useful for everyone. I think the first thing to do might be to write up a high-level proposal (we have used Google Docs to good effect in the past). The first work needed (for this ticket) is probably a recursive directory traversal to find all Parquet (or other format) files in subdirectories. Then there is probably work to interpret paths as their relevant partition keys, and then to implement partition pruning (based on the existing row group pruning code, I would think).
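As a rough sketch of the "interpret paths as their relevant partition keys" step (the function name and paths here are hypothetical, not actual DataFusion code), extracting Hive-style segments could look like:

```rust
/// Extract `key=value` partition segments from a file path such as
/// `/data/year=2021/month=01/part-0.parquet`.
fn parse_hive_partitions(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|segment| {
            // Only segments of the form `key=value` carry partition info;
            // plain directories and file names are ignored.
            segment
                .split_once('=')
                .filter(|(key, _)| !key.is_empty())
                .map(|(key, value)| (key.to_string(), value.to_string()))
        })
        .collect()
}

fn main() {
    let parts = parse_hive_partitions("/data/year=2021/month=01/part-0.parquet");
    assert_eq!(
        parts,
        vec![
            ("year".to_string(), "2021".to_string()),
            ("month".to_string(), "01".to_string()),
        ]
    );
}
```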
Is there a name for this sort of thing? I've seen it called Hive partitioning somewhere, but I couldn't find any kind of standard, particularly regarding the way that values should be parsed into types.
I do not know of any standard -- the systems I have heard of basically "follow what hive did" -- though if someone else has a reference that would be great.
Just to check: what Hive did in this context is the `key=value` directory naming scheme described in the issue?
@jorgecarleitao yes, I am also not aware of any standard, and implementations do differ in some subtle ways; I think we have to compare to Hive / Spark / etc. On the types: it depends on whether the type is already set in the schema or whether some inference is used on the paths. I think we can start by adding partition columns to the table schema, so we can actually parse the locations based on the declared type, and add automatic detection of types (as with CSV) later.
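A minimal sketch of that first step, with hypothetical stand-ins for the schema types rather than Arrow's actual `DataType`: once a partition column's type is declared in the table schema, the raw path value can be parsed directly into it instead of being inferred.

```rust
/// Hypothetical stand-ins for a few schema types; a real implementation
/// would use Arrow's `DataType` and scalar values.
#[derive(Clone, Copy)]
enum DeclaredType {
    Int64,
    Boolean,
    Utf8,
}

#[derive(Debug, PartialEq)]
enum PartitionValue {
    Int64(i64),
    Boolean(bool),
    Utf8(String),
}

/// Parse a raw `key=value` path value according to the type declared
/// for that partition column in the table schema.
fn parse_by_declared_type(raw: &str, ty: DeclaredType) -> Result<PartitionValue, String> {
    match ty {
        DeclaredType::Int64 => raw
            .parse::<i64>()
            .map(PartitionValue::Int64)
            .map_err(|e| format!("invalid Int64 partition value {raw:?}: {e}")),
        DeclaredType::Boolean => raw
            .parse::<bool>()
            .map(PartitionValue::Boolean)
            .map_err(|e| format!("invalid Boolean partition value {raw:?}: {e}")),
        DeclaredType::Utf8 => Ok(PartitionValue::Utf8(raw.to_owned())),
    }
}

fn main() {
    assert_eq!(
        parse_by_declared_type("2021", DeclaredType::Int64),
        Ok(PartitionValue::Int64(2021))
    );
}
```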
Hive partitioning is the most commonly used scheme, but there are other schemes as well. For example, the Python Arrow package supports both directory partitioning and Hive partitioning: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.partitioning.html?highlight=partition. I agree with @Dandandan that we should add the concept of a partition column first, then tackle how we ser/de partition values from file paths. I can see us going the Python Arrow route as well, i.e. supporting multiple partitioning schemes.
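For contrast with Hive-style `key=value` segments, here is a small hypothetical Rust sketch of the "directory" flavor mentioned above, where the keys are not encoded in the path and must come from a declared field order:

```rust
/// Directory partitioning: for a layout like `/table/2021/01/file.parquet`,
/// the i-th path segment is the value of the i-th declared partition field.
fn parse_directory_flavor(segments: &[&str], field_names: &[&str]) -> Vec<(String, String)> {
    field_names
        .iter()
        .zip(segments)
        .map(|(name, value)| (name.to_string(), value.to_string()))
        .collect()
}

fn main() {
    // With fields declared as ["year", "month"], the path segments
    // ["2021", "01"] yield year=2021 and month=01.
    let parsed = parse_directory_flavor(&["2021", "01"], &["year", "month"]);
    assert_eq!(parsed[0], ("year".to_string(), "2021".to_string()));
    assert_eq!(parsed[1], ("month".to_string(), "01".to_string()));
}
```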
The Presto/Athena syntax is nice for declaring partitions without dynamic discovery on the filesystem. It definitely needs a Google Doc treatment to outline the details; I just wanted to comment to show how one can split the filesystem / storage discovery from the idea of partitions. It is certainly easy syntax for test cases, since the interaction is 100% SQL based. Here is an example of the syntax:

```sql
CREATE EXTERNAL TABLE users ( ... )
PARTITIONED BY ( ... );

ALTER TABLE users ADD PARTITION ( ... ) LOCATION '...';
```

This is perhaps a UNION ALL of hidden tables for each partition.
I agree.
I have tried to come up with a design document regarding table formats and partitioning; sorry for its length. Inputs are very welcome!
Thank you @rdettai for the detailed write-up. I recommend sending it to the Arrow dev mailing list too, since it's a pretty major design change.
I think this can be closed now with @rdettai's new awesome listing table provider.
Oh right, but at least we now have a single implementation to cover all file formats :D