-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First draft of docs about parquet format vs mr #53
Changes from 2 commits
d8ba2ab
f760a75
eb5eaaa
fab717b
59e761c
95c76fd
ce4b854
1f32099
8cea17e
0c1e0cf
1ef775d
f99ac57
452fa66
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -7,3 +7,37 @@ description: > | |||||
--- | ||||||
|
||||||
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. | ||||||
|
||||||
This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. | ||||||
|
||||||
|
||||||
### Parquet Format | ||||||
|
||||||
The "Parquet Format" project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. | ||||||
|
||||||
As a repository focused on specification, the parquet-format repository does not contain source code. | ||||||
|
||||||
|
||||||
### Parquet-MR | ||||||
|
||||||
The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we please make the terminology consistent? Either describe both parquet-format and parquet-mr as "projects" or as "GitHub repositories", but not one and the other. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can make this change. These are referred to publicly as both projects and repo (in our mailing list as well) so I deliberately put both in. I'll stick with repository though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, can we explain what "mr" stands for? It's a mystery for most people. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @pitrou I assume it's mapreduce, but please correct me if I'm wrong |
||||||
|
||||||
Parquet-MR can be seen as a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, which are listed below. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Again, can we write There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "of the Parquet format", not "of parquet-format". We are implementing the spec, not its repository :-) |
||||||
|
||||||
Included in parquet-mr: | ||||||
* Java/Scala Implementation: It contains the core Java/Scala implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there actually some Scala code in it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps we should just say Java implementation here. The scala code is just for filters and we don't have a full scala implementation. |
||||||
|
||||||
* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion. | ||||||
|
||||||
|
||||||
### Other Clients / Libraries / Tools | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cc @tustvold @pitrou @emkornfield to see if we need to add well-known Parquet implementations in the arrow-rs and Apache Arrow C++ libraries at this moment. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it would make sense to generally cover the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added a list of the implementations that I'm aware of / that are referenced here below. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure it is crucial I added some more below. I think the discussion on the ML on how the parquet community views other implementations is something we should document when we come to a consensus. |
||||||
|
||||||
The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. It's important to note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to conduct thorough testing to ensure compatibility and performance across different platforms and tools. | ||||||
|
||||||
Here is a non-exhaustive list of Parquet implementations: | ||||||
|
||||||
* [parquet-mr](https://github.com/apache/parquet-mr) | ||||||
* [parquet-cpp](https://github.com/apache/parquet-cpp) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks @xhochy! For my own knowledge, is parquet-cpp effectively deprecated in favor of https://github.com/apache/arrow/tree/main/cpp/src/parquet? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. We state that at the top of the README but this is sadly quite often overlooked. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I recommend you add one more commit that moves all files to a deprecated directory and only have a readme that says it has moved. That way the history is still easily accessible but nobody gets confused. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like @jacques-n 's suggestion or it may be worth just archiving the repo on github. |
||||||
* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) | ||||||
vinooganesh marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
* [cudf](https://github.com/rapidsai/cudf) | ||||||
vinooganesh marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
* [impala](https://github.com/apache/impala) | ||||||
vinooganesh marked this conversation as resolved.
Show resolved
Hide resolved
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would this empty file fill the contents later? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I initially had the plan to split out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's please make spelling of the project name consistent:
parquet-format
not "Parquet Format".