First draft of docs about parquet format vs mr #53

Merged
merged 13 commits into from
May 15, 2024
34 changes: 34 additions & 0 deletions content/en/docs/Overview/_index.md
@@ -7,3 +7,37 @@ description: >
---

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects.


### Parquet Format

The "Parquet Format" project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files.
Member

Let's please make spelling of the project name consistent: parquet-format not "Parquet Format".


As a repository focused on specification, the parquet-format repository does not contain source code.
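
As a rough illustration of the kind of detail the specification pins down, the sketch below checks the outer framing of a Parquet file: the file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the little-endian length of the footer metadata. The function name `read_footer_length` and the synthetic input bytes are hypothetical and only serve the example; this is not a validator and does not parse the Thrift metadata itself.

```python
import struct

MAGIC = b"PAR1"  # 4-byte magic number defined by the Parquet format


def read_footer_length(data: bytes) -> int:
    """Return the footer (file metadata) length in bytes, or raise ValueError."""
    # Smallest possible framing: leading magic + footer length + trailing magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic are a little-endian uint32.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len


# Synthetic bytes for illustration only: this is not a real Parquet file.
fake = MAGIC + b"\x00" * 10 + struct.pack("<I", 10) + MAGIC
print(read_footer_length(fake))
```

Everything between the leading magic and the footer (row groups, column chunks, pages) is described by the Thrift definitions in parquet-format, which is why implementations rely on that repository rather than on this kind of byte-level check.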


### Parquet-MR

The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files.
Member

Can we please make the terminology consistent? Either describe both parquet-format and parquet-mr as "projects" or as "GitHub repositories", but not one and the other.

Collaborator Author

I can make this change. These are referred to publicly as both projects and repos (on our mailing list as well), so I deliberately put both in. I'll stick with "repository", though.

Member

Also, can we explain what "mr" stands for? It's a mystery for most people.

Collaborator Author

@pitrou I assume it's MapReduce, but please correct me if I'm wrong.


Parquet-MR can be seen as a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, which are listed below.
Member

Again, can we write parquet-mr consistently?

Member

"of the Parquet format", not "of parquet-format". We are implementing the spec, not its repository :-)


Included in parquet-mr:
* Java/Scala Implementation: It contains the core Java/Scala implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop.
Member

Is there actually some Scala code in it?

Collaborator Author

Member

Perhaps we should just say "Java implementation" here. The Scala code is just for filters, and we don't have a full Scala implementation.


* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion.


### Other Clients / Libraries / Tools
Member

cc @tustvold @pitrou @emkornfield to see if we need to add well-known Parquet implementations in the arrow-rs and Apache Arrow C++ libraries at this moment.

Member

I think it would make sense to cover the parquet-cpp / Arrow situation in general, as it sometimes causes confusion. The Parquet part of the Arrow repository actually belongs to the Parquet PMC, but some time ago we decided to merge it into Arrow, because the Parquet C++ community was a subset of the Arrow C++ community and all development happened in the context of Arrow C++/Python.

Collaborator Author

Added a list of the implementations that I'm aware of / that are referenced here below.

Contributor

I'm not sure it is crucial, but I added some more below. I think the mailing-list discussion on how the Parquet community views other implementations is something we should document once we reach a consensus.


The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. It's important to note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to conduct thorough testing to ensure compatibility and performance across different platforms and tools.

Here is a non-exhaustive list of Parquet implementations:

* [parquet-mr](https://github.com/apache/parquet-mr)
* [parquet-cpp](https://github.com/apache/parquet-cpp)
Member

Suggested change
- * [parquet-cpp](https://github.com/apache/parquet-cpp)
+ * [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)

Collaborator Author

Thanks @xhochy! For my own knowledge, is parquet-cpp effectively deprecated in favor of https://github.com/apache/arrow/tree/main/cpp/src/parquet?

Member

Yes. We state that at the top of the README but this is sadly quite often overlooked.

I recommend you add one more commit that moves all files to a deprecated directory and only have a readme that says it has moved. That way the history is still easily accessible but nobody gets confused.

Collaborator Author

I like @jacques-n's suggestion, or it may be worth just archiving the repo on GitHub.

* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md)
* [cudf](https://github.com/rapidsai/cudf)
* [impala](https://github.com/apache/impala)
Member

Will this empty file be filled in later?

Collaborator Author

I initially planned to split parquet-format and the client libraries out into separate pages, but decided against that.

Empty file.