
First draft of docs about parquet format vs mr #53

Merged
merged 13 commits on May 15, 2024
37 changes: 37 additions & 0 deletions content/en/docs/Overview/_index.md
@@ -7,3 +7,40 @@ description: >
---

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) repositories.


### parquet-format

The parquet-format repository hosts the official specification of the Apache Parquet file format, defining how data is structured and stored on disk. This specification, together with the Thrift metadata definitions, is what implementers need in order to read and write Parquet files correctly.
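As a taste of what the repository contains, here is an abridged excerpt of the Thrift metadata definitions (`parquet.thrift`); the full definition has additional fields:

```thrift
/** Description for file metadata (abridged) **/
struct FileMetaData {
  /** Version of this file **/
  1: required i32 version

  /** Parquet schema for this file, represented as a flattened tree **/
  2: required list<SchemaElement> schema;

  /** Number of rows in this file **/
  3: required i64 num_rows

  /** Row groups in this file **/
  4: required list<RowGroup> row_groups

  /** Optional key/value metadata **/
  5: optional list<KeyValue> key_value_metadata

  /** String for application that wrote this file **/
  6: optional string created_by
}
```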

As it is focused on the specification, the parquet-format repository does not contain source code.
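The specification also pins down the physical file layout: every Parquet file begins and ends with the 4-byte magic number `PAR1`, with the footer metadata and its length stored just before the trailing magic. As a minimal illustration (not part of any official tooling), a quick sanity check for this framing might look like:

```python
import os

def looks_like_parquet(path):
    """Check for the 4-byte 'PAR1' magic number at both ends of a file,
    as required by the Parquet format specification."""
    # Loose lower bound: leading magic + footer length + trailing magic.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

This only checks the framing, not that the footer metadata itself is valid; a real reader would next parse the Thrift-encoded `FileMetaData` from the footer.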


### parquet-mr

The parquet-mr repository is part of the Apache Parquet project and provides the Java tools for working with the Parquet file format in the Hadoop ecosystem ("mr" stands for MapReduce). It contains the Java libraries and modules that developers use to read and write Apache Parquet files.

The parquet-mr repository contains an implementation of the Apache Parquet format. There are a number of other Parquet format implementations, which are listed below.

Included in parquet-mr:
* Java implementation: the core Java implementation of the Apache Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop.

* Utilities and APIs: various utilities and APIs for working with Apache Parquet files, including tools for data import/export, schema management, and data conversion.


### Other Clients / Libraries / Tools
**Member:** cc @tustvold @pitrou @emkornfield to see if we need to add well-known Parquet implementations in the arrow-rs and Apache Arrow C++ libraries at this moment.

**Member:** I think it would make sense to cover the parquet-cpp / Arrow situation in general, as it also sometimes leads to a bit of confusion. The Parquet part of the Arrow repository is actually owned by the Parquet PMC, but we decided in the past to merge it into Arrow, since the Parquet C++ community was a subset of the Arrow C++ community and all development happened in the context of Arrow C++/Python.

**Collaborator (author):** Added a list of the implementations that I'm aware of / that are referenced below.

**Contributor:** I'm not sure it is crucial. I added some more below. I think the discussion on the mailing list about how the Parquet community views other implementations is something we should document when we come to a consensus.


The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. Note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to test thoroughly to ensure compatibility and performance across the different platforms and tools involved.

Here is a non-exhaustive list of Parquet implementations:

* [parquet-mr](https://github.com/apache/parquet-mr)
* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html))
* [Parquet Go, a subproject of Arrow Go](https://github.com/apache/arrow/tree/main/go/parquet) ([documentation](https://github.com/apache/arrow/tree/main/go))
* [Parquet Rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md)
* [cuDF](https://github.com/rapidsai/cudf)
* [Apache Impala](https://github.com/apache/impala)
* [DuckDB](https://github.com/duckdb/duckdb)
* [fastparquet, a Python implementation of the Apache Parquet format](https://github.com/dask/fastparquet)