Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First draft of docs about parquet format vs mr #53

Merged
merged 13 commits into from
May 15, 2024

Conversation

vinooganesh
Copy link
Collaborator

@vinooganesh vinooganesh commented Apr 28, 2024

@wgtmac @Fokko @gszadovszky @shangxinli @jacques-n

This is a hopefully non-controversial v0 explanation of the difference between the parquet-format and parquet-mr repos. Always happy to provide more depth and open to feedback here.

It doesn't include any of the "v2" stuff that we were discussing on the email chain, but that can be a quick follow up PR if people think including that would be helpful.

* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion.


### Other Clients / Libraries / Tools
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @tustvold @pitrou @emkornfield to see if we need to add well-known Parquet implementations in the arrow-rs and Apache Arrow C++ libraries at this moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to generally cover the parquet-cpp / arrow situation as this is also leading sometimes to a bit of confusion. The parquet part of the Arrow repository is actually part of the Parquet PMC, but back in the past we decided to merge it into Arrow as the Parquet C++ community was a subset of the Arrow C++ community and all development happened in the context of Arrow C++/Python.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a list of the implementations that I'm aware of / that are referenced here below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is crucial I added some more below. I think the discussion on the ML on how the parquet community views other implementations is something we should document when we come to a consensus.


The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files.

Parquet MR can be thought of the a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this space intentional?

Suggested change
Parquet MR can be thought of the a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md).
Parquet MR can be seen as the "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md).

Slight suggestion on the wording.

Copy link

@tustvold tustvold Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is parquet-mr actually the reference implementation, or would implementation in say parquet-cpp and parquet-rs be sufficient for standardisation of a feature? Are there already examples of this? Perhaps page indices which I believe were first implemented as part of Impala?

Perhaps it is beyond the scope of this issue, but I think the relationship between parquet-mr and parquet the specification is at the core of this issue, and the broader governance ambiguity surrounding parquet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: parquet-mr and cudf were used as the two different PoC implementations in the vote for format change https://issues.apache.org/jira/browse/PARQUET-2261.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: page indices (aka column indexes) were implemented in Impala and parquet-mr at the same time (in the same room actually).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll tweak the wording - haha you can probably see I was going back and forth about whether or not to call it "a" reference implementation or "the" reference implementation. also h/t to @gszadovszky who's wording I borrowed for this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow up -- we are discussing reference implementations on the mailing list: https://lists.apache.org/thread/f9379yx0lf5gtpkgyv922pvowtzy4kmm

Here is a non-exhaustive list of Parquet implementations:

* [parquet-mr](https://github.com/apache/parquet-mr)
* [parquet-cpp](https://github.com/apache/parquet-cpp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* [parquet-cpp](https://github.com/apache/parquet-cpp)
* [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xhochy! For my own knowledge, is parquet-cpp effectively deprecated in favor of https://github.com/apache/arrow/tree/main/cpp/src/parquet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We state that at the top of the README but this is sadly quite often overlooked.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend you add one more commit that moves all files to a deprecated directory and only have a readme that says it has moved. That way the history is still easily accessible but nobody gets confused.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like @jacques-n 's suggestion or it may be worth just archiving the repo on github.

* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md)
* [cudf](https://github.com/rapidsai/cudf)
* [impala](https://github.com/apache/impala)
* [duckdb](https://github.com/duckdb/duckdb)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure wha the intent of this list is there are a couple of others off the top of my mind:

  • arrow2 rust
  • fastparquet python
  • Parquet Go (also housed in Arrow mono-repo not part of the Parquet PMC)
  • polars (rust) I believe forked arrow2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it is worth drawing a distinction between parquet implementations that are made for direct external consumption, e.g. fastparquet, parquet-cpp, parquet-rs, parquet-go and those that are implementation details of a broader engine - e.g. polars, duckdb, impala?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this empty file fill the contents later?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had the plan to split out parquet-format and the client libs into separate pages, but decided against that

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


### Parquet Format

The "Parquet Format" project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please make spelling of the project name consistent: parquet-format not "Parquet Format".


The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files.

Parquet-MR can be seen as a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, which are listed below.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, can we write parquet-mr consistently?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"of the Parquet format", not "of parquet-format". We are implementing the spec, not its repository :-)

Parquet-MR can be seen as a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, which are listed below.

Included in parquet-mr:
* Java/Scala Implementation: It contains the core Java/Scala implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there actually some Scala code in it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should just say Java implementation here. The scala code is just for filters and we don't have a full scala implementation.

Here is a non-exhaustive list of Parquet implementations:

* [parquet-mr](https://github.com/apache/parquet-mr)
* [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)
Copy link
Member

@pitrou pitrou May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)
* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html))

* [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)
* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md)
* [cudf](https://github.com/rapidsai/cudf)
* [impala](https://github.com/apache/impala)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Apache Impala?

@vinooganesh
Copy link
Collaborator Author

@pitrou - Updated with per your feedback

Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @vinooganesh -- I think this is a great start and we should merge this PR and iterate on any remaining content.


The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files.

The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any reason to add quotes

Suggested change
The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below.
The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would second the removal of the quotes here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rendered this PR locally (with the instructions from #56) and it looks like a very nice improvement to me

Screenshot 2024-05-11 at 9 13 12 AM Screenshot 2024-05-11 at 9 13 19 AM

@vinooganesh
Copy link
Collaborator Author

@wgtmac - given consensus here, would you be able to merge?

@wgtmac
Copy link
Member

wgtmac commented May 12, 2024

@xhochy @pitrou @tustvold Would you like to take a final pass?

Will merge it if there is no further comment next week.

@vinooganesh
Copy link
Collaborator Author

Thanks - and just to make sure it's clear, my main goal was to start the process of actually documenting the institutional knowledge in the community and this PR is mostly intended as a starting point. There are some other much meatier topics (parquet v2's definition for example) that will need to be discussed in follow up PRs.

@alamb
Copy link
Collaborator

alamb commented May 12, 2024

Thanks - and just to make sure it's clear, my main goal was to start the process of actually documenting the institutional knowledge in the community and this PR is mostly intended as a starting point. There are some other much meatier topics (parquet v2's definition for example) that will need to be discussed in follow up PRs.

I think documenting the current / institutional knowledge is superful helpful. Thank you for pushing this forward

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 in general, but need to polish terminology and spelling.


### parquet-mr

The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please make the terminology consistent? Either describe both parquet-format and parquet-mr as "projects" or as "GitHub repositories", but not one and the other.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make this change. These are referred to publicly as both projects and repo (in our mailing list as well) so I deliberately put both in. I'll stick with repository though.


The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files.

The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's "a" reference implementation, it means there are other ones. But I don't see them listed here.

Either list all reference implementations explicitly, or make this "the" reference implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we knew what those references implementations are, I agree it would be valuable to document.

However, I think there is consensus required before we made such a determination

Thus, for this PR I suggest:

  1. Remove the word "reference"
  2. File a follow on ticket / discussion in the mailing list to figure out what should be listed as references implementations
Suggested change
The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below.
The parquet-mr repo contains an implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There have been a lot of conversations about this: #53 (comment) (and others on the thread) and I'm inclined to keep this as is. I don't think we need to exhaustively list other the reference reference implementing when there is a list of implementations below. @gszadovszky has also called this a reference implementation and I think it helps clarify the relationship between the parquet-format and parquet-mr. I'm more than happy to update this once the community has reached a consensus after the mailing list discussion that @alamb suggested though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't agree on what a reference implementation then we should not list parquet-mr as a reference implementation. The term "reference implementation" has an official connotation and implies a specific status; it certainly should not be assigned lightly.

Copy link
Collaborator Author

@vinooganesh vinooganesh May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @gszadovszky @Fokko @xhochy @wgtmac. I don't want to further block this PR by settling this beforehand, so I'm going to remove the word "reference" and we can add it back if we want to in a subsequent PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started a thread on the mailing list about this topic to see if we can reach consensus: https://lists.apache.org/thread/f9379yx0lf5gtpkgyv922pvowtzy4kmm


### parquet-mr

The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can we explain what "mr" stands for? It's a mystery for most people.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou I assume it's mapreduce, but please correct me if I'm wrong


### parquet-format

The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Apache Parquet" perhaps?


* [parquet-mr](https://github.com/apache/parquet-mr)
* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html))
* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure all these project names/references are appropriately spelled and capitalized?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And could also mention for Go (and similarly for rust below) that it is a subproject of Arrow Go, similarly like done for C++ above

(it also seems there are several Parquet Go implementations, others that are not part of the Arrow project, so it's good to clarify this one is Arrow related. But at the same time it's not entirely clear what the criterium is for being listed here ..)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend that (as a follow on PR) we turn this list into a table, something like

Project Language Website API Docs
parquet-mr Java link
parquet-cpp C++ link link
parquet-rs Rust link link

Also I recommend the criteria for being listed here is "Open source implementations of the parquet format" (which is a low bar to be sure

I would be happy to propose such changes as a follow on PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this work is tracked by https://issues.apache.org/jira/browse/PARQUET-2310 and there is a draft at #34. Once this PR gets merged we'll start working on that

* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet)
* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md)
* [cudf](https://github.com/rapidsai/cudf)
* [apache impala](https://github.com/apache/impala)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly "Apache Impala".

* [cudf](https://github.com/rapidsai/cudf)
* [apache impala](https://github.com/apache/impala)
* [duckdb](https://github.com/duckdb/duckdb)
* [fast-parquet python](https://github.com/dask/fastparquet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's "fastparquet". Perhaps "fastparquet, a Python implementation of the Parquet format".

@alamb
Copy link
Collaborator

alamb commented May 13, 2024

Given the size and substance of this PR, it is unlikely we will get it perfect the first time. Also, I don't see any disagreement across commenters on the value of this information.

I would personally suggest that we address all the outstanding comments as best as possible, merge this PR, and then iterate on the content in subsequent PRs

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last spelling nit 😉. Thank you @vinooganesh for starting this documentation project.

content/en/docs/Overview/_index.md Outdated Show resolved Hide resolved
@alamb
Copy link
Collaborator

alamb commented May 15, 2024

I think we should merge this PR and begin working on the next steps (feature compatibility matrix)

This is quite an impressive list of ✅ 🚀

Screenshot 2024-05-15 at 7 44 35 AM

@wgtmac wgtmac merged commit de58c2d into apache:production May 15, 2024
@wgtmac
Copy link
Member

wgtmac commented May 15, 2024

I just merged it. Thanks everyone!

@vinooganesh
Copy link
Collaborator Author

Thanks everyone!

@vinooganesh vinooganesh deleted the vinooganesh/addFormatVsMrDocs branch May 17, 2024 02:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.