-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First draft of docs about parquet format vs mr #53
First draft of docs about parquet format vs mr #53
Conversation
* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion. | ||
|
||
|
||
### Other Clients / Libraries / Tools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @tustvold @pitrou @emkornfield to see if we need to add well-known Parquet implementations in the arrow-rs and Apache Arrow C++ libraries at this moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to generally cover the parquet-cpp
/ arrow
situation as this is also leading sometimes to a bit of confusion. The parquet
part of the Arrow repository is actually part of the Parquet PMC, but back in the past we decided to merge it into Arrow as the Parquet C++ community was a subset of the Arrow C++ community and all development happened in the context of Arrow C++/Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a list of the implementations that I'm aware of / that are referenced here below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it is crucial I added some more below. I think the discussion on the ML on how the parquet community views other implementations is something we should document when we come to a consensus.
content/en/docs/Overview/_index.md
Outdated
|
||
The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. | ||
|
||
Parquet MR can be thought of the a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this space intentional?
Parquet MR can be thought of the a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md). | |
Parquet MR can be seen as the "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md). |
Slight suggestion on the wording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is parquet-mr actually the reference implementation, or would implementation in say parquet-cpp and parquet-rs be sufficient for standardisation of a feature? Are there already examples of this? Perhaps page indices which I believe were first implemented as part of Impala?
Perhaps it is beyond the scope of this issue, but I think the relationship between parquet-mr and parquet the specification is at the core of this issue, and the broader governance ambiguity surrounding parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record: parquet-mr and cudf were used as the two different PoC implementations in the vote for format change https://issues.apache.org/jira/browse/PARQUET-2261.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record: page indices (aka column indexes) were implemented in Impala and parquet-mr at the same time (in the same room actually).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll tweak the wording - haha you can probably see I was going back and forth about whether or not to call it "a" reference implementation or "the" reference implementation. also h/t to @gszadovszky who's wording I borrowed for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To follow up -- we are discussing reference implementations on the mailing list: https://lists.apache.org/thread/f9379yx0lf5gtpkgyv922pvowtzy4kmm
content/en/docs/Overview/_index.md
Outdated
Here is a non-exhaustive list of Parquet implementations: | ||
|
||
* [parquet-mr](https://github.com/apache/parquet-mr) | ||
* [parquet-cpp](https://github.com/apache/parquet-cpp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* [parquet-cpp](https://github.com/apache/parquet-cpp) | |
* [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xhochy! For my own knowledge, is parquet-cpp effectively deprecated in favor of https://github.com/apache/arrow/tree/main/cpp/src/parquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We state that at the top of the README but this is sadly quite often overlooked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend you add one more commit that moves all files to a deprecated directory and only have a readme that says it has moved. That way the history is still easily accessible but nobody gets confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like @jacques-n 's suggestion or it may be worth just archiving the repo on github.
content/en/docs/Overview/_index.md
Outdated
* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) | ||
* [cudf](https://github.com/rapidsai/cudf) | ||
* [impala](https://github.com/apache/impala) | ||
* [duckdb](https://github.com/duckdb/duckdb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure wha the intent of this list is there are a couple of others off the top of my mind:
- arrow2 rust
- fastparquet python
- Parquet Go (also housed in Arrow mono-repo not part of the Parquet PMC)
- polars (rust) I believe forked arrow2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it is worth drawing a distinction between parquet implementations that are made for direct external consumption, e.g. fastparquet, parquet-cpp, parquet-rs, parquet-go and those that are implementation details of a broader engine - e.g. polars, duckdb, impala?
content/en/docs/Overview/format.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this empty file fill the contents later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially had the plan to split out parquet-format
and the client libs into separate pages, but decided against that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
content/en/docs/Overview/_index.md
Outdated
|
||
### Parquet Format | ||
|
||
The "Parquet Format" project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's please make spelling of the project name consistent: parquet-format
not "Parquet Format".
content/en/docs/Overview/_index.md
Outdated
|
||
The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. | ||
|
||
Parquet-MR can be seen as a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, which are listed below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, can we write parquet-mr
consistently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"of the Parquet format", not "of parquet-format". We are implementing the spec, not its repository :-)
content/en/docs/Overview/_index.md
Outdated
Parquet-MR can be seen as a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, which are listed below. | ||
|
||
Included in parquet-mr: | ||
* Java/Scala Implementation: It contains the core Java/Scala implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there actually some Scala code in it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should just say Java implementation here. The scala code is just for filters and we don't have a full scala implementation.
content/en/docs/Overview/_index.md
Outdated
Here is a non-exhaustive list of Parquet implementations: | ||
|
||
* [parquet-mr](https://github.com/apache/parquet-mr) | ||
* [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet) | |
* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) |
content/en/docs/Overview/_index.md
Outdated
* [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet) | ||
* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) | ||
* [cudf](https://github.com/rapidsai/cudf) | ||
* [impala](https://github.com/apache/impala) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be Apache Impala
?
@pitrou - Updated with per your feedback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @vinooganesh -- I think this is a great start and we should merge this PR and iterate on any remaining content.
content/en/docs/Overview/_index.md
Outdated
|
||
The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. | ||
|
||
The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any reason to add quotes
The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. | |
The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would second the removal of the quotes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rendered this PR locally (with the instructions from #56) and it looks like a very nice improvement to me
![Screenshot 2024-05-11 at 9 13 12 AM](https://private-user-images.githubusercontent.com/490673/329783968-f58d8c2a-a3d2-4b55-90f1-a60dd74825d1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzODgyNDksIm5iZiI6MTczOTM4Nzk0OSwicGF0aCI6Ii80OTA2NzMvMzI5NzgzOTY4LWY1OGQ4YzJhLWEzZDItNGI1NS05MGYxLWE2MGRkNzQ4MjVkMS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMlQxOTE5MDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT00ZjJjZWIyOWQxZTRmZWQ3Y2JlMjA2YjE2MGRmMjExNGNkM2Q2M2Y0NGY1Njc0YTkyZjI1NGM2ZmI1YmQwNzljJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.pHTNXwS84N0a70zmUUHRDrXuzvH7LNdgi6nq5_UoGe4)
![Screenshot 2024-05-11 at 9 13 19 AM](https://private-user-images.githubusercontent.com/490673/329783970-50f70361-0880-4068-94bd-0d0cb4ae2581.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzODgyNDksIm5iZiI6MTczOTM4Nzk0OSwicGF0aCI6Ii80OTA2NzMvMzI5NzgzOTcwLTUwZjcwMzYxLTA4ODAtNDA2OC05NGJkLTBkMGNiNGFlMjU4MS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMlQxOTE5MDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xOWY3MTFmMmRmNDliNjlhNDViZDYzMTEzOTExMmY3NGRiM2NmNWYwYmY3ZjRjNTlkZjIwN2FkMDNjMjFmNzdjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.5sXcQ19D8xZ_sEzF_UuaHy-HaFKeFgPsm9z9Ibiztgg)
Co-authored-by: Andrew Lamb <[email protected]>
@wgtmac - given consensus here, would you be able to merge? |
Thanks - and just to make sure it's clear, my main goal was to start the process of actually documenting the institutional knowledge in the community and this PR is mostly intended as a starting point. There are some other much meatier topics (parquet v2's definition for example) that will need to be discussed in follow up PRs. |
I think documenting the current / institutional knowledge is superful helpful. Thank you for pushing this forward |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 in general, but need to polish terminology and spelling.
content/en/docs/Overview/_index.md
Outdated
|
||
### parquet-mr | ||
|
||
The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please make the terminology consistent? Either describe both parquet-format and parquet-mr as "projects" or as "GitHub repositories", but not one and the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make this change. These are referred to publicly as both projects and repo (in our mailing list as well) so I deliberately put both in. I'll stick with repository though.
content/en/docs/Overview/_index.md
Outdated
|
||
The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. | ||
|
||
The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's "a" reference implementation, it means there are other ones. But I don't see them listed here.
Either list all reference implementations explicitly, or make this "the" reference implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we knew what those references implementations are, I agree it would be valuable to document.
However, I think there is consensus required before we made such a determination
Thus, for this PR I suggest:
- Remove the word "reference"
- File a follow on ticket / discussion in the mailing list to figure out what should be listed as references implementations
The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. | |
The parquet-mr repo contains an implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There have been a lot of conversations about this: #53 (comment) (and others on the thread) and I'm inclined to keep this as is. I don't think we need to exhaustively list other the reference reference implementing when there is a list of implementations below. @gszadovszky has also called this a reference implementation and I think it helps clarify the relationship between the parquet-format
and parquet-mr
. I'm more than happy to update this once the community has reached a consensus after the mailing list discussion that @alamb suggested though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't agree on what a reference implementation then we should not list parquet-mr as a reference implementation. The term "reference implementation" has an official connotation and implies a specific status; it certainly should not be assigned lightly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @gszadovszky @Fokko @xhochy @wgtmac. I don't want to further block this PR by settling this beforehand, so I'm going to remove the word "reference" and we can add it back if we want to in a subsequent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started a thread on the mailing list about this topic to see if we can reach consensus: https://lists.apache.org/thread/f9379yx0lf5gtpkgyv922pvowtzy4kmm
content/en/docs/Overview/_index.md
Outdated
|
||
### parquet-mr | ||
|
||
The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can we explain what "mr" stands for? It's a mystery for most people.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou I assume it's mapreduce, but please correct me if I'm wrong
content/en/docs/Overview/_index.md
Outdated
|
||
### parquet-format | ||
|
||
The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Apache Parquet" perhaps?
content/en/docs/Overview/_index.md
Outdated
|
||
* [parquet-mr](https://github.com/apache/parquet-mr) | ||
* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) | ||
* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make sure all these project names/references are appropriately spelled and capitalized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And could also mention for Go (and similarly for rust below) that it is a subproject of Arrow Go, similarly like done for C++ above
(it also seems there are several Parquet Go implementations, others that are not part of the Arrow project, so it's good to clarify this one is Arrow related. But at the same time it's not entirely clear what the criterium is for being listed here ..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend that (as a follow on PR) we turn this list into a table, something like
Project | Language | Website | API Docs |
---|---|---|---|
parquet-mr | Java | link | |
parquet-cpp | C++ | link | link |
parquet-rs | Rust | link | link |
Also I recommend the criteria for being listed here is "Open source implementations of the parquet format" (which is a low bar to be sure
I would be happy to propose such changes as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI this work is tracked by https://issues.apache.org/jira/browse/PARQUET-2310 and there is a draft at #34. Once this PR gets merged we'll start working on that
content/en/docs/Overview/_index.md
Outdated
* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) | ||
* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) | ||
* [cudf](https://github.com/rapidsai/cudf) | ||
* [apache impala](https://github.com/apache/impala) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly "Apache Impala".
content/en/docs/Overview/_index.md
Outdated
* [cudf](https://github.com/rapidsai/cudf) | ||
* [apache impala](https://github.com/apache/impala) | ||
* [duckdb](https://github.com/duckdb/duckdb) | ||
* [fast-parquet python](https://github.com/dask/fastparquet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's "fastparquet". Perhaps "fastparquet, a Python implementation of the Parquet format".
Given the size and substance of this PR, it is unlikely we will get it perfect the first time. Also, I don't see any disagreement across commenters on the value of this information. I would personally suggest that we address all the outstanding comments as best as possible, merge this PR, and then iterate on the content in subsequent PRs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last spelling nit 😉. Thank you @vinooganesh for starting this documentation project.
Co-authored-by: Ed Seidl <[email protected]>
I just merged it. Thanks everyone! |
Thanks everyone! |
@wgtmac @Fokko @gszadovszky @shangxinli @jacques-n
This is a hopefully non-controversial v0 explanation of the difference between the parquet-format and parquet-mr repos. Always happy to provide more depth and open to feedback here.
It doesn't include any of the "v2" stuff that we were discussing on the email chain, but that can be a quick follow up PR if people think including that would be helpful.