PARQUET-2470: Update website with larger ecosystem emphasis #59

alamb · 2024-05-13T20:17:04Z

Rationale

As described on https://issues.apache.org/jira/browse/PARQUET-2470, Parquet's role in the analytics ecosystem is substantial.

However, https://parquet.apache.org/ currently emphasizes Parquet's role in the Hadoop ecosystem. I think this causes confusion in several ways:

It implies that Parquet is focused only on Hadoop, when I think it is a critical technology across other ecosystems that are unrelated to Hadoop (e.g. Apache Iceberg, Delta Lake, etc.)
It may further the perception that the Apache Parquet project only focuses on / cares about the Hadoop / Java implementation

Changes

Update the home page content to mirror the Apache Project Description https://projects.apache.org/project.html?parquet (which does not mention Hadoop specifically)

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, and Python.

Before this PR

After the PR

content/en/_index.md

vinooganesh · 2024-05-13T20:19:10Z

+1!

etseidl

+1 Hadoop not required 😄

content/en/_index.md

etseidl · 2024-05-13T20:41:17Z

content/en/docs/Overview/_index.md

-Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
+Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. 
+It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
+Parquet is available in multiple languages including Java, C++, and Python.


Echoing @amoeba, perhaps leave out specific languages and leave it vague.

I agree it is strange to have this mention of specific technologies -- maybe we can make all three locations consistent (and more general)

I think mentioning implementation (both as end-user software and as libs) is valuable but shouldn't be part of the elevator pitch. Other formats usually solve this by a dedicated sub-section or page, e.g.:

https://jpeg.org/jpegxl/software.html (the list format is good, the fact that there's only a single implementation is not)

https://paseto.io/

https://autocrypt.org/dev-status.html

This would also allow multiple implementations for a single language, which sometimes can be valuable (e.g. if you have a backwards compatible, conservative variant and a fancy new one).

I agree 100% -- I believe we are beginning to create just such a list on #53

This set of examples is good. I have added it to https://issues.apache.org/jira/browse/PARQUET-2310 which tracks these examples

Co-authored-by: Ed Seidl <[email protected]>

julienledem

This looks great. Thank you for taking the initiative. Hadoop is not required indeed. Perhaps at some point we should rename parquet-mr to parquet-java?

alamb

Per the feedback here https://github.com/apache/parquet-site/pull/59/files#r1599769911 I have updated the text in all three places to be

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
It provides high performance data compression and encoding schemes to handle complex data in bulk.

From my perspective this PR is now ready to merge

Thanks everyone for the reviews and comments

vinooganesh · 2024-05-15T17:03:28Z

content/en/docs/Overview/_index.md

@@ -6,11 +6,11 @@ description: >
  All about Parquet.
 ---

-Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
+Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
+It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools.


Did we mean for this to say "high performance compression" or is it "high performance, compression"? I think it may be the latter. Or maybe "It provides performant compression and encoding schemes..." I was thinking the first version sounds too much like a description of a compression tool rather than of the format.

I didn't mean for the comma or lack thereof to carry any additional semantic meaning. I am happy to put a comma there if you like.

No really strong feelings, was just wondering if there was a subtextual focus intended

wgtmac · 2024-05-16T14:19:56Z

Let me merge this. Thanks everyone!

alamb · 2024-05-16T16:43:12Z

Thanks @wgtmac

julienledem · 2024-05-17T00:11:37Z

Thanks!

PARQUET-2470: Update website with larger ecosystem emphasis

06cf679

alamb commented May 13, 2024

content/en/_index.md

etseidl reviewed May 13, 2024

Update content/en/_index.md

daafc1d

Co-authored-by: Ed Seidl <[email protected]>

julienledem approved these changes May 14, 2024

Use uniform description and remove specific technology references

bc8a832

alamb commented May 15, 2024

alamb added 2 commits May 15, 2024 12:57

Merge remote-tracking branch 'origin/production' into alamb/less_hadoop

2cb88a0

remove conflict marker

11aa0f7

vinooganesh reviewed May 15, 2024

wgtmac approved these changes May 16, 2024

wgtmac merged commit 5f690a3 into apache:production May 16, 2024

alamb deleted the alamb/less_hadoop branch May 16, 2024 16:43

This was referenced May 19, 2024

PARQUET-2478: Update README with link to parquet website apache/parquet-java#1355

Merged

PARQUET-2479: Update README with link to parquet website, clarify contents apache/parquet-format#243

Merged

asfimport mentioned this pull request Jun 23, 2024

Update the website to describe the larger role of Parquet #63

Closed

vinooganesh commented May 13, 2024

etseidl left a comment

etseidl May 13, 2024

alamb May 13, 2024

crepererum May 14, 2024

alamb May 14, 2024

julienledem left a comment

alamb left a comment

vinooganesh May 15, 2024

alamb May 15, 2024

vinooganesh May 15, 2024

wgtmac commented May 16, 2024

alamb commented May 16, 2024

julienledem commented May 17, 2024
