ORC-1430: Use Hadoop 3.3.5 shaded clients #1509

Closed
wants to merge 2 commits into from

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented May 21, 2023

What changes were proposed in this pull request?

Currently, the Apache ORC project uses three Hadoop version properties.

    <hadoop.version>2.7.3</hadoop.version>
    <min.hadoop.version>2.7.3</min.hadoop.version>
    <tools.hadoop.version>2.7.3</tools.hadoop.version>

This PR aims to do the following for Apache ORC 2.0.0.

  1. Use Hadoop 3.3.5 shaded clients.
  2. Remove min.hadoop.version and tools.hadoop.version in favor of hadoop.version.
  3. Ban non-shaded clients from now on.
    <bannedDependencies>
      <excludes>
        <exclude>org.apache.hadoop:hadoop-common</exclude>
        <exclude>org.apache.hadoop:hadoop-hdfs-client</exclude>
        <exclude>org.apache.hadoop:hadoop-mapreduce-client-core</exclude>
        <exclude>org.apache.hadoop:hadoop-mapreduce-client-jobclient</exclude>
      </excludes>
      <searchTransitive>true</searchTransitive>
    </bannedDependencies>
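
For readers unfamiliar with the rule, `bannedDependencies` is a built-in rule of `maven-enforcer-plugin`. The following is a minimal sketch of how such a rule is typically wired into a `pom.xml`. The execution id is hypothetical and only one exclude is shown; the actual plugin configuration in this PR may be laid out differently.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <!-- Hypothetical execution id, chosen for illustration only -->
      <id>ban-non-shaded-hadoop-clients</id>
      <goals>
        <!-- The enforce goal fails the build when a rule is violated -->
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <bannedDependencies>
            <excludes>
              <!-- groupId:artifactId patterns that must not be on the classpath -->
              <exclude>org.apache.hadoop:hadoop-common</exclude>
            </excludes>
            <!-- Also check dependencies pulled in transitively -->
            <searchTransitive>true</searchTransitive>
          </bannedDependencies>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With a rule like this in place, a module that pulls in `hadoop-common` directly or transitively should fail the build at the enforcer step instead of silently mixing shaded and non-shaded Hadoop classes.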

Note that all changes are in pom.xml files. There is no code change.
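
As a rough illustration of items 1 and 2, a module that depends on the Hadoop 3 shaded clients through a single version property could look like the sketch below. `hadoop-client-api` and `hadoop-client-runtime` are the shaded client artifacts published by Apache Hadoop 3; the scopes shown are assumptions and may not match what the ORC modules actually declare.

```xml
<properties>
  <!-- Single version property replacing min.hadoop.version and tools.hadoop.version -->
  <hadoop.version>3.3.5</hadoop.version>
</properties>

<dependencies>
  <!-- Shaded API jar: the Hadoop classes that client code compiles against -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-api</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <!-- Shaded runtime jar: relocated third-party dependencies needed at run time.
       The runtime scope here is an assumption made for this sketch. -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-runtime</artifactId>
    <version>${hadoop.version}</version>
    <scope>runtime</scope>
  </dependency>
</dependencies>
```

Because the runtime jar relocates Hadoop's third-party dependencies, downstream projects are less likely to hit version conflicts with their own copies of those libraries.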

Why are the changes needed?

  • Hadoop 3's shaded client removes a lot of complexity for downstream clients.
  • It is stable: the Apache Spark community has been using Hadoop 3's shaded client since Apache Spark 3.2.0 (October 13, 2021) via https://issues.apache.org/jira/browse/SPARK-33212.

How was this patch tested?

Pass the CIs.

Also, I validated that there is no side effect on the Apache Spark side. The following is the change set when Apache Spark upgrades from Apache ORC 1.8.3 (AS-IS) to Apache ORC 2.0.0-SNAPSHOT.

-aircompressor/0.21//aircompressor-0.21.jar
+aircompressor/0.24//aircompressor-0.24.jar
-orc-core/1.8.3/shaded-protobuf/orc-core-1.8.3-shaded-protobuf.jar
-orc-mapreduce/1.8.3/shaded-protobuf/orc-mapreduce-1.8.3-shaded-protobuf.jar
-orc-shims/1.8.3//orc-shims-1.8.3.jar
+orc-core/2.0.0-SNAPSHOT/shaded-protobuf/orc-core-2.0.0-SNAPSHOT-shaded-protobuf.jar
+orc-mapreduce/2.0.0-SNAPSHOT/shaded-protobuf/orc-mapreduce-2.0.0-SNAPSHOT-shaded-protobuf.jar
+orc-shims/2.0.0-SNAPSHOT//orc-shims-2.0.0-SNAPSHOT.jar

@dongjoon-hyun dongjoon-hyun changed the title ORC-1430: Use Hadoop 3 shaded clients ORC-1430: Use Hadoop 3.3.5 shaded clients May 21, 2023
@dongjoon-hyun dongjoon-hyun added this to the 2.0.0 milestone May 21, 2023
@dongjoon-hyun
Member Author

Could you review this, @williamhyun and @wgtmac ?

Member

@williamhyun williamhyun left a comment

+1 LGTM

@dongjoon-hyun
Member Author

Thank you so much! Merged to main for Apache ORC 2.0.

@dongjoon-hyun dongjoon-hyun deleted the ORC-1430 branch May 21, 2023 08:09
@wgtmac
Member

wgtmac commented May 21, 2023

Thanks @dongjoon-hyun. Good to know a better approach to use Hadoop clients.

@dongjoon-hyun
Member Author

Thank you, @wgtmac !

dongjoon-hyun added a commit that referenced this pull request Oct 1, 2023
### What changes were proposed in this pull request?

This PR aims to remove the Apache Zookeeper runtime dependency for Apache ORC 2.0.0.

### Why are the changes needed?

Apache ORC 1.4.0 added the Zookeeper dependency to the pom file to reduce the uber file size because Zookeeper was used by the Hadoop Common module at that time.
- #96

After ORC-1430, Apache ORC 2.0.0 uses the Hadoop shaded clients, which shade the Zookeeper runtime dependency, so we can remove the Zookeeper dependency.

- #1509
- #1554

### How was this patch tested?

Pass the CIs.

This closes #1572 .

Closes #1630 from dongjoon-hyun/ORC-1514.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Dec 27, 2023
### What changes were proposed in this pull request?

This PR aims to always use `Hadoop Vectored IO` in Apache ORC 2.0.0.

### Why are the changes needed?

Apache ORC 2.0.0 is ready to use this new Hadoop feature.
  - #1509
  - #1554
  - [Hadoop Vectored IO Presentation](https://docs.google.com/presentation/d/1U5QRN4etbM7gkbnGO3OW4sCfUZx9LqJN/)
    > Works great everywhere; radical benefit in object stores

### How was this patch tested?

Pass the CIs.

Closes #1708 from williamhyun/hadoopvectorized.

Lead-authored-by: William Hyun <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: HarshitGupta11 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
williamhyun pushed a commit that referenced this pull request Jan 4, 2024
…2_7 classes

### What changes were proposed in this pull request?

This PR aims to remove `HadoopShimsPre2_3`, `HadoopShimsPre2_6`, `HadoopShimsPre2_7` classes and use `HadoopShimsCurrent` always.

### Why are the changes needed?

1. `HadoopShimsCurrent` supports not only Apache Hadoop 3+ but also Apache Hadoop 2.7+.

2. Apache ORC 2.0 uses the Hadoop 3.x shaded client, so it doesn't need the old shims for Hadoop 2.6 and older.
    - #1509

In addition, the Apache Spark community has been using the shaded Hadoop client since Spark 3.2 (SPARK-33212) and dropped the `Hadoop 2` profile completely in Spark 3.5.0 via [SPARK-42452](https://issues.apache.org/jira/browse/SPARK-42452).

### How was this patch tested?

Pass the CIs.

Closes #1724 from dongjoon-hyun/ORC-1569.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: William Hyun <[email protected]>
cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
dongjoon-hyun added a commit to apache/spark that referenced this pull request Mar 8, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 2.0.0 for Apache Spark 4.0.0.

The Apache ORC community has a 3-year support policy, which is longer than Apache Spark's. The versions are aligned as follows.
- Apache ORC 2.0.x <-> Apache Spark 4.0.x
- Apache ORC 1.9.x <-> Apache Spark 3.5.x
- Apache ORC 1.8.x <-> Apache Spark 3.4.x
- Apache ORC 1.7.x (Supported) <-> Apache Spark 3.3.x (End-Of-Support)

### Why are the changes needed?

**Release Note**
- https://github.com/apache/orc/releases/tag/v2.0.0

**Milestone**
- https://github.com/apache/orc/milestone/20?closed=1
  - apache/orc#1728
  - apache/orc#1801
  - apache/orc#1498
  - apache/orc#1627
  - apache/orc#1497
  - apache/orc#1509
  - apache/orc#1554
  - apache/orc#1708
  - apache/orc#1733
  - apache/orc#1760
  - apache/orc#1743

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45443 from dongjoon-hyun/SPARK-44115.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024