Iceberg/Comet integration POC #9841
Changes from 26 commits
Changes to `SmokeTest`:

```diff
@@ -28,6 +28,7 @@
 import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
 import org.junit.Assert;
 import org.junit.Before;
+import org.junit.Ignore;
 import org.junit.Test;

 public class SmokeTest extends SparkExtensionsTestBase {

@@ -44,7 +45,7 @@ public void dropTable() {
   // Run through our Doc's Getting Started Example
   // TODO Update doc example so that it can actually be run, modifications were required for this
   // test suite to run
-  @Test
+  @Ignore
   public void testGettingStarted() throws IOException {
     // Creating a table
     sql("CREATE TABLE %s (id bigint, data string) USING iceberg", tableName);
```

Review thread on the `@Ignore` change:

> Is this needed?

> This is not needed.

> Yeah, we will have to use the built-in reader by default.
New file `org.apache.iceberg.spark.OrcBatchReadConf`:

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.spark;

import java.io.Serializable;
import org.immutables.value.Value;

@Value.Immutable
public interface OrcBatchReadConf extends Serializable {
  int batchSize();
}
```
New file `org.apache.iceberg.spark.ParquetBatchReadConf`:

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.spark;

import java.io.Serializable;
import org.immutables.value.Value;

@Value.Immutable
public interface ParquetBatchReadConf extends Serializable {
  int batchSize();

  ParquetReaderType readerType();
}
```
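The two `@Value.Immutable` interfaces above are processed by the Immutables annotation processor, which generates concrete `ImmutableOrcBatchReadConf` / `ImmutableParquetBatchReadConf` classes with builders whose setter names mirror the accessor names. The sketch below is a hand-written, self-contained stand-in for that generated contract, for illustration only; it is not the real generated code.

```java
// Hand-written illustration of the value-object contract Immutables generates
// for ParquetBatchReadConf: an immutable class with accessors and a builder.
public class ParquetBatchReadConfSketch {
  // Stand-in for the ParquetReaderType enum in this PR.
  enum ParquetReaderType { ICEBERG, COMET }

  // Stand-in for the generated ImmutableParquetBatchReadConf.
  static final class Conf {
    private final int batchSize;
    private final ParquetReaderType readerType;

    private Conf(int batchSize, ParquetReaderType readerType) {
      this.batchSize = batchSize;
      this.readerType = readerType;
    }

    int batchSize() { return batchSize; }

    ParquetReaderType readerType() { return readerType; }

    static Builder builder() { return new Builder(); }
  }

  // Immutables generates a builder whose setters are named after the accessors.
  static final class Builder {
    private int batchSize;
    private ParquetReaderType readerType;

    Builder batchSize(int size) { this.batchSize = size; return this; }

    Builder readerType(ParquetReaderType type) { this.readerType = type; return this; }

    Conf build() { return new Conf(batchSize, readerType); }
  }

  // Example: building a conf the way a Spark read would.
  static Conf exampleConf() {
    return Conf.builder().batchSize(4096).readerType(ParquetReaderType.COMET).build();
  }

  public static void main(String[] args) {
    Conf conf = exampleConf();
    System.out.println(conf.batchSize() + " " + conf.readerType());
  }
}
```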
New file `org.apache.iceberg.spark.ParquetReaderType`:

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.spark;

import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

/** Enumerates the types of Parquet readers. */
public enum ParquetReaderType {

  /** ICEBERG type utilizes the built-in Parquet reader. */
  ICEBERG("iceberg"),

  /**
   * COMET type changes the Parquet reader to the Apache DataFusion Comet Parquet reader. The Comet
   * Parquet reader performs I/O and decompression in the JVM but decodes in native code to improve
   * performance. Additionally, Comet will convert Spark's physical plan into a native physical plan
   * and execute this plan natively.
   *
   * <p>TODO: Implement {@link org.apache.comet.parquet.SupportsComet} in SparkScan to convert the
   * Spark physical plan to a native physical plan for native execution.
   */
  COMET("comet");

  private final String parquetReaderType;

  ParquetReaderType(String readerType) {
    this.parquetReaderType = readerType;
  }

  public static ParquetReaderType fromName(String parquetReaderType) {
    Preconditions.checkArgument(parquetReaderType != null, "Parquet reader type is null");

    if (ICEBERG.parquetReaderType().equalsIgnoreCase(parquetReaderType)) {
      return ICEBERG;
    } else if (COMET.parquetReaderType().equalsIgnoreCase(parquetReaderType)) {
      return COMET;
    } else {
      throw new IllegalArgumentException("Unknown parquet reader type: " + parquetReaderType);
    }
  }

  public String parquetReaderType() {
    return parquetReaderType;
  }
}
```

Review thread on the `COMET` javadoc and its TODO:

> It seems

> Comet checks `SupportsComet.isCometEnabled()` and wraps `BatchScanExec` with `CometBatchScanExec` if `isCometEnabled` is true. I will make

> So this marker interface avoids the need to depend on Iceberg classes? Okay, that makes sense.

> Do we have to include this TODO here, however? It doesn't seem to belong.

> I want to have a TODO note somewhere so that people will know that native execution is not yet supported. There are some additional steps we need to take to make native execution work. Otherwise people may think native execution is enabled by this PR. Let me see if there is a better place for this.
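The case-insensitive `fromName` lookup above can be exercised on its own. The real enum lives in `org.apache.iceberg.spark` and uses Iceberg's relocated Guava `Preconditions`; the stand-alone copy below inlines the null check so it can run by itself, and is a sketch rather than the PR's class.

```java
// Self-contained stand-in for ParquetReaderType.fromName: maps a
// case-insensitive config string to a reader-type enum constant.
public class ParquetReaderTypeDemo {
  enum ReaderType {
    ICEBERG("iceberg"),
    COMET("comet");

    private final String name;

    ReaderType(String name) { this.name = name; }
  }

  // Mirrors fromName: null check, then case-insensitive match, else throw.
  static ReaderType fromName(String name) {
    if (name == null) {
      throw new IllegalArgumentException("Parquet reader type is null");
    }
    for (ReaderType type : ReaderType.values()) {
      if (type.name.equalsIgnoreCase(name)) {
        return type;
      }
    }
    throw new IllegalArgumentException("Unknown parquet reader type: " + name);
  }

  public static void main(String[] args) {
    // Lookup is case-insensitive, so config values like "COMET" also resolve.
    System.out.println(fromName("COMET"));
    System.out.println(fromName("iceberg"));
  }
}
```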
Review thread on shaded imports in `CometColumnReader`:

> I see there are imports of shaded classes in `CometColumnReader`. Are those Comet classes? Can you explain a bit what exactly is shaded?

> Comet shades Arrow, Protobuf, and Guava. `RootAllocator` is an Arrow class; `CometSchemaImporter` is a Comet class.

> I wish Comet would offer an API that wraps around shaded dependencies, so that we don't have to reference shaded classes directly. It is a bit odd. I always considered shading an internal detail rather than something that would leak into a public API. Thoughts, @RussellSpitzer @Fokko @nastra @amogh-jahagirdar @danielcweeks?

> The shaded imports can be removed. Comet has an API used here that requires these classes, but we can change the API (only this integration uses that API).

> @huaxingao can you log an issue in Comet to address this? `CometSchemaImporter` is a Comet class but is in the `org.apache.arrow.c` package to overcome access restrictions (Arrow's `SchemaImporter` is package-private). We can create a wrapper class to access the schema importer. Also, we should ideally use the allocator from `BatchReader`, but that too can be in the wrapper class, I think. There is no issue with using a new allocator for each column, but the Arrow allocator has powerful features in memory accounting that we can take advantage of down the road.

> I have created apache/datafusion-comet#1352 for this issue. Will fix this in the next minor release.

> Thank you!
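The fix the reviewers converge on is a wrapper API on the Comet side that hides the shaded Arrow classes, so callers such as Iceberg's `CometColumnReader` never import shaded packages directly. The sketch below illustrates that design pattern only; every name in it (`SchemaImporterHandle`, `newSchemaImporter`, and the rest) is hypothetical, not an actual Comet API.

```java
// Hypothetical wrapper pattern: Comet exposes a plain interface, and the
// shaded CometSchemaImporter / RootAllocator stay behind the implementation.
public class ShadedWrapperSketch {
  // Hypothetical public handle; callers depend only on this interface,
  // never on shaded org.apache.arrow.* classes.
  interface SchemaImporterHandle extends AutoCloseable {
    String importedSchema();

    @Override
    void close();
  }

  // Hypothetical internal implementation; in real code this is where the
  // shaded schema importer and allocator would live and be released.
  static final class DefaultHandle implements SchemaImporterHandle {
    private final String schema;
    private boolean closed;

    DefaultHandle(String schema) { this.schema = schema; }

    @Override
    public String importedSchema() { return schema; }

    @Override
    public void close() { closed = true; } // would free the shaded allocator

    boolean isClosed() { return closed; }
  }

  // Hypothetical factory on the Comet side; the caller never sees shaded types.
  static SchemaImporterHandle newSchemaImporter(String schema) {
    return new DefaultHandle(schema);
  }

  public static void main(String[] args) {
    // try-with-resources ensures the underlying native/shaded state is freed.
    try (SchemaImporterHandle handle = newSchemaImporter("id: bigint")) {
      System.out.println(handle.importedSchema());
    }
  }
}
```

A shared allocator (the reviewers suggest reusing the one from `BatchReader`) could likewise be passed into the factory rather than created per column, keeping Arrow's memory accounting in one place.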