
chore: Make parquet reader options Comet options instead of Hadoop options #968

Merged 4 commits into apache:main on Oct 7, 2024

Conversation

parthchandra (Contributor)

Which issue does this PR close?

Closes #876.

Rationale for this change

Moves the configuration options for the parallel Parquet reader into CometConf and adds documentation so that end users can easily understand how to configure the reader.
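For context, a minimal sketch (not part of the PR) of how an end user would set the new reader options through the Spark session once they are CometConf options. The package name org.apache.comet and the local-mode setup are assumptions here; the config entries and their .key accessors are taken from the diff below.

// Sketch only: configure Comet's parallel Parquet reader via Spark configs
// instead of Hadoop properties. Assumes CometConf lives in org.apache.comet.
import org.apache.spark.sql.SparkSession
import org.apache.comet.CometConf

val spark = SparkSession
  .builder()
  .master("local[*]") // local mode assumed for the example
  .config(CometConf.COMET_IO_MERGE_RANGES.key, "true")
  .config(CometConf.COMET_IO_MERGE_RANGES_DELTA.key, (1 << 20).toString)
  .getOrCreate()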

How are these changes tested?

Existing unit test

@@ -615,12 +617,12 @@ public void testColumnIndexReadWrite() throws Exception {
@Test
public void testWriteReadMergeScanRange() throws Throwable {
Configuration conf = new Configuration();
-    conf.set(ReadOptions.COMET_IO_MERGE_RANGES, Boolean.toString(true));
+    conf.set(CometConf.COMET_IO_MERGE_RANGES().key(), Boolean.toString(true));
Member

This test is still setting the config values in a Hadoop Configuration rather than in the Spark config. Would it make sense to update the test?

Contributor Author

There is no Spark context in this test. I've added a new test with the configuration set through the Spark config.

@@ -173,14 +147,24 @@ public Builder(Configuration conf) {
this.conf = conf;
this.parallelIOEnabled =
conf.getBoolean(
Member

This is reading from the Hadoop conf. If I set the new configs on my Spark context, how would they get propagated to the Hadoop conf?

Contributor Author

Spark SQL copies configs that are not Spark configs into the Hadoop config when the SQL context is created. Other settings also rely on this (e.g. COMET_USE_LAZY_MATERIALIZATION, COMET_SCAN_PREFETCH_ENABLED).
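To illustrate the propagation described above, a sketch assuming a local SparkSession; sessionState.newHadoopConf() is the Spark SQL API that copies session configs into a Hadoop Configuration.

// Sketch: a non-Spark config set through Spark SQL ends up in the Hadoop
// Configuration derived for the session, which is what ReadOptions.Builder
// reads from. Package name for CometConf is assumed.
import org.apache.spark.sql.SparkSession
import org.apache.comet.CometConf

val spark = SparkSession.builder().master("local[1]").getOrCreate()
spark.conf.set(CometConf.COMET_IO_MERGE_RANGES_DELTA.key, "1048576")

val hadoopConf = spark.sessionState.newHadoopConf()
assert(hadoopConf.get(CometConf.COMET_IO_MERGE_RANGES_DELTA.key) == "1048576")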

withSQLConf(
CometConf.COMET_BATCH_SIZE.key -> batchSize.toString,
CometConf.COMET_IO_MERGE_RANGES.key -> "true",
CometConf.COMET_IO_MERGE_RANGES_DELTA.key -> mergeRangeDelta.toString) {
Member

Is it possible to test that this config is actually making it into ReadOptions? If I comment out all of the code in ReadOptions that reads these configs, the test still passes. Perhaps we just need a specific test to show that setting the config on a Spark context causes a change in the ReadOptions?

Member

I added some debug logging and I do see that it is working correctly, but it would be good to have a test to confirm (and prevent regressions):

test is setting COMET_IO_MERGE_RANGES_DELTA = 1048576
ReadOptions ioMergeRangesDelta = 1048576
test is setting COMET_IO_MERGE_RANGES_DELTA = 1024
ReadOptions ioMergeRangesDelta = 1024

Contributor Author

I added an additional check for the config. The configuration passed in to ReadOptions is not directly accessible, so I tried to simulate the next best thing. See if that makes sense.
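For reference, a hedged sketch of the kind of check discussed in this thread; the build() call and the ioMergeRangesDelta() accessor are assumptions and may not match the actual ReadOptions API in the PR.

// Sketch only: verify a Spark-level config round-trips into ReadOptions.
// The Builder(Configuration) constructor appears in the diff above; the
// build() method and the accessor name below are hypothetical.
withSQLConf(CometConf.COMET_IO_MERGE_RANGES_DELTA.key -> "1048576") {
  val hadoopConf = spark.sessionState.newHadoopConf()
  val readOptions = new ReadOptions.Builder(hadoopConf).build() // build() assumed
  assert(readOptions.ioMergeRangesDelta() == 1048576) // hypothetical accessor
}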

Member

Thanks, Parth. LGTM. Looks like you need to run make format to fix some import ordering.

Contributor Author

Thanks @andygrove. Fixed style.

andygrove (Member) left a comment

LGTM with some comments on testing.

andygrove merged commit 0667c60 into apache:main on Oct 7, 2024
74 checks passed

Successfully merging this pull request may close these issues.

Add documentation for COMET_PARQUET_PARALLEL_IO_ENABLED