
chore: Make parquet reader options Comet options instead of Hadoop options #968

Merged 4 commits into apache:main on Oct 7, 2024

Conversation

parthchandra (Contributor)

Which issue does this PR close?

Closes #876.

Rationale for this change

Moves the configuration options for the parallel Parquet reader into CometConf and adds documentation so that end users can easily understand how to configure the reader.
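For context, a minimal sketch (not part of the PR) of how an end user would set the new reader options through the Spark session once they are CometConf options. The package name org.apache.comet and the local-mode setup are assumptions here; the config entries and their .key accessors are taken from the diff below.

// Sketch only: configure Comet's parallel Parquet reader via Spark configs
// instead of Hadoop properties. Assumes CometConf lives in org.apache.comet.
import org.apache.spark.sql.SparkSession
import org.apache.comet.CometConf

val spark = SparkSession
  .builder()
  .master("local[*]") // local mode assumed for the example
  .config(CometConf.COMET_IO_MERGE_RANGES.key, "true")
  .config(CometConf.COMET_IO_MERGE_RANGES_DELTA.key, (1 << 20).toString)
  .getOrCreate()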

How are these changes tested?

Existing unit test

@@ -615,12 +617,12 @@ public void testColumnIndexReadWrite() throws Exception {
@Test
public void testWriteReadMergeScanRange() throws Throwable {
Configuration conf = new Configuration();
-    conf.set(ReadOptions.COMET_IO_MERGE_RANGES, Boolean.toString(true));
+    conf.set(CometConf.COMET_IO_MERGE_RANGES().key(), Boolean.toString(true));
Member

This test is still setting the config values in a Hadoop Configuration rather than in the Spark config. Would it make sense to update the test?

Contributor Author

There is no Spark context in this test. I've added a new test with the configuration set through the Spark config.

@@ -173,14 +147,24 @@ public Builder(Configuration conf) {
this.conf = conf;
this.parallelIOEnabled =
conf.getBoolean(
Member

This is reading from the Hadoop conf. If I set the new configs on my Spark context, how would they get propagated to the Hadoop conf?

Contributor Author

Spark SQL copies configs that are not Spark configs into the Hadoop config when the SQL context is created. Other settings also rely on this (e.g. COMET_USE_LAZY_MATERIALIZATION, COMET_SCAN_PREFETCH_ENABLED).
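To illustrate the propagation described above, a sketch assuming a local SparkSession; sessionState.newHadoopConf() is the Spark SQL API that copies session configs into a Hadoop Configuration.

// Sketch: a non-Spark config set through Spark SQL ends up in the Hadoop
// Configuration derived for the session, which is what ReadOptions.Builder
// reads from. Package name for CometConf is assumed.
import org.apache.spark.sql.SparkSession
import org.apache.comet.CometConf

val spark = SparkSession.builder().master("local[1]").getOrCreate()
spark.conf.set(CometConf.COMET_IO_MERGE_RANGES_DELTA.key, "1048576")

val hadoopConf = spark.sessionState.newHadoopConf()
assert(hadoopConf.get(CometConf.COMET_IO_MERGE_RANGES_DELTA.key) == "1048576")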

withSQLConf(
CometConf.COMET_BATCH_SIZE.key -> batchSize.toString,
CometConf.COMET_IO_MERGE_RANGES.key -> "true",
CometConf.COMET_IO_MERGE_RANGES_DELTA.key -> mergeRangeDelta.toString) {
Member

Is it possible to test that this config is actually making it into ReadOptions? If I comment out all of the code in ReadOptions that reads these configs, the test still passes. Perhaps we just need a specific test to show that setting the config on a Spark context causes a change in the ReadOptions?

Member

I added some debug logging and I do see that it is working correctly, but it would be good to have a test to confirm (and prevent regressions):

test is setting COMET_IO_MERGE_RANGES_DELTA = 1048576
ReadOptions ioMergeRangesDelta = 1048576
test is setting COMET_IO_MERGE_RANGES_DELTA = 1024
ReadOptions ioMergeRangesDelta = 1024

Contributor Author

I added an additional check for the config. The configuration passed in to ReadOptions is not directly accessible, so I tried to simulate the next best thing. See if that makes sense.
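For reference, a hedged sketch of the kind of check discussed in this thread; the build() call and the ioMergeRangesDelta() accessor are assumptions and may not match the actual ReadOptions API in the PR.

// Sketch only: verify a Spark-level config round-trips into ReadOptions.
// The Builder(Configuration) constructor appears in the diff above; the
// build() method and the accessor name below are hypothetical.
withSQLConf(CometConf.COMET_IO_MERGE_RANGES_DELTA.key -> "1048576") {
  val hadoopConf = spark.sessionState.newHadoopConf()
  val readOptions = new ReadOptions.Builder(hadoopConf).build() // build() assumed
  assert(readOptions.ioMergeRangesDelta() == 1048576) // hypothetical accessor
}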

Member

Thanks, Parth. LGTM. Looks like you need to run make format to fix some import ordering.

Contributor Author

Thanks @andygrove. Fixed style.

andygrove (Member) left a comment

LGTM with some comments on testing.

andygrove merged commit 0667c60 into apache:main on Oct 7, 2024
74 checks passed

Successfully merging this pull request may close these issues.

Add documentation for COMET_PARQUET_PARALLEL_IO_ENABLED