Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support byte array for decimal in parquet page and row group filters #4742

Merged
merged 2 commits into from
Jan 2, 2023

Conversation

liukun4515
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Dec 27, 2022
@liukun4515 liukun4515 marked this pull request as ready for review December 31, 2022 13:00
@liukun4515 liukun4515 requested a review from alamb December 31, 2022 13:01
@alamb alamb changed the title support byte array for decimal in the filter support byte array for decimal in parquet page filter Jan 1, 2023
@alamb alamb changed the title support byte array for decimal in parquet page filter support byte array for decimal in parquet page and row group filters Jan 1, 2023
@@ -627,6 +638,56 @@ mod tests {
);

// TODO: BYTE_ARRAY support read decimal from parquet, after the 20.0.0 arrow-rs release
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment can probably be removed as well

.collect();
Some(Arc::new(array))
}
Index::BYTE_ARRAY(index) => match $self.target_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this code is covered? Is there some test that does page filtering on Decimal values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because there is no UT for the the page filter, but I can add some test cases in the follow up pr for the page filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false,
)],
);
let rgm2 = get_row_group_meta_data(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is worth adding a test for row group metadata that has nulls for min and / or max?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you means we need add null value in the decimal column to check min/max?
I think we can refactor the test cases to do that in the follow up pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is worth adding a test for row group metadata that has nulls for min and / or max?

do you means the min or the max value is null?

Copy link
Contributor Author

@liukun4515 liukun4515 Jan 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can add the metadata for one row group like bellow

vec![ParquetStatistics::int32(
                None,
                Some(600),
                None,
                0,
                false,
            )],

or

vec![ParquetStatistics::int32(
                Some(100),
                None,
                None,
                0,
                false,
            )],

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you means the min or the max value is null?

Yes this is what I had in mind

@alamb
Copy link
Contributor

alamb commented Jan 1, 2023

Thank you @liukun4515

@liukun4515 liukun4515 requested a review from alamb January 2, 2023 09:39
@alamb alamb merged commit f91f623 into apache:master Jan 2, 2023
@ursabot
Copy link

ursabot commented Jan 2, 2023

Benchmark runs are scheduled for baseline = 54ae432 and contender = f91f623. f91f623 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants