Fix parquet pruning when column names have periods #5710

Merged
alamb merged 1 commit into apache:main from alamb/fix_stats
Mar 24, 2023


@alamb alamb commented Mar 23, 2023

Which issue does this PR close?

Closes #5708

Rationale for this change

We were seeing wrong results with parquet files that had columns with a period (`.`) in their names.

What changes are included in this PR?

Bug fix + test

Are these changes tested?

yes

I also verified that this fixes the issue we saw upstream in IOx https://github.com/influxdata/influxdb_iox/issues/7225#issuecomment-1481809300

Are there any user-facing changes?

bug fix

cc @crepererum

@github-actions github-actions bot added the core Core DataFusion crate label Mar 23, 2023
@@ -384,7 +384,7 @@ fn build_statistics_record_batch<S: PruningStatistics>(
     let mut arrays = Vec::<ArrayRef>::new();
     // For each needed statistics column:
     for (column, statistics_type, stat_field) in required_columns.iter() {
-        let column = Column::from_qualified_name(column.name());
+        let column = Column::from_name(column.name());
Contributor Author

This line is the bugfix

The rest of the PR is a test
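
A minimal sketch of why this one-line change matters, using a simplified stand-in for `Column` rather than DataFusion's actual type. The parsing rule in `from_qualified_name` below (split at the last period) is an assumption made for illustration; the real parser is more involved. The point is only that a name such as `service.name` can be read either as one column name or as a relation plus a column.

```rust
// Simplified stand-in for DataFusion's `Column`; not the real type.
#[derive(Debug, PartialEq)]
struct Column {
    relation: Option<String>,
    name: String,
}

impl Column {
    // Treat the whole string as the column name, periods included.
    fn from_name(name: &str) -> Self {
        Column {
            relation: None,
            name: name.to_string(),
        }
    }

    // Treat the text before the last period as a relation qualifier.
    // Hypothetical parsing rule, for illustration only.
    fn from_qualified_name(qualified: &str) -> Self {
        match qualified.rsplit_once('.') {
            Some((relation, name)) => Column {
                relation: Some(relation.to_string()),
                name: name.to_string(),
            },
            None => Column::from_name(qualified),
        }
    }
}

fn main() {
    // The same input string yields two different column identities.
    println!("{:?}", Column::from_qualified_name("service.name"));
    println!("{:?}", Column::from_name("service.name"));
}
```

Under this assumption, the qualified parse refers to a column called `name`, which is a different column from `service.name` when a file contains both.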

Member

@Ted-Jiang Ted-Jiang left a comment

👍

@andygrove andygrove added the bug Something isn't working label Mar 24, 2023
Comment on lines +469 to +470
// note the column name has a period in it!
Field::new("service.name", service_name.data_type().clone(), true),
Member

👍

Scenario::PeriodsInColumnNames,
"SELECT \"name\", \"service.name\" FROM t WHERE \"service.name\" = 'frontend' AND \"name\" != 'HTTP GET / DISPATCH'",
Some(0),
Some(2), // prune out middle and last row group
Member

super nit:

Suggested change
Some(2), // prune out middle and last row group
Some(2), // prune out middle and last row group

Contributor Author

sorry -- missed that -- will fix

Contributor Author

Included in #5726
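
The expectation in the test case discussed in this thread (two of three row groups pruned) can be illustrated with a small min/max pruning sketch. Everything below is hypothetical: the `RowGroupStats` type, the helper names, and the statistics values are made up for illustration and are not DataFusion's pruning code or the PR's test data. It also assumes the failure mode was that statistics for `name` were consulted in place of those for `service.name`.

```rust
use std::collections::HashMap;

// Hypothetical per-row-group (min, max) statistics for string columns.
struct RowGroupStats {
    min_max: HashMap<String, (String, String)>,
}

// Returns true when the row group may contain rows where `column == value`.
// A row group without statistics for the column is always kept.
fn may_contain(stats: &RowGroupStats, column: &str, value: &str) -> bool {
    match stats.min_max.get(column) {
        Some((min, max)) => min.as_str() <= value && value <= max.as_str(),
        None => true,
    }
}

// Counts the row groups that an equality predicate allows us to skip.
fn pruned_count(groups: &[RowGroupStats], column: &str, value: &str) -> usize {
    groups
        .iter()
        .filter(|g| !may_contain(g, column, value))
        .count()
}

// Three made-up row groups; only the first holds 'frontend' rows.
fn example_groups() -> Vec<RowGroupStats> {
    let group = |service: &str| RowGroupStats {
        min_max: HashMap::from([
            (
                "service.name".to_string(),
                (service.to_string(), service.to_string()),
            ),
            (
                "name".to_string(),
                (
                    "HTTP GET / DISPATCH".to_string(),
                    "HTTP PUT / DISPATCH".to_string(),
                ),
            ),
        ]),
    };
    vec![group("frontend"), group("backend"), group("backend")]
}

fn main() {
    let groups = example_groups();
    // Looking up the right column keeps the first row group.
    println!("{}", pruned_count(&groups, "service.name", "frontend"));
    // Looking up `name` instead skips every row group, losing matching rows.
    println!("{}", pruned_count(&groups, "name", "frontend"));
}
```

In this sketch the wrong lookup prunes the row group that actually contains the matching rows, which is the kind of wrong result the linked issue describes.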

@alamb alamb merged commit 74c3955 into apache:main Mar 24, 2023
@alamb alamb deleted the alamb/fix_stats branch March 24, 2023 16:28
Contributor

@crepererum crepererum left a comment

I wonder if we should stop using strings for columns at all to prevent this type of confusion in the future. IIRC this isn't the first bug introduced by it.

Contributor Author

alamb commented Mar 27, 2023

> I wonder if we should stop using strings for columns at all to prevent this type of confusion in the future. IIRC this isn't the first bug introduced by it.

For what it is worth, postgres uses all indexes (numbers) as I recall and that has its own set of horrible bugs (when the output indexes get mixed up).

richox pushed a commit to richox/arrow-datafusion that referenced this pull request Jun 16, 2023
Successfully merging this pull request may close these issues.

[REGRESSION] Parquet row group pruning incorrectly prunes out row groups when columns names have . in them
6 participants