[REGRESSION] Parquet row group pruning incorrectly prunes out row groups when columns names have . in them #5708

Closed
alamb opened this issue Mar 23, 2023 · 0 comments · Fixed by #5710
Labels
bug Something isn't working

alamb commented Mar 23, 2023

Describe the bug

Parquet row group pruning incorrectly prunes row groups when column names contain a `.`, so queries that filter on such a column can silently return too few rows.

To Reproduce

Use this file: spans.zip

Run using datafusion-cli:

SELECT "service.name" FROM 'spans.parquet';
+--------------+
| service.name |
+--------------+
| frontend     |
+--------------+
1 row in set. Query took 0.002 seconds.
❯ SELECT "service.name" FROM 'spans.parquet' WHERE "service.name" = 'frontend';
0 rows in set. Query took 0.002 seconds.

Expected behavior

The filtered query should return the same single row as the unfiltered one. If I disable row group pruning, the same query works as expected and returns a single row:

set datafusion.execution.parquet.pruning=false;
0 rows in set. Query took 0.000 seconds.
❯ SELECT "service.name" FROM 'spans.parquet' WHERE "service.name" = 'frontend';
+--------------+
| service.name |
+--------------+
| frontend     |
+--------------+

Additional context

@jacobmarble found this in IOx: https://github.com/influxdata/influxdb_iox/issues/7225

He also identified that it is a regression introduced in #5419 (see https://github.com/influxdata/influxdb_iox/issues/7225#issuecomment-1472546654) ❤️

Note that this only generates wrong results because the same file contains both a column named "name" and a column named "service.name": the pruning logic incorrectly uses the statistics for "name" when evaluating the predicate on "service.name".

If there were no column named "name", the predicate would fail to resolve any statistics, the statistics would be ignored, and the row group would not be pruned.
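
To make the failure mode concrete, here is a minimal sketch of the behavior described above. It is not DataFusion's code: the function names (`resolve_as_qualified`, `resolve_as_plain`, `may_contain`) and the min/max values for the "name" column are assumptions made up for illustration. It only shows how treating "service.name" as relation "service" plus column "name" leads to looking up the wrong statistics.

```python
# Hypothetical illustration only; none of these names exist in DataFusion.

def resolve_as_qualified(name):
    # Buggy resolution: treats "a.b" as relation "a" + column "b"
    # and keeps only the column part.
    return name.rsplit(".", 1)[-1]

def resolve_as_plain(name):
    # Correct resolution for this case: the whole string is the column name.
    return name

def may_contain(stats, column, value, resolve):
    # Returns True if the row group might contain `column = value`.
    entry = stats.get(resolve(column))
    if entry is None:
        # No statistics found: pruning is not possible, keep the row group.
        return True
    lo, hi = entry
    return lo <= value <= hi

# Per-row-group (min, max) statistics keyed by column name.
# The values for "name" are made up; "service.name" matches the issue.
stats_with_name = {
    "name": ("HTTP GET", "HTTP GET"),
    "service.name": ("frontend", "frontend"),
}
stats_without_name = {"service.name": ("frontend", "frontend")}

# Buggy lookup uses the statistics of "name", so the row group is pruned.
pruned_wrongly = not may_contain(
    stats_with_name, "service.name", "frontend", resolve_as_qualified
)
# Correct lookup uses the statistics of "service.name" and keeps it.
kept_correctly = may_contain(
    stats_with_name, "service.name", "frontend", resolve_as_plain
)
# Without a "name" column the buggy lookup finds nothing and keeps it.
kept_by_accident = may_contain(
    stats_without_name, "service.name", "frontend", resolve_as_qualified
)
```

In this sketch the wrong result depends on an unrelated column happening to share the suffix after the `.`, which is consistent with the observation that the bug only surfaces when both "name" and "service.name" exist in the file.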

@alamb alamb added the bug Something isn't working label Mar 23, 2023
@alamb alamb self-assigned this Mar 23, 2023
@alamb alamb changed the title Parquet row group pruning incorrectly prunes out row groups when columns names have . in them [REGRESSION] Parquet row group pruning incorrectly prunes out row groups when columns names have . in them Mar 23, 2023