Fix parquet predicate filtering with column projection (#15113)
Fixes #15051
Predicate filtering in the parquet reader did not work when column projection was used. This PR removes that limitation.
With this change, a filter can use both column name references and column index references.
- column name reference: the filter may refer to any column by name, even if that column is not part of the column projection.
- column reference (index): the indices refer to the output columns, in the order in which they were requested.
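As an illustration of how the two reference styles relate, here is a minimal plain-Python sketch, not libcudf code. The tuple encoding of expressions (`("col", name)`, `("ref", index)`, `("lit", value)`) and the function name `names_to_indices` are hypothetical and exist only for this example.

```python
def names_to_indices(expr, output_names):
    """Rewrite name references into index references.

    Indices are positions in `output_names`, i.e. the output columns
    in the order they were requested (hypothetical model).
    """
    kind = expr[0]
    if kind == "col":
        # A name reference becomes the position of that column in the output.
        return ("ref", output_names.index(expr[1]))
    if kind in ("ref", "lit"):
        # Index references and literals are already in their final form.
        return expr
    # Operator node: convert each operand recursively.
    return (kind,) + tuple(names_to_indices(e, output_names) for e in expr[1:])


# Filter "b > 15" evaluated against output columns requested as ["c", "b"].
converted = names_to_indices((">", ("col", "b"), ("lit", 15)), ["c", "b"])
print(converted)
```

In this model an index reference such as `("ref", 0)` always means the first requested output column, regardless of where that column sits in the file.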
This is achieved by extracting the column names from the filter and adding them to the output buffers. After predicate filtering is done, these filter-only columns are removed and only the requested columns are returned.
The change also reads statistics data only for the output columns instead of for all root columns.
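The flow described above can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not the reader's implementation: tables are dicts of lists, expressions are hypothetical nested tuples, and `column_names_in_expression` and `read_with_filter` are illustrative names.

```python
import operator

# Hypothetical operator table for the tuple-encoded expressions.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq,
       "and": operator.and_, "or": operator.or_}


def column_names_in_expression(expr):
    """Collect the column names referenced by an expression, in first-seen order."""
    kind = expr[0]
    if kind == "col":
        return [expr[1]]
    if kind == "lit":
        return []
    names = []
    for operand in expr[1:]:
        for name in column_names_in_expression(operand):
            if name not in names:
                names.append(name)
    return names


def evaluate(expr, row):
    """Evaluate an expression against one row (a dict of column name to value)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return OPS[kind](evaluate(expr[1], row), evaluate(expr[2], row))


def read_with_filter(table, projection, expr):
    """Return only the projected columns, filtered by `expr`."""
    # 1. Columns used by the filter but not requested by the caller.
    filter_only = [n for n in column_names_in_expression(expr)
                   if n not in projection]
    # 2. Read the requested columns plus the filter-only columns.
    output_names = list(projection) + filter_only
    num_rows = len(table[output_names[0]]) if output_names else 0
    rows = [{n: table[n][i] for n in output_names} for i in range(num_rows)]
    # 3. Apply the predicate.
    kept = [r for r in rows if evaluate(expr, r)]
    # 4. Drop the filter-only columns before returning.
    return {n: [r[n] for r in kept] for n in projection}


table = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": ["w", "x", "y", "z"]}
print(read_with_filter(table, ["c"], (">", ("col", "b"), ("lit", 15))))
```

Here column `b` is read only to evaluate the predicate and does not appear in the result.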
Summary of changes:
- `get_column_names_in_expression` extracts the column names used in the filter.
- The extra columns used in the filter are added to the output buffers during reader initialization.
  - `cpp/src/io/parquet/reader_impl_helpers.cpp`, `cpp/src/io/parquet/reader_impl.cpp`
- Instead of extracting statistics data for all root columns, the reader extracts it only for the output columns (including the columns used in the filter).
  - `cpp/src/io/parquet/predicate_pushdown.cpp`
  - To do this, the output column schemas and their dtypes are cached.
  - The statistics data extraction code is updated to check for `schema_idx` in the row group metadata.
  - The filter no longer needs to be converted again for all root columns; the filter already converted to output column references is passed in and reused.
  - The rest of the code is unchanged.
- After the output filter predicate is calculated, the filter-only columns are removed.
- Moved the `named_to_reference_converter` constructor to the cpp file and removed the unused constructor.
- Small `#include` cleanup.
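The statistics-related change in the summary can be pictured with the following plain-Python sketch. It is a simplified model, not the code in `predicate_pushdown.cpp`: the metadata layout (a `columns` list whose entries carry `schema_idx`, `min`, and `max`) and both function names are assumptions made for illustration, and only a single `column > value` predicate is handled.

```python
def collect_stats(row_groups, output_schema_indices):
    """Gather (min, max) per row group for the output columns only.

    Column chunks are looked up by `schema_idx`, so columns that are not
    part of the output are never inspected (hypothetical metadata layout).
    """
    stats = []
    for rg in row_groups:
        by_schema = {chunk["schema_idx"]: chunk for chunk in rg["columns"]}
        stats.append([
            (by_schema[idx]["min"], by_schema[idx]["max"])
            if idx in by_schema else (None, None)
            for idx in output_schema_indices
        ])
    return stats


def row_groups_may_match_gt(stats, column_pos, value):
    """Indices of row groups that may contain rows with column > value."""
    kept = []
    for i, per_column in enumerate(stats):
        _, max_value = per_column[column_pos]
        # Without statistics the row group cannot be ruled out.
        if max_value is None or max_value > value:
            kept.append(i)
    return kept


row_groups = [
    {"columns": [{"schema_idx": 1, "min": 0, "max": 9},
                 {"schema_idx": 2, "min": 5, "max": 6},
                 {"schema_idx": 3, "min": 100, "max": 200}]},
    {"columns": [{"schema_idx": 1, "min": 10, "max": 19},
                 {"schema_idx": 2, "min": 7, "max": 8},
                 {"schema_idx": 3, "min": 300, "max": 400}]},
]
# Output columns are schema 3 (requested) and schema 1 (filter-only).
stats = collect_stats(row_groups, [3, 1])
print(row_groups_may_match_gt(stats, 1, 9))
```

In this model the column with `schema_idx` 2 is neither requested nor used in the filter, so its statistics are never collected.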
Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Vukasin Milovanovic (https://github.com/vuule)
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Vukasin Milovanovic (https://github.com/vuule)
- Muhammad Haseeb (https://github.com/mhaseeb123)
URL: #15113