Hive predicate pushdown not working with multiple filters #21472
Comments
The problem with doing this in the query plan is that things like |
So is this Regarding |
Oh! My bad! The issue I was describing was a more future technical issue. The reason here why we are not able to skip items there is that we don't have Basically, we need to map:
This PR adds a fallback for the skip-batch predicate for cases where we don't have a better specialized implementation. Namely, any expression that does not have a better fallback now gets lowered to:

```
E -> all(col(A_min) == col(A_max) & col(A_nc) == 0 for A in LIVE(E)) & ~(E)
```

This basically means that if the predicate columns are constant for a batch, we can now always accurately predict whether we can skip it. Specifically, this makes pruning hive partitions much more consistent and potent. Fixes pola-rs#21472.
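To make the lowering rule above concrete, here is a minimal plain-Python sketch of the same idea. It is not the Polars implementation: `ColumnStats`, `can_skip_batch`, and the dict-based row are hypothetical names chosen for illustration, and the predicate `E` is modelled as an ordinary Python callable rather than a Polars expression.

```python
from datetime import date
from typing import Any, Callable, Dict, Iterable, NamedTuple


class ColumnStats(NamedTuple):
    """Per-batch statistics for one column (hypothetical container)."""
    min: Any
    max: Any
    null_count: int


def can_skip_batch(
    predicate: Callable[[Dict[str, Any]], bool],
    live_columns: Iterable[str],
    stats: Dict[str, ColumnStats],
) -> bool:
    """Return True only when the batch provably contains no matching rows."""
    live = list(live_columns)
    # Every column the predicate reads must be constant and null-free.
    for name in live:
        s = stats[name]
        if s.min != s.max or s.null_count != 0:
            return False  # the fallback cannot decide, so keep the batch
    # All live columns are constant, so one evaluation of E covers every row.
    row = {name: stats[name].min for name in live}
    return not predicate(row)


# A hive partition column is constant within each file.
file_stats = {"date": ColumnStats(date(2025, 1, 2), date(2025, 1, 2), 0)}
wants_jan_1 = lambda row: row["date"] == date(2025, 1, 1)
skip = can_skip_batch(wants_jan_1, ["date"], file_stats)
```

Here `skip` is true because the file's `date` is constant and does not satisfy the predicate. If any live column varied within the batch, the sketch would conservatively keep the batch.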
Thanks!
Version: Polars 1.23.0
Description
I have a parquet dataset in the usual Hive format partitioned by a `pl.Date` field, e.g. `data.parquet/date=yyyy-mm-dd/*.parquet`. If I load the data lazily by filtering on the partition column `date`, I can see that the query only requires reading a single file (as expected):
Output:
This is what I expect: it only needs to read a single file.
However, if I add an additional filter to the query, it seems to ignore the Hive partitioning and attempts to read all files in the dataset:
Output:
My expectation is that it would still only need to read the single relevant file because we are filtering on the partition column, and that the second filter would apply on the data within the single file. Is my understanding wrong? If so, what can I do to avoid reading all files in queries like the example above?
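The expected behaviour can be sketched in plain Python, independent of Polars internals: prune files by the partition value encoded in the path first, then apply the remaining filter only to rows from the surviving files. The paths, the helper names (`partition_value`, `prune_by_partition`, `filter_rows`), and the `value` column are all hypothetical and used only for illustration.

```python
import re
from typing import Any, Callable, Dict, List, Optional


def partition_value(path: str, key: str) -> Optional[str]:
    """Extract the value of a `key=value` Hive directory from a file path."""
    match = re.search(rf"(?:^|/){re.escape(key)}=([^/]+)/", path)
    return match.group(1) if match else None


def prune_by_partition(paths: List[str], key: str, wanted: str) -> List[str]:
    """Keep only the files whose partition value equals `wanted`."""
    return [p for p in paths if partition_value(p, key) == wanted]


def filter_rows(
    rows: List[Dict[str, Any]],
    predicate: Callable[[Dict[str, Any]], bool],
) -> List[Dict[str, Any]]:
    """Apply the non-partition filter to rows already read from disk."""
    return [r for r in rows if predicate(r)]


# Hypothetical dataset layout with one file per date.
paths = [
    "data.parquet/date=2025-01-01/part-0.parquet",
    "data.parquet/date=2025-01-02/part-0.parquet",
    "data.parquet/date=2025-01-03/part-0.parquet",
]

# Step 1: the partition filter alone decides which files are opened.
files_to_read = prune_by_partition(paths, "date", "2025-01-02")

# Step 2: the second filter only ever sees rows from those files.
rows_in_file = [{"value": 3}, {"value": 8}]  # stand-in for the file contents
matching = filter_rows(rows_in_file, lambda r: r["value"] > 5)
```

In this sketch the second filter has no influence on `files_to_read`, which is the behaviour the question assumes for a conjunction of a partition filter and a data filter.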
Thanks.