-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-34: implement not() for Contains predicate #2941
Conversation
throw new UnsupportedOperationException( | ||
"Contains predicates cannot be composed with non-Contains predicates"); | ||
} | ||
// If two Contains predicates refer to the same column, optimize by combining into a single predicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fixes a bug with the way Contains predicates were composed -- previously it would throw an error if you tried to do something like and(contains(binaryColumn("foo", ...), contains(binaryColumn("bar"), ...))
because the columns didn't match. however, that is still a valid predicate -- it's just that they should not be attempted to be combined into a single Contains
predicate.
/** | ||
* A ValueInspector implementation that keeps state for one or more delegate inspectors. | ||
*/ | ||
abstract static class DelegatingValueInspector extends ValueInspector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created this because I thought the generated IncrementallyUpdatedFilterPredicateBuilder
was getting complex in terms of LOC -- all the Contains
ValueInspectors were overriding public void update(int|float|binary|double|... value)
as well as reset()
, and it was getting hard to reason about. However.... this is probably a bit less efficient, since the delegate ValueInspectors are accessed via iterator loops rather than directly invoking left
/right
. Let me know what you think, I'm happy to change it back to the way it was
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IncrementallyUpdatedFilterPredicateGenerator.java
is pretty difficult to read and review. Since this is only applied to ContainsPredicate
and does not affect others, I'm fine to make this change.
@@ -371,7 +371,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(NotIn<T> notIn) { | |||
|
|||
@Override | |||
public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> contains) { | |||
return contains.filter(this, IndexIterator::intersection, IndexIterator::union); | |||
return contains.filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial idea was to implement IndexIterator::subtract
for not(contains())
-- for example, not(contains(eq("bar")))
would return something like IndexIterator.all(getPageCount()).subtract(visit(eq("bar"))
. But we can't rely on the individual predicate implementations to return exact row ranges -- for example, NotIn
always returns IndexIterator.all. So I don't think it's possible to safely subtract any row ranges here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we cannot make such subtraction unless it is a single-value page which has only bar
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change! I just took an initial pass and has some questions overall.
parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java
Outdated
Show resolved
Hide resolved
/** | ||
* A ValueInspector implementation that keeps state for one or more delegate inspectors. | ||
*/ | ||
abstract static class DelegatingValueInspector extends ValueInspector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IncrementallyUpdatedFilterPredicateGenerator.java
is pretty difficult to read and review. Since this is only applied to ContainsPredicate
and does not affect others, I'm fine to make this change.
@@ -371,7 +371,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(NotIn<T> notIn) { | |||
|
|||
@Override | |||
public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> contains) { | |||
return contains.filter(this, IndexIterator::intersection, IndexIterator::union); | |||
return contains.filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we cannot make such subtraction unless it is a single-value page which has only bar
.
Visitor<R> visitor, | ||
BiFunction<R, R, R> andBehavior, | ||
BiFunction<R, R, R> orBehavior, | ||
Function<R, R> notBehavior); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, all kinds of filters return all rows if notBehavior
is found, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for example NotIn
👍
Co-authored-by: Gang Wu <[email protected]>
3974c69
to
1ca2207
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@gszadovszky Do you want to take a look? |
Merged. Thanks @clairemcginty and @gszadovszky! |
This should be the final subtask for the Contains() predicate work. It adds support for logical inversion --
not(contains(...))
.Also, it fixes a small bug with the composition of mixed Contains predicates.
Make sure you have checked all steps below.
Issue
them in the PR title. For example, "PARQUET-2410: Use row count instead of value count to get row count from OffsetIndex #1234: My Parquet PR"
the ASF 3rd Party License Policy.
Tests
Commits
from "How to write a good git commit message":
Style
mvn spotless:apply -Pvector-plugins
Documentation