Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-34: implement not() for Contains predicate #2941

Merged
merged 5 commits into from
Jul 30, 2024

Conversation

clairemcginty
Copy link
Contributor

This should be the final subtask for the Contains() predicate work. It adds support for logical inversion -- not(contains(...)).

Also, it fixes a small bug with the composition of mixed Contains predicates.

Make sure you have checked all steps below.

Issue

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference GitHub issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including GitHub issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

throw new UnsupportedOperationException(
"Contains predicates cannot be composed with non-Contains predicates");
}
// If two Contains predicates refer to the same column, optimize by combining into a single predicate
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a bug with the way Contains predicates were composed -- previously it would throw an error if you tried to do something like and(contains(binaryColumn("foo", ...), contains(binaryColumn("bar"), ...)) because the columns didn't match. however, that is still a valid predicate -- it's just that they should not be attempted to be combined into a single Contains predicate.

/**
* A ValueInspector implementation that keeps state for one or more delegate inspectors.
*/
abstract static class DelegatingValueInspector extends ValueInspector {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created this because I thought the generated IncrementallyUpdatedFilterPredicateBuilder was getting complex in terms of LOC -- all the Contains ValueInspectors were overriding public void update(int|float|binary|double|... value) as well as reset(), and it was getting hard to reason about. However.... this is probably a bit less efficient, since the delegate ValueInspectors are accessed via iterator loops rather than directly invoking left/right. Let me know what you think, I'm happy to change it back to the way it was

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IncrementallyUpdatedFilterPredicateGenerator.java is pretty difficult to read and review. Since this is only applied to ContainsPredicate and does not affect others, I'm fine to make this change.

@@ -371,7 +371,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(NotIn<T> notIn) {

@Override
public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> contains) {
return contains.filter(this, IndexIterator::intersection, IndexIterator::union);
return contains.filter(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial idea was to implement IndexIterator::subtract for not(contains()) -- for example, not(contains(eq("bar"))) would return something like IndexIterator.all(getPageCount()).subtract(visit(eq("bar")). But we can't rely on the individual predicate implementations to return exact row ranges -- for example, NotIn always returns IndexIterator.all. So I don't think it's possible to safely subtract any row ranges here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we cannot make such subtraction unless it is a single-value page which has only bar.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change! I just took an initial pass and has some questions overall.

/**
* A ValueInspector implementation that keeps state for one or more delegate inspectors.
*/
abstract static class DelegatingValueInspector extends ValueInspector {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IncrementallyUpdatedFilterPredicateGenerator.java is pretty difficult to read and review. Since this is only applied to ContainsPredicate and does not affect others, I'm fine to make this change.

@@ -371,7 +371,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(NotIn<T> notIn) {

@Override
public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> contains) {
return contains.filter(this, IndexIterator::intersection, IndexIterator::union);
return contains.filter(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we cannot make such subtraction unless it is a single-value page which has only bar.

Visitor<R> visitor,
BiFunction<R, R, R> andBehavior,
BiFunction<R, R, R> orBehavior,
Function<R, R> notBehavior);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, all kinds of filters return all rows if notBehavior is found, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for example NotIn 👍

@clairemcginty clairemcginty force-pushed the parquet-34-not-contains branch from 3974c69 to 1ca2207 Compare July 10, 2024 13:22
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@wgtmac
Copy link
Member

wgtmac commented Jul 25, 2024

@gszadovszky Do you want to take a look?

@wgtmac wgtmac merged commit ea3d273 into apache:master Jul 30, 2024
9 checks passed
@wgtmac
Copy link
Member

wgtmac commented Jul 30, 2024

Merged. Thanks @clairemcginty and @gszadovszky!

@wgtmac wgtmac added this to the 1.15.0 milestone Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants