Query Condition NOT support #3844

Merged: 8 commits, Mar 7, 2023


@abigalekim abigalekim commented Jan 27, 2023

Implementation of QC NOT support.

Things changed:

  • The C API has been changed slightly to support QC NOT: a NULL right condition is accepted as long as NOT is passed as the combination op.
  • Added a negate call that negates the tree when NOT is passed in.

Things to do

  • Debug validate_qc_apply.
  • integration tests

TYPE: FEATURE
DESC: Query Condition NOT support

@shortcut-integration

This pull request has been linked to Shortcut Story #24503: Implement QueryCondition NOT support.

@abigalekim abigalekim requested a review from davisp January 27, 2023 21:20
@davisp davisp force-pushed the abigalekim/sc-24503/qc-not-support branch 2 times, most recently from 585c838 to 417ba87 Compare February 2, 2023 20:31
@ihnorton ihnorton requested a review from lums658 February 3, 2023 21:07

@lums658 lums658 left a comment

Documentation somewhere should explain how NOT is realized with this implementation (though I suppose that is true for all of query_condition). At least the PR should explain a bit more than it is about NOT.

IIUC it is using DeMorgan's theorem to (recursively) convert a subtree to its complement. Kind of interesting how little needs to be done to implement NOT.
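
As a rough illustration of the recursive De Morgan rewrite described above, here is a minimal C++ sketch. The `Node`, `Op`, `Comb`, `leaf`, `combine`, `negate`, and `to_str` names are hypothetical stand-ins and not TileDB's actual AST classes, and the sketch assumes non-nullable integer attributes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified AST (not TileDB's real classes).
enum class Op { LT, LE, GT, GE, EQ, NE };
enum class Comb { AND, OR };

struct Node {
  bool is_leaf = true;
  std::string field;           // leaf only
  Op op = Op::EQ;              // leaf only
  int value = 0;               // leaf only
  Comb comb = Comb::AND;       // combination node only
  std::vector<Node> children;  // combination node only
};

Node leaf(const std::string& field, Op op, int value) {
  Node n;
  n.field = field;
  n.op = op;
  n.value = value;
  return n;
}

Node combine(Comb comb, std::vector<Node> children) {
  Node n;
  n.is_leaf = false;
  n.comb = comb;
  n.children = std::move(children);
  return n;
}

// Complement of a single comparison operator.
Op negate_op(Op op) {
  switch (op) {
    case Op::LT: return Op::GE;
    case Op::LE: return Op::GT;
    case Op::GT: return Op::LE;
    case Op::GE: return Op::LT;
    case Op::EQ: return Op::NE;
    case Op::NE: return Op::EQ;
  }
  return op;  // not reached
}

// De Morgan: NOT(a AND b) == NOT(a) OR NOT(b), and vice versa.
// The returned tree contains no NOT node at all.
Node negate(const Node& n) {
  if (n.is_leaf) return leaf(n.field, negate_op(n.op), n.value);
  std::vector<Node> kids;
  for (const Node& c : n.children) kids.push_back(negate(c));
  return combine(n.comb == Comb::AND ? Comb::OR : Comb::AND, std::move(kids));
}

const char* op_name(Op op) {
  switch (op) {
    case Op::LT: return "LT";
    case Op::LE: return "LE";
    case Op::GT: return "GT";
    case Op::GE: return "GE";
    case Op::EQ: return "EQ";
    case Op::NE: return "NE";
  }
  return "?";  // not reached
}

std::string to_str(const Node& n) {
  if (n.is_leaf)
    return n.field + " " + op_name(n.op) + " " + std::to_string(n.value);
  std::string out = "(";
  for (std::size_t i = 0; i < n.children.size(); ++i) {
    if (i > 0) out += (n.comb == Comb::AND ? " AND " : " OR ");
    out += to_str(n.children[i]);
  }
  return out + ")";
}
```

Because the negation is pushed down to the leaves, the negated tree is an ordinary AND/OR tree, which is why the original expression cannot be recovered from it.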

That being said -- the negate is applied while the AST is being built (in the C API). Is evaluating NOT while the tree is being built the best way to implement NOT? One thing we lose when doing it this way is that we can no longer recover the original query expression from the AST. Do we lose any ability to optimize or debug if we partially evaluate expressions this way?

Is support for NOT complete? Lines 1118, 1721, and 2429 in query_condition.cc still have cases in their switch statements that have NOT as an error (not currently supported). I suppose they don't really matter with this implementation of NOT, but in that case, there should be a different error message, because having NOT in the AST is impossible.

NOT will not be in the AST because it is evaluated while the tree is being built -- in the C API. Even if it seems like a good idea to partially evaluate NOT expressions, only being able to do that in the C API seems kind of limiting.

There are not very many unit tests (afaict) for NOT -- and there really should be many more. There do seem to be some tests of negate() and those only seem to be at the top level, i.e., negating an entire expression and looking at the results. But there aren't any with QueryConditionCombinationOp::NOT. There should be more complicated test expressions where NOT is not the root node. Similarly there should be tests of NOT NOT, although maybe those are really integration tests. But, per earlier comment, the NOT is not in the expression, so the tests have to consider a partially-evaluated AST, the NOT is gone.

Does NOT of a value ever make any sense? Do users expect to be able to? Perhaps on some type they are using as a bool?

delete (*combined_cond)->query_condition_;
delete *combined_cond;
return TILEDB_ERR;
if (combination_op == TILEDB_NOT) {

It might be useful here to have a comment that we are partially evaluating the tree and applying NOT via DeMorgan's theorem.

Will do.

sanity_check(ctx, right_cond) == TILEDB_ERR)
(combination_op != TILEDB_NOT &&
sanity_check(ctx, right_cond) == TILEDB_ERR) ||
(combination_op == TILEDB_NOT && right_cond != nullptr))

I am not sure I like using one half of a binary operator to be a unary operator. OTOH, not sure if it is worth changing if all we are going to have is AND, OR, NOT.

I had this same exact thought. My first reaction was to do away with the existing implementation and then provide a tiledb_query_condition_negate function. I polled the language binding crowd and the only responses I got were in favor of keeping things as is, reusing tiledb_query_condition_combine.

If I had it to do all over again, I would change the combine API from:

TILEDB_EXPORT int32_t tiledb_query_condition_combine(
    tiledb_ctx_t* ctx,
    const tiledb_query_condition_t* left_cond,
    const tiledb_query_condition_t* right_cond,
    tiledb_query_condition_combination_op_t combination_op,
    tiledb_query_condition_t** combined_cond) TILEDB_NOEXCEPT;

to:

TILEDB_EXPORT int32_t tiledb_query_condition_combine(
    tiledb_ctx_t* ctx,
    const tiledb_query_condition_t** conditions,
    size_t num_conditions,
    tiledb_query_condition_combination_op_t combination_op,
    tiledb_query_condition_t** combined_cond) TILEDB_NOEXCEPT;

The second definition removes the baked in 2-arity semantics of the combination_op so that NOT and >2-arity AND/OR operators would also be naturally expressed. I briefly considered combining this approach with #3814 but @eric-hughes-tiledb had some other concerns around the general approach of #3814 so I'm leaving it be for now.
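
To make the arity point concrete, here is a small C++ sketch that contrasts argument validation for the binary form with one possible validation rule for the n-ary form. `CombinationOp`, `valid_binary_args`, and `valid_nary_args` are hypothetical names, and the n-ary rule (exactly one operand for NOT, two or more for AND/OR) is an assumption rather than anything specified in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for tiledb_query_condition_combination_op_t.
enum class CombinationOp { AND, OR, NOT };

// Binary form, loosely mirroring the check in this PR: NOT must come with
// a null right-hand condition, AND/OR must come with a non-null one.
bool valid_binary_args(CombinationOp op, bool has_left, bool has_right) {
  if (!has_left) return false;
  return (op == CombinationOp::NOT) ? !has_right : has_right;
}

// N-ary form (assumed rule): NOT takes exactly one operand,
// AND/OR take two or more.
bool valid_nary_args(CombinationOp op, std::size_t num_conditions) {
  if (op == CombinationOp::NOT) return num_conditions == 1;
  return num_conditions >= 2;
}
```

With the n-ary shape, the special case "right operand must be null" disappears and is replaced by a plain operand count.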

CHECK(
tiledb::test::ast_node_to_str(combined_and.ast()) ==
"(x LT 12 ef cd ab AND y GT 33 33 33 33)");
check_ast_str(combined_and, "(x LT 12 ef cd ab AND y GT 33 33 33 33)");

This tests negate() but I don't see anything in this file that tests NOT. It doesn't seem like we are actually testing expressions with NOT in them, but rather just testing negate() on different expressions, meaning what is being tested is only NOT at the root of a tree. I think we need to test more complicated uses of NOT than that.

If NOT is only going to be applicable from the C API, it needs to be tested there.

(My own opinion is that NOT should not be evaluated only in the C API and that we should be able to build expressions that have QueryConditionCombinationOp::NOT in them.)

With the current implementation/assumption that NOT can't exist in the AST, there's nothing extra to test. I do need to add C/C++ API tests to prove the changes in tiledb.cc are correct. The current tests are just covering correctness.

davisp commented Feb 6, 2023

> Documentation somewhere should explain how NOT is realized with this implementation (though I suppose that is true for all of query_condition). At least the PR should explain a bit more than it is about NOT.

That's fine. Though, I'm not sure where best to put that. I can certainly add a comment in the code along with a similar write up in the PR for the code history. If there's another place in documentation about query conditions and the AST that needs updating I can certainly do that as well. Though I'm guessing this sort of documentation is fairly diffuse in each of the language binding implementations of query conditions.

> IIUC it is using DeMorgan's theorem to (recursively) convert a subtree to its complement. Kind of interesting how little needs to be done to implement NOT.

Correct, it's a standard negation.

> That being said -- the negate is applied while the AST is being built (in the C API). Is evaluating NOT while the tree is being built the best way to implement NOT?

Not only do I have no idea, I suspect this is vaguely unanswerable in a philosophical manner. I would downgrade to a simple "Not off the top of my head" if we relaxed it to "Can you think of a better way to implement NOT?" though.

> One thing we lose when doing it this way is that we can no longer recover the original query expression from the AST. Do we lose any ability to optimize or debug if we partially evaluate expressions this way?

This also seems a bit philosophical. Given the existing implementation, when we receive the negated AST, as far as the library is concerned, that is the original query. I'd also push back (at least within the given implementation) on the idea that this is a partial evaluation. The current implementation has two distinct phases: build the AST/query condition, then evaluate it. Currently, negation is a concept in the build phase and doesn't exist at all in the evaluation phase.

For comparison, when we AND two AND clauses together, i.e., AND(AND(A, B), AND(C, D)), we rewrite the AST such that it's AND(A, B, C, D). I don't consider NOT's implementation to be different from this optimization/behavior.
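
The flattening rewrite mentioned here can be sketched in a few lines of C++. The `Expr`, `leaf`, `combine`, and `to_str` names are hypothetical and only model the idea of splicing same-operator children at build time; they are not the library's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <vector>

// Hypothetical expression node: leaves have a name, combinations have an op.
struct Expr {
  std::string name;            // leaf label, empty for combination nodes
  std::string op;              // "AND" or "OR", empty for leaves
  std::vector<Expr> children;  // operands of a combination node
};

Expr leaf(const std::string& name) {
  Expr e;
  e.name = name;
  return e;
}

// Combine two expressions at build time. Operands that already use the
// same operator are spliced in, so AND(AND(A, B), AND(C, D)) is stored
// as AND(A, B, C, D).
Expr combine(const std::string& op, const Expr& lhs, const Expr& rhs) {
  Expr out;
  out.op = op;
  for (const Expr* side : {&lhs, &rhs}) {
    if (side->op == op) {
      out.children.insert(
          out.children.end(), side->children.begin(), side->children.end());
    } else {
      out.children.push_back(*side);
    }
  }
  return out;
}

std::string to_str(const Expr& e) {
  if (e.op.empty()) return e.name;
  std::string out = e.op + "(";
  for (std::size_t i = 0; i < e.children.size(); ++i) {
    if (i > 0) out += ", ";
    out += to_str(e.children[i]);
  }
  return out + ")";
}
```

As with negation at build time, the stored tree is equivalent to what the caller wrote but no longer has the same shape.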

> Is support for NOT complete? Lines 1118, 1721, and 2429 in query_condition.cc still have cases in their switch statements that have NOT as an error (not currently supported). I suppose they don't really matter with this implementation of NOT, but in that case, there should be a different error message, because having NOT in the AST is impossible.

Define "finished", I guess? These three error messages are a bit belt and suspender-y (or belt and bracer-y for my UK English friends) defensive coding. Going back and preventing any of the constructors from allowing a NOT to exist in the AST and adding assertions/stdx::unreachable/std::logic_error exceptions might make things more clear, though functionally the same in that NOT should not exist in the AST (as things are currently implemented).

> NOT will not be in the AST because it is evaluated while the tree is being built -- in the C API. Even if it seems like a good idea to partially evaluate NOT expressions, only being able to do that in the C API seems kind of limiting.

I'm not 100% certain I understand this part. Applying NOT conditions can be done from the internal API, the C API, and the C++ API (which should by extension cover all language bindings). This PR is just allowing for the usage in the C API which is then automatically available to the C++ API.

> There are not very many unit tests (afaict) for NOT -- and there really should be many more. There do seem to be some tests of negate() and those only seem to be at the top level, i.e., negating an entire expression and looking at the results. But there aren't any with QueryConditionCombinationOp::NOT. There should be more complicated test expressions where NOT is not the root node. Similarly there should be tests of NOT NOT, although maybe those are really integration tests. But, per earlier comment, the NOT is not in the expression, so the tests have to consider a partially-evaluated AST, the NOT is gone.

For the test coverage part of this code, the reason it looks like there are so few and shallow tests is probably due to how I implemented the tests by re-using a bunch of the existing tests. Basically, I just gave the TestParams helper class a negate() implementation and then every test just runs the original and negated query conditions asserting the correct output. The nullable tests that require specifying a negated result bitmap might clarify this approach as they aren't a symmetrical negation (due to how we can't have inequality operators on NULL). The list of negated query conditions and their associated test conditions are then left implicit. The full list can be seen by reading the various versions of populate test params here, here, and here.

The NOT NOT tests are a good point. I'll add those to be explicit. Though, technically, I don't think they add anything other than an explicit assertion of correctness via demonstration of the isomorphism of De Morgan's laws?
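
A minimal C++ sketch of this testing strategy, including the NOT NOT check: each condition is run in its original and negated form over the same cells, and the two result bitmaps must be exact complements. `Cond`, `negate`, `apply`, `complements`, and `check_all_ops` are hypothetical names, not the actual TestParams helper, and the sketch assumes non-nullable integer cells (the nullable case is not symmetric).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

enum class Op { LT, LE, GT, GE, EQ, NE };

// Hypothetical single-attribute condition: <cell> <op> <value>.
struct Cond {
  Op op;
  int value;
};

Op negate_op(Op op) {
  switch (op) {
    case Op::LT: return Op::GE;
    case Op::LE: return Op::GT;
    case Op::GT: return Op::LE;
    case Op::GE: return Op::LT;
    case Op::EQ: return Op::NE;
    case Op::NE: return Op::EQ;
  }
  return op;  // not reached
}

Cond negate(const Cond& c) { return Cond{negate_op(c.op), c.value}; }

bool eval(const Cond& c, int cell) {
  switch (c.op) {
    case Op::LT: return cell < c.value;
    case Op::LE: return cell <= c.value;
    case Op::GT: return cell > c.value;
    case Op::GE: return cell >= c.value;
    case Op::EQ: return cell == c.value;
    case Op::NE: return cell != c.value;
  }
  return false;  // not reached
}

// One result byte per cell: 1 if the cell matches, 0 otherwise.
std::vector<std::uint8_t> apply(const Cond& c, const std::vector<int>& cells) {
  std::vector<std::uint8_t> bitmap;
  for (int cell : cells) bitmap.push_back(eval(c, cell) ? 1 : 0);
  return bitmap;
}

// True when exactly one of a[i], b[i] is set for every cell.
bool complements(const std::vector<std::uint8_t>& a,
                 const std::vector<std::uint8_t>& b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i)
    if (a[i] + b[i] != 1) return false;
  return true;
}

// Runs every operator in original, negated, and doubly negated form.
bool check_all_ops(const std::vector<int>& cells, int value) {
  for (Op op : {Op::LT, Op::LE, Op::GT, Op::GE, Op::EQ, Op::NE}) {
    Cond c{op, value};
    if (!complements(apply(c, cells), apply(negate(c), cells))) return false;
    if (apply(negate(negate(c)), cells) != apply(c, cells)) return false;
  }
  return true;
}
```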

The reason there aren't any tests with QueryConditionCombinationOp::NOT inside the AST is that it's not possible to create them. The current implementation doesn't allow NOT in the AST. Thus the only place that can be negated is the root. Changing this is certainly possible, though that PR would likely be titled "Re-implement the entire query engine" or something similar. For now, given the current design, I don't see a benefit to relaxing the "no NOT in the AST" rule. I will always be open to suggestions, but even in a few cases I can think of, allowing NOT directly in the AST would be an optimization done behind the scenes. I'll write that thought out down below for clarity.

> Does NOT of a value ever make any sense? Do users expect to be able to? Perhaps on some type they are using as a bool?

Currently, I don't think so because 100% of our conditionals are of the form $variable $op $value. Perhaps in the future, if we ever implement the $variable1 $op $variable2 ability to compare two attributes, having a negation on the value could make sense. Though, it also occurs to me that my_bool_attr1 == !my_bool_attr2 is equivalent to my_bool_attr1 != my_bool_attr2, so it might also end up being an operator change regardless. Applying NOT to arbitrary (i.e., non-boolean) types a la C/C++ if semantics seems like one of those "Maybe? Does SQL even allow that?" situations.
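
The boolean equivalence mentioned above can be checked exhaustively, since there are only four input combinations. This is a tiny C++ sketch; `rewrite_holds_for_all_bools` is a hypothetical helper name.

```cpp
#include <cassert>
#include <initializer_list>

// Checks that (a == !b) and (a != b) agree for every pair of booleans,
// i.e. a negated operand can be folded into the comparison operator.
bool rewrite_holds_for_all_bools() {
  for (bool a : {false, true}) {
    for (bool b : {false, true}) {
      if ((a == !b) != (a != b)) return false;
    }
  }
  return true;
}
```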

A thought on why all of the above might be completely wrong

One interesting thought I had while writing everything above is that our current AST/query condition implementation is significantly faster when applying AND combinations as opposed to OR combinations. If we allowed NOT into the AST, I'm pretty sure we could use De Morgan's laws to implement any arbitrary AST as a combination of AND and NOT instead of the current AND and OR approach. With the assumption that NOT would be easier/faster to implement than OR, I could see switching things around for that particular case. Come to think of it, I don't think we even need to allow NOT in the tree as the current implementation of NOT should work to satisfy De Morgan's laws. I may have just been nerd sniped into poking at the current implementation to rewrite the tree at evaluation time to see if that actually works.
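
A minimal C++ sketch of the AND-plus-NOT idea, working directly on result bitmaps: OR is computed as NOT(AND(NOT a, NOT b)). The `Bitmap` alias and function names are hypothetical, both inputs are assumed to have the same length, and this models only plain 0/1 bitmaps without nulls or counts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bitmap = std::vector<std::uint8_t>;  // one 0/1 byte per cell

Bitmap bitmap_not(const Bitmap& a) {
  Bitmap out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] ? 0 : 1;
  return out;
}

// Assumes a.size() == b.size().
Bitmap bitmap_and(const Bitmap& a, const Bitmap& b) {
  Bitmap out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = (a[i] && b[i]) ? 1 : 0;
  return out;
}

// Reference implementation, used only to compare results.
Bitmap bitmap_or(const Bitmap& a, const Bitmap& b) {
  Bitmap out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = (a[i] || b[i]) ? 1 : 0;
  return out;
}

// De Morgan: a OR b == NOT(NOT a AND NOT b).
Bitmap or_via_and_not(const Bitmap& a, const Bitmap& b) {
  return bitmap_not(bitmap_and(bitmap_not(a), bitmap_not(b)));
}
```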

davisp commented Feb 6, 2023

tl;dr I’m pretty sure the counting bitmap aspect of apply sparse prevents NOT from being allowed in the AST for at least that evaluation path.

So, I tried implementing the NAND approach I described above and ran into two issues.

The first is that our handling of NULL comparisons is not symmetric with respect to negation. For instance, we can’t blindly invert the result of “foo < 5” when foo is null. However, we can blindly invert “foo == NULL”, so we also can’t just skip the inversion whenever foo’s validity buffer is zero. I’m pretty sure this could be overcome with fairly minor changes to the existing implementation.
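
The asymmetry can be shown with a small C++ sketch. It assumes that an ordering comparison never matches a NULL cell, which is how the behaviour is described in this thread; `Cell` and the helper functions are hypothetical and are not the library's evaluation code.

```cpp
#include <cassert>
#include <optional>

using Cell = std::optional<int>;  // std::nullopt models a NULL cell

// Assumed semantics: ordering comparisons never match a NULL cell.
bool lt(const Cell& c, int v) { return c.has_value() && *c < v; }
bool ge(const Cell& c, int v) { return c.has_value() && *c >= v; }

// NULL tests are well defined for every cell.
bool is_null(const Cell& c) { return !c.has_value(); }
bool is_not_null(const Cell& c) { return c.has_value(); }

// True when flipping the result bit of "c < v" gives the same answer as
// evaluating the negated condition "c >= v".
bool blind_invert_ok_lt(const Cell& c, int v) {
  return (!lt(c, v)) == ge(c, v);
}

// Same check for "c == NULL" versus "c != NULL".
bool blind_invert_ok_null(const Cell& c) {
  return (!is_null(c)) == is_not_null(c);
}
```

For a NULL cell, flipping the bit of “foo < 5” yields a match while “foo >= 5” does not, yet flipping the bit of “foo == NULL” is correct, so neither "always invert" nor "never invert on NULL" works as a blanket rule.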

However, the second issue is how our “counting bitmaps” work in the apply sparse implementation. If I’m not mistaken, this logic isn’t invertible at all based on how the counters interact. Perhaps there’s a clever solution here, but I’m pretty sure we have an insurmountable issue around false results truncating our boolean signal by setting the result bitmap to zero.
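
One possible reading of the truncation problem, sketched in C++ under a loudly stated assumption: the result bitmap stores a per-cell count, and a cell that fails a condition has its count set to zero. `CountBitmap` and `apply_condition` are hypothetical and only model that assumption, not the actual apply sparse code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using CountBitmap = std::vector<std::uint64_t>;  // per-cell counts

// Assumed model: cells that fail the condition are zeroed,
// cells that match keep their count. Assumes equal lengths.
CountBitmap apply_condition(CountBitmap counts,
                            const std::vector<bool>& matches) {
  for (std::size_t i = 0; i < counts.size(); ++i) {
    if (!matches[i]) counts[i] = 0;
  }
  return counts;
}
```

Under this model, inputs with counts {2, 1} and {5, 1} both produce {0, 1} when the first cell fails, so the original count cannot be recovered from the result bitmap alone, and a later inversion has nothing to restore.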

For the first issue, I can see multiple approaches. Having a separate validity buffer that we carefully update in multiple places to indicate when negation is appropriate is the simplest. Then there’s an obvious optimization around merging the invertibility boolean array into the result bitmap by treating it as 8 boolean flags per cell instead of uint8 values of zero or one. Granted, that approach doesn’t help at all for the sparse version.

For the sparse version, I could maybe see some sort of signed counting bitmap maybe something something perhaps? I’m not even convinced this is plausible. This half formed vague idea should probably be ignored. The end result being, I don’t think we can have negations in this evaluation path.

@KiterLuc KiterLuc self-requested a review February 7, 2023 14:09
@davisp davisp force-pushed the abigalekim/sc-24503/qc-not-support branch from 417ba87 to dd9ef69 Compare February 8, 2023 16:30
@davisp davisp self-assigned this Feb 22, 2023
@davisp davisp force-pushed the abigalekim/sc-24503/qc-not-support branch from dd9ef69 to 7396a8b Compare February 23, 2023 17:25
tiledb/sm/c_api/tiledb.h (review thread outdated, resolved)
tiledb/sm/cpp_api/query_condition.h (review thread outdated, resolved)
@davisp davisp force-pushed the abigalekim/sc-24503/qc-not-support branch from 7396a8b to 097aa8f Compare March 3, 2023 17:56
@davisp davisp force-pushed the abigalekim/sc-24503/qc-not-support branch from 097aa8f to 4902c00 Compare March 3, 2023 18:01

@lums658 lums658 left a comment

LGTM

@ihnorton ihnorton merged commit f2c1374 into dev Mar 7, 2023
@ihnorton ihnorton deleted the abigalekim/sc-24503/qc-not-support branch March 7, 2023 04:35