fix: incorrect nullability of InList expr #6799

Merged
merged 2 commits into from
Jul 1, 2023

jonahgao

Which issue does this PR close?

None

Rationale for this change

Similar to #6786.

We can safely assume that the InList expr is non-nullable only if all of its subexpressions are non-nullable.

An example of a nullable expression is:

DataFusion CLI v27.0.0

❯ select 1 in(2, null);
+----------------------------------------------------------------------+
| Int64(1) IN (Map { iter: Iter([Literal(Int64(2)), Literal(NULL)]) }) |
+----------------------------------------------------------------------+
|                                                                      |
+----------------------------------------------------------------------+
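
To see why a NULL in the list makes the result nullable, here is a minimal Rust sketch of SQL's three-valued logic for IN. It is illustrative only: `sql_in_list` is a hypothetical helper (not part of DataFusion), and `None` stands in for SQL NULL.

```rust
/// Evaluate `needle IN (list...)` under SQL three-valued logic.
/// `None` models SQL NULL. Hypothetical helper, not DataFusion's API.
fn sql_in_list(needle: Option<i64>, list: &[Option<i64>]) -> Option<bool> {
    // Comparing a NULL needle with anything yields NULL.
    let needle = needle?;
    let mut saw_null = false;
    for item in list {
        match item {
            // A definite match wins, even if NULLs are also present.
            Some(v) if *v == needle => return Some(true),
            Some(_) => {}
            // The comparison with NULL is unknown; remember that we saw one.
            None => saw_null = true,
        }
    }
    // No match: the answer is NULL if any comparison was unknown, else false.
    if saw_null { None } else { Some(false) }
}
```

With this helper, `sql_in_list(Some(1), &[Some(2), None])` evaluates to `None`, which corresponds to the empty cell in the CLI output above.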

What changes are included in this PR?

  • fix incorrect nullability of InList expr
  • cleaner code for insert.rs

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions labels Jun 29, 2023

@alamb alamb left a comment

This is really nicely written and tested @jonahgao 👏 Thank you very much

Improve code readability

Co-authored-by: Andrew Lamb <[email protected]>

Expr::InList(InList { expr, list, .. }) => {
    // Avoid inspecting too many expressions.
    const MAX_INSPECT_LIMIT: usize = 6;

Why this number? Is it a practice borrowed from other systems?

Spark handles it more simply:

override def nullable: Boolean = children.exists(_.nullable)

@jonahgao jonahgao Jun 30, 2023

Why this number? Is it a practice borrowed from other systems?

@jackwener No. The nullable function may be called multiple times during the optimization phase,
so I think a limit is preferable to keep it from becoming excessively slow.
But I'm not quite sure what an appropriate number would be.
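
The trade-off can be sketched roughly as follows. This is a simplified illustration rather than the code in this PR: the `Expr` enum is a hypothetical stand-in for DataFusion's `Expr`, schema lookup and error handling are omitted, and only `MAX_INSPECT_LIMIT` comes from the snippet under review. It also assumes the limit counts the tested expression together with the list items.

```rust
/// Simplified stand-in for a logical expression; not DataFusion's `Expr`.
enum Expr {
    /// A literal; `None` models SQL NULL.
    Literal(Option<i64>),
    /// A column reference with a known nullability flag.
    Column { nullable: bool },
    /// `expr IN (list...)`
    InList { expr: Box<Expr>, list: Vec<Expr> },
}

/// Limit taken from the snippet under review.
const MAX_INSPECT_LIMIT: usize = 6;

/// Returns true if the expression may evaluate to NULL.
fn nullable(e: &Expr) -> bool {
    match e {
        Expr::Literal(v) => v.is_none(),
        Expr::Column { nullable } => *nullable,
        Expr::InList { expr, list } => {
            // Inspect the tested expression and the list items, but stop
            // after MAX_INSPECT_LIMIT subexpressions in total.
            let found_nullable = std::iter::once(&**expr)
                .chain(list.iter())
                .take(MAX_INSPECT_LIMIT)
                .any(nullable);
            // If some subexpressions were skipped, conservatively report
            // the InList as nullable instead of guessing.
            found_nullable || list.len() + 1 > MAX_INSPECT_LIMIT
        }
    }
}
```

The work per call is bounded by the limit, at the cost of reporting long lists as nullable even when every element is non-nullable.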

Spark handles it more simply:

override def nullable: Boolean = children.exists(_.nullable)

This looks like a caching approach.
We could do the same by precomputing nullable in the InList::new() function,
but the disadvantages are:

  • We need a new field for the InList struct
  • The precomputed nullable may not be used.
  • Calculating nullable requires input_schema
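
A rough sketch of what the precomputing alternative could look like, to make these drawbacks concrete. Everything here is hypothetical: `Schema`, `Expr`, `expr_nullable`, and this `InList` are simplified stand-ins, not DataFusion's types, and the real `InList::new()` does not take a schema.

```rust
use std::collections::HashMap;

/// Hypothetical schema: column name -> whether the column is nullable.
type Schema = HashMap<String, bool>;

/// Simplified stand-in for a logical expression; not DataFusion's `Expr`.
enum Expr {
    /// A literal; `None` models SQL NULL.
    Literal(Option<i64>),
    /// A column reference resolved against the schema.
    Column(String),
}

fn expr_nullable(e: &Expr, schema: &Schema) -> bool {
    match e {
        Expr::Literal(v) => v.is_none(),
        // Unknown columns are conservatively treated as nullable.
        Expr::Column(name) => schema.get(name).copied().unwrap_or(true),
    }
}

struct InList {
    expr: Box<Expr>,
    list: Vec<Expr>,
    /// The extra field: nullability cached at construction time.
    nullable: bool,
}

impl InList {
    /// The constructor now needs a schema, and it pays for the full scan
    /// even if `nullable()` is never called afterwards.
    fn new(expr: Expr, list: Vec<Expr>, schema: &Schema) -> Self {
        let nullable = expr_nullable(&expr, schema)
            || list.iter().any(|e| expr_nullable(e, schema));
        Self { expr: Box::new(expr), list, nullable }
    }

    /// Constant-time lookup of the cached flag.
    fn nullable(&self) -> bool {
        self.nullable
    }
}
```

All three drawbacks show up directly: the struct gains a field, the scan runs whether or not the result is needed, and the constructor signature has to accept a schema. The cached flag also reflects only the schema passed to the constructor.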

@jackwener Which solution do you prefer?

Update: It seems challenging to accomplish. I need to take a closer look.

As @jonahgao hints, adding some sort of cache / memoization is going to be challenging given Rust and how DataFusion is structured.

I think the current PR solution is good:

  1. It is well tested
  2. It is conservative (it might report an InList as nullable when it isn't, but I think that will not produce wrong results, only potentially less optimal plans)

Thus I think we should merge this PR as is, and then in a follow-on PR we can remove the limit, increase it, etc.

Thanks @jonahgao @alamb

@jackwener jackwener merged commit 84832ac into apache:main Jul 1, 2023
@jonahgao jonahgao deleted the inlist_nullability branch July 1, 2023 11:27
yukkit pushed a commit to cnosdb/arrow-datafusion that referenced this pull request Jul 5, 2023
* fix: incorrect nullability of InList expr

* Update datafusion/expr/src/expr_schema.rs

Improve code readability

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
2010YOUY01 pushed a commit to 2010YOUY01/arrow-datafusion that referenced this pull request Jul 5, 2023