Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ANSI support for Add #616

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

planga82
Copy link
Contributor

@planga82 planga82 commented Jul 1, 2024

Which issue does this PR close?

Closes #536 .

Rationale for this change

This PR adds ANSI support for Add operator. This is done by adding a wrapper to BinaryExpr to add the different behavior between Spark and Datafusion.
The wrapper is based on #593 because both PRs solve similar problems.

What changes are included in this PR?

The implementation is based on Java Math.addExact(a, b) because it is the function that uses Spark to solve this problem, but in this case using datafusion operators.

 public static int addExact(int x, int y) {
        // HD 2-12 Overflow iff both arguments have the opposite sign of the result
         int r = x + y;
         if (((x ^ r) & (y ^ r)) < 0) {
            throw new ArithmeticException("integer overflow");
         }
         return r;
    }

This PR excludes two things that I will do in subsequent PRs to avoid make this PR more complex:

  • Support for DecimalType overflow check
  • Spark try_add mode

How are these changes tested?

Unit testing

@planga82
Copy link
Contributor Author

planga82 commented Jul 1, 2024

@dharanad Here is my solution based on your PR.

@dharanad
Copy link
Contributor

dharanad commented Jul 2, 2024

@dharanad Here is my solution based on your PR.

This look fine, once this PR is merged. I will extend my solution to this to solve #535 .
@andygrove / @viirya Can you please help us with a review

let boolean_array = array
.as_any()
.downcast_ref::<BooleanArray>()
.expect("Expected BooleanArray");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are using expect here this function may panic, can instead return an errors ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solved! thanks

}

fn check_int_overflow(&self, batch: &RecordBatch, result: &ColumnarValue) -> Result<()> {
let check_overflow_expr = Arc::new(BinaryExpr::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since arrow provide overflow checked kernels (apache/arrow-rs#2643) does it makes sense to use those directly rather than re-implementing it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to review it. Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @eejbyfeldt , I've been looking at datafusion and I don't see any options to use those arrow operations from datafusion physical expressions. Do you know if this is implemented yet?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@planga82 Can we make use of this arithmetic kernel https://docs.rs/arrow/latest/arrow/compute/kernels/numeric/fn.add.html to compute the addition and throw error based on the result.

Copy link
Contributor

@dharanad dharanad Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do something like this ?

    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
        use arrow::compute::kernels::numeric::add;
        let lhs = self.left.evaluate(batch)?;
        let rhs = self.left.evaluate(batch)?;
        match (self.op, self.eval_mode) {
            (Operator::Plus, EvalMode::Ansi)  => {
              return apply(&lhs, &rhs, add)
            },
            _ => {
                self.inner.evaluate(batch)
            }
        }
    }

But the visibility of apply fn is more restrcited pub (crate). We might need to raise a PR with datafusion to make it public

Copy link
Contributor Author

@planga82 planga82 Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that if we directly use the kernels to perform the operations instead of reusing the physical datafusion expression we may lose functionality or have to reimplement it here.
From my point of view in comet we should try to translate from Spark to Datafusuion and add in the form of Wrapper the functionality that may be missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well put. I agree to what you are saying. I was thinking to override the implementation but your thoughts makes much more sense, safer & cleaner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @eejbyfeldt that we should just use the existing add_checked or add_wrapped kernels in arrow-rs that already provide the functionality that we need (unless we discover any compatibility issue compared to the JVM addExact logic). I will create an example to demonstrate how to use this and will post here later today

@andygrove
Copy link
Member

Thanks for the contribution @planga82. I am reviewing this today.

Comment on lines +155 to +157
match self.inner.evaluate(batch) {
Ok(result) => {
self.fail_on_overflow(batch, &result)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Evaluating the same expression twice is going to be expensive. We should just evaluate once, either checking for overflows, or not, depending on eval mode.

Comment on lines +75 to +91
Arc::new(BinaryExpr::new(
Arc::new(BinaryExpr::new(
self.left.clone(),
Operator::BitwiseXor,
self.inner.clone(),
)),
Operator::BitwiseAnd,
Arc::new(BinaryExpr::new(
self.right.clone(),
Operator::BitwiseXor,
self.inner.clone(),
)),
)),
Operator::Lt,
Self::zero_literal(&result.data_type())?,
));
match check_overflow_expr.evaluate(batch)? {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very expensive way of implementing this. We don't need to use DataFusion to perform simple math operations when we can just implement this in Rust directly as we process the arrays. I think we can delegate to arrow-rs though and avoid all of this. As stated in another comment, I will provide an example later today.

@andygrove
Copy link
Member

@planga82 Please see apache/datafusion#11400 that I just created against DataFusion, which possibly gives us most of what we need, although the same principals could be applied directly in Comet.

They key change was adding a fail_on_overflow config in BinaryExpr and then choosing to call either add or add_wrapped depending on that value.

@dharanad
Copy link
Contributor

dharanad commented Jul 11, 2024

@planga82 Please see apache/datafusion# v that I just created against DataFusion, which possibly gives us most of what we need, although the same principals could be applied directly in Comet.

They key change was adding a fail_on_overflow config in BinaryExpr and then choosing to call either add or add_wrapped depending on that value.

This is amazing, thanks for apache/datafusion#11400 . I have considered this idea, but I wasn't sure how to suggest this change.

Correct me if i am wrong, so this change will go in the next datafusion release and we are blocked on supporting ANSI until then ? Or can we plan to update datafusion dep updated once 11400 is merged ?

@planga82
Copy link
Contributor Author

planga82 commented Jul 11, 2024

Thank you very much @andygrove for the explanation and the PR. In addition to @dharanad question

In Spark we have three different behaviors

ANSI mode disable --> Same behavior as Datafusion, return overflow value
ANSI mode enabled

  • x + y : Fail with overflow message
  • try_add(x, y): Return null value on overflow

Should we implement this try_add behavior in Datafusion?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ANSI support for Add
4 participants