Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1601

Merged
merged 1 commit into from
Jan 18, 2022

Conversation

xudong963
Copy link
Member

@xudong963 xudong963 commented Jan 17, 2022

Which issue does this PR close?

Closes #1576 #165

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jan 17, 2022
@xudong963
Copy link
Member Author

xudong963 commented Jan 17, 2022

TPCH 8

➜  benchmarks git:(fix_cast) ✗ cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 8 --batch-size 4096
 
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 8, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "./data", file_format: "tbl", mem_table: false }
Query 8 iteration 0 took 6401.8 ms
Query 8 iteration 1 took 5731.4 ms
Query 8 iteration 2 took 5830.1 ms
Query 8 avg time: 5987.78 ms

TPCH 14

➜  benchmarks git:(fix_cast) cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 14 --batch-size 4096
 
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 14, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "./data", file_format: "tbl", mem_table: false }
Query 14 iteration 0 took 4292.5 ms
Query 14 iteration 1 took 4464.5 ms
Query 14 iteration 2 took 2763.9 ms
Query 14 avg time: 3840.29 ms

@xudong963
Copy link
Member Author

One problem with this modification is that float64 to int causes a loss of accuracy. I think return_type decided by then data type is unreasonable.

I checked Postgres' behavior, it can process reasonably.

postgres=# SELECT * FROM test;
 a
---
 1
 2
 3
(3 rows)


postgres=# SELECT a,
       CASE WHEN a=1 THEN 1.2
            ELSE 0
       END
    FROM test;
 a | case
---+------
 1 |  1.2
 2 |    0
 3 |    0
(3 rows)


postgres=# SELECT a,
       CASE WHEN a=1 THEN 12
            ELSE 5.4
       END
    FROM test;
 a | case
---+------
 1 |   12
 2 |  5.4
 3 |  5.4
(3 rows)

@houqp
Copy link
Member

houqp commented Jan 18, 2022

I think return_type decided by then data type is unreasonable.

I agree, better to take both branches into account.

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job unlocking more TPCH queries!

@houqp houqp added this to the TPC-H milestone Jan 18, 2022
@houqp houqp added the bug Something isn't working label Jan 18, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xudong963 !

@@ -324,7 +325,10 @@ impl CaseExpr {

// start with the else condition, or nulls
let mut current_value: Option<ArrayRef> = if let Some(e) = &self.else_expr {
Some(e.evaluate(batch)?.into_array(batch.num_rows()))
// keep `else_expr`'s data type and return type consistent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually it would be great to separate out the coercion logic into the `

/// Create a CASE expression
pub fn case(
    expr: Option<Arc<dyn PhysicalExpr>>,
    when_thens: &[WhenThen],
    else_expr: Option<Arc<dyn PhysicalExpr>>,
) -> Result<Arc<dyn PhysicalExpr>> {
    Ok(Arc::new(CaseExpr::try_new(expr, when_thens, else_expr)?))
}

function the way it is done with BinaryExpr and binary_cast

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/expressions/binary.rs#L1106-L1114

But this is better than master so 👍

@alamb alamb merged commit 8ebc94c into apache:master Jan 18, 2022
@xudong963
Copy link
Member Author

I think return_type decided by then data type is unreasonable.

I agree, better to take both branches into account.

Next PR I'll do the thing and keep consistent with Postgres. cc @alamb

@alamb
Copy link
Contributor

alamb commented Jan 18, 2022

Filed #1609 to track

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

casting Int64 to Float64 unsuccessfully caused tpch8 to fail
3 participants