Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substrait: InList support for strings #6410

Closed
nseekhao opened this issue May 22, 2023 · 5 comments
Closed

Substrait: InList support for strings #6410

nseekhao opened this issue May 22, 2023 · 5 comments
Labels
enhancement New feature or request substrait Changes to the substrait crate

Comments

@nseekhao
Copy link
Contributor

Is your feature request related to a problem or challenge?

Currently, DataFusion turns InList into OR/EQUAL expressions only when the columns is numeric. If the columns is type string, it kept the expression as InList. Thus, running the producer with a sql string:

select int_col, double_col from alltypes_plain where CAST(date_string_col as VARCHAR) in ('a', 'b', 'c')

produces error:

Error: NotImplemented("Unsupported expression: customer_address.ca_county IN ([Utf8(\"Vermilion County\"), Utf8(\"Park County\"), Utf8(\"Dorchester County\"), Utf8(\"Republic County\"), Utf8(\"Hayes County\")])")

However, if running with sql string:

select int_col, double_col from alltypes_plain where int_col in (1, 2, 3)

or

select int_col, double_col from alltypes_plain where double_col in (1.0, 2.0, 3.0)

the producer works correctly.

Describe the solution you'd like

Support for InList for string type.

Describe alternatives you've considered

If DataFusion logical plan optimizer turns InList (string) into OR/EQUAL expressions, then this feature from substrait would not be needed.

Additional context

No response

@nseekhao nseekhao added the enhancement New feature or request label May 22, 2023
@alamb
Copy link
Contributor

alamb commented May 22, 2023

FWIW I think there is a limit above which DataFusion will not inline literals either: https://github.com/apache/arrow-datafusion/blob/dd3a003c1ca4e2109f33277d13f2b0b2fa500337/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L416-L429

So Substrait probably needs to implement InList anyways, though expanding out InList for string might also still be a good idea

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 30, 2023

I fail to reproduce the error.
I add this test in intersection.slt and run cargo test --test sqllogictests and there is no error.

query I
select int_col from alltypes_plain where CAST(date_string_col as VARCHAR) in ('a', 'b', 'c')
----

Also, adding the test to the roundtrip logical plan with DataType::UTF8 is also fine. Only failed on the test when len of the list is over THRESHOLD_INLINE_INLIST (3).

// PASS
#[tokio::test]
    async fn roundtrip_inlist_1() -> Result<()> {
        roundtrip("SELECT * FROM data WHERE f IN ('a', 'b', 'c')").await
    }
// FAIL
#[tokio::test]
    async fn roundtrip_inlist_2() -> Result<()> {
        roundtrip("SELECT * FROM data WHERE f IN ('a', 'b', 'c', 'd')").await
    }
//ERROR LOG
//Error: NotImplemented("Unsupported expression: data.f IN ([Utf8(\"a\"), Utf8(\"b\"), Utf8(\"c\"), Utf8(\"d\")])")

Updated data.csv

a,b,c,d,e,f
1,2.0,2020-01-01,false,4294967296,'a'
3,4.5,2020-01-01,true,2147483648,'b'

Updated Schema

let schema = Schema::new(vec![
            Field::new("a", DataType::Int64, true),
            Field::new("b", DataType::Decimal128(5, 2), true),
            Field::new("c", DataType::Date32, true),
            Field::new("d", DataType::Boolean, true),
            Field::new("e", DataType::UInt32, true),
            Field::new("f", DataType::Utf8, true),
        ]);

Therefore I have two questions,

  1. How to reproduce the error you mention, if it is not the way I add the test tosqllogictests, it would be nice if the command and the path to add the test case are provided.
  2. We should support the Expr:InList to Substrait when the length of the list in InList is over THRESHOLD_INLINE_INLIST, in other words, pass roundtrip_inlist_2(), right?

@nseekhao
Copy link
Contributor Author

nseekhao commented Jun 8, 2023

Thank you @jayzhan211 for looking into this issue. First of all, I want to apologize for the late reply. I also reread my issue description and there were a bit of a mixup in the example since I was testing multiple cases. Having said that I think there are a few things I should clarify.

It seems like the error I got was not from the list element being type string. But rather that the column is a cast to varchar expression. If you would like to reproduce this you can add the test case:

// PASS
#[tokio::test]
async fn roundtrip_inlist_3() -> Result<()> {
    // Use `assert_expected_plan` here due to alias expression by-passing
    assert_expected_plan(
        "SELECT * FROM data WHERE CAT(b AS int) IN (1, 2, 3)",
        "Filter: data.b = Decimal128(Some(100),5,2) OR data.b = Decimal128(Some(200),5,2) OR data.b = Decimal128(Some(300),5,2)\
        \n  TableScan: data projection=[a, b, c, d, e, f], partial_filters=[data.b = Decimal128(Some(100),5,2) OR data.b = Decimal128(Some(200),5,2) OR data.b = Decimal128(Some(300),5,2)]"
    ).await
}

// FAIL
#[tokio::test]
async fn roundtrip_inlist_4() -> Result<()> {
    roundtrip("SELECT * FROM data WHERE CAST(f AS varchar) IN ('a', 'b', 'c')").await
}
// ERROR LOG
// Error: NotImplemented("Unsupported expression: CAST(data.f AS Utf8) IN ([Utf8(\"a\"), Utf8(\"b\"), Utf8(\"c\")])")

This caused my misunderstanding in terms of what the problem actually was so, for this issue, please ignore these two cases.

To answer question 2, I think we should support InList when the length is over THRESHOLD_INLINE_INLIST in Substrait. However, I do think that the threshold of 3 is a little restrictive. Do you know if there is a way to overwrite that value while the logical plan is being optimized (before we feed it into the Substrait producer)?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 9, 2023

In conclusion, InList support for 'string' is not an issue (roundtrip_inlist_1 PASS).
Instead, we need to support

  1. cast() to varchar for column
  2. InList for length over THRESHOLD_INLINE_INLIST

Do you know if there is a way to overwrite that value while the logical plan is being optimized (before we feed it into the Substrait producer)?

I think we can just change this number
https://github.com/apache/arrow-datafusion/blob/d689daf83f387d7cf1895b2fad6b347d216e47fc/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L49

And then, the question is what number should we set?

@waynexia waynexia added the substrait Changes to the substrait crate label Nov 13, 2023
@waynexia
Copy link
Member

Closed by #6604

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request substrait Changes to the substrait crate
Projects
None yet
Development

No branches or pull requests

4 participants