Substrait: InList support for strings #6410
Comments
FWIW I think there is a limit above which DataFusion will not inline literals either: https://github.com/apache/arrow-datafusion/blob/dd3a003c1ca4e2109f33277d13f2b0b2fa500337/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L416-L429
So Substrait probably needs to implement InList anyway, though expanding out
I could not reproduce the error.
Adding the test to the logical plan roundtrip tests with DataType::Utf8 also works fine. The test fails only when the length of the list exceeds THRESHOLD_INLINE_INLIST (3).
// PASS
#[tokio::test]
async fn roundtrip_inlist_1() -> Result<()> {
    roundtrip("SELECT * FROM data WHERE f IN ('a', 'b', 'c')").await
}
// FAIL
#[tokio::test]
async fn roundtrip_inlist_2() -> Result<()> {
    roundtrip("SELECT * FROM data WHERE f IN ('a', 'b', 'c', 'd')").await
}
// ERROR LOG
// Error: NotImplemented("Unsupported expression: data.f IN ([Utf8(\"a\"), Utf8(\"b\"), Utf8(\"c\"), Utf8(\"d\")])")
Updated data.csv
Updated Schema
let schema = Schema::new(vec![
    Field::new("a", DataType::Int64, true),
    Field::new("b", DataType::Decimal128(5, 2), true),
    Field::new("c", DataType::Date32, true),
    Field::new("d", DataType::Boolean, true),
    Field::new("e", DataType::UInt32, true),
    Field::new("f", DataType::Utf8, true),
]);
Therefore I have two questions,
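The threshold behaviour seen in the two roundtrip tests above can be illustrated with a minimal, self-contained sketch. The `Expr` enum, the `in_list` helper, and `inline_in_list` are hypothetical stand-ins written for this example, not DataFusion's real types or its simplifier code; only the constant name and its value of 3 come from this thread.

```rust
// Hypothetical, simplified expression tree (NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Eq(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    InList { expr: Box<Expr>, list: Vec<Expr> },
}

// Name and value as reported in this thread.
const THRESHOLD_INLINE_INLIST: usize = 3;

// Hypothetical helper: builds `col IN (vals...)` over string literals.
fn in_list(col: &str, vals: &[&str]) -> Expr {
    Expr::InList {
        expr: Box::new(Expr::Column(col.to_string())),
        list: vals.iter().map(|v| Expr::Literal(v.to_string())).collect(),
    }
}

// Rewrites `e IN (a, b, c)` into `e = a OR e = b OR e = c` when the list
// is non-empty and no longer than the threshold. Anything else is
// returned unchanged, which is the case the producer then has to handle.
fn inline_in_list(e: Expr) -> Expr {
    match e {
        Expr::InList { expr, list }
            if !list.is_empty() && list.len() <= THRESHOLD_INLINE_INLIST =>
        {
            list.into_iter()
                .map(|item| Expr::Eq(expr.clone(), Box::new(item)))
                .reduce(|acc, eq| Expr::Or(Box::new(acc), Box::new(eq)))
                .expect("list is non-empty")
        }
        other => other,
    }
}
```

Under these assumptions, `inline_in_list(in_list("f", &["a", "b", "c"]))` yields a chain of `Or` nodes, while a four-element list comes back as `InList`, matching the PASS and FAIL cases above.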
Thank you @jayzhan211 for looking into this issue. First of all, I want to apologize for the late reply. I reread my issue description, and there was a bit of a mix-up in the example since I was testing multiple cases.
Having said that, I think there are a few things I should clarify. It seems the error I got was not caused by the list elements being of type string, but by the column being wrapped in a cast-to-varchar expression. If you would like to reproduce this, you can add these test cases:
// PASS
#[tokio::test]
async fn roundtrip_inlist_3() -> Result<()> {
    // Use `assert_expected_plan` here due to alias expression by-passing
    assert_expected_plan(
        "SELECT * FROM data WHERE CAST(b AS int) IN (1, 2, 3)",
        "Filter: data.b = Decimal128(Some(100),5,2) OR data.b = Decimal128(Some(200),5,2) OR data.b = Decimal128(Some(300),5,2)\
        \n TableScan: data projection=[a, b, c, d, e, f], partial_filters=[data.b = Decimal128(Some(100),5,2) OR data.b = Decimal128(Some(200),5,2) OR data.b = Decimal128(Some(300),5,2)]"
    ).await
}
// FAIL
#[tokio::test]
async fn roundtrip_inlist_4() -> Result<()> {
    roundtrip("SELECT * FROM data WHERE CAST(f AS varchar) IN ('a', 'b', 'c')").await
}
// ERROR LOG
// Error: NotImplemented("Unsupported expression: CAST(data.f AS Utf8) IN ([Utf8(\"a\"), Utf8(\"b\"), Utf8(\"c\")])")
This led me to misunderstand what the problem actually was, so for this issue, please ignore these two cases. To answer question 2, I think we should support
In conclusion, InList support for 'string' is not an issue (roundtrip_inlist_1 passes).
I think we can just change this number. The question then is: what number should we set?
Closed by #6604
Is your feature request related to a problem or challenge?
Currently, DataFusion turns InList into OR/EQUAL expressions only when the column is numeric. If the column is of type string, it keeps the expression as InList. Thus, running the producer with a SQL string:
produces the error:
Error: NotImplemented("Unsupported expression: customer_address.ca_county IN ([Utf8(\"Vermilion County\"), Utf8(\"Park County\"), Utf8(\"Dorchester County\"), Utf8(\"Republic County\"), Utf8(\"Hayes County\")])")
However, when running with the SQL string:
or
the producer works correctly.
Describe the solution you'd like
Support for InList for string type.
Describe alternatives you've considered
If the DataFusion logical plan optimizer turned InList (string) into OR/EQUAL expressions, this feature would not be needed in Substrait.
Additional context
No response
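As a rough illustration of what the requested producer support could look like: Substrait defines a `SingularOrList` expression (one value plus a list of options), which is a plausible target for an InList that the optimizer leaves intact. The sketch below uses hypothetical local types (`LogicalExpr`, `PlanExpr`, `to_plan_expr`) that only mirror that shape. It does not use the real DataFusion or Substrait crates, and it is not a description of how the linked fix was implemented.

```rust
// Hypothetical stand-in for the logical-plan side (NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum LogicalExpr {
    Column(String),
    Utf8(String),
    InList {
        expr: Box<LogicalExpr>,
        list: Vec<LogicalExpr>,
        negated: bool,
    },
}

// Hypothetical stand-in for the Substrait side; `SingularOrList` mirrors
// the value/options shape of the Substrait expression of the same name.
#[derive(Debug, Clone, PartialEq)]
enum PlanExpr {
    FieldRef(String),
    StringLiteral(String),
    SingularOrList {
        value: Box<PlanExpr>,
        options: Vec<PlanExpr>,
    },
    Not(Box<PlanExpr>),
}

// Converts a logical expression into the plan representation. An InList
// becomes a SingularOrList; a negated InList is wrapped in `Not`
// (an assumption made for this sketch).
fn to_plan_expr(e: &LogicalExpr) -> PlanExpr {
    match e {
        LogicalExpr::Column(name) => PlanExpr::FieldRef(name.clone()),
        LogicalExpr::Utf8(s) => PlanExpr::StringLiteral(s.clone()),
        LogicalExpr::InList { expr, list, negated } => {
            let or_list = PlanExpr::SingularOrList {
                value: Box::new(to_plan_expr(expr)),
                options: list.iter().map(to_plan_expr).collect(),
            };
            if *negated {
                PlanExpr::Not(Box::new(or_list))
            } else {
                or_list
            }
        }
    }
}
```

With a mapping of this kind, the producer would no longer depend on the optimizer having expanded the list into OR/EQUAL expressions, so the list length and the form of the left-hand expression would not matter.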