Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common_sub_expression_eliminate exists bug #4575

Closed
jackwener opened this issue Dec 10, 2022 · 7 comments · Fixed by #4733
Closed

common_sub_expression_eliminate exists bug #4575

jackwener opened this issue Dec 10, 2022 · 7 comments · Fixed by #4733
Labels
bug Something isn't working

Comments

@jackwener
Copy link
Member

jackwener commented Dec 10, 2022

Describe the bug

It will make some strange problem.

Error: Internal("Optimizer rule 'common_sub_expression_eliminate' failed, due to generate a different schema, original schema: 

DFSchema { fields: [
DFField { qualifier: Some(\"t1\"), field: Field { name: \"t1_id\", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t1\"), field: Field { name: \"t1_name\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t1\"), field: Field { name: \"t1_int\", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t2\"), field: Field { name: \"t2_id\", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t2\"), field: Field { name: \"t2_name\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t2\"), field: Field { name: \"t2_int\", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {} }, new schema: 

DFSchema { fields: [
DFField { qualifier: None, field: Field { name: \"t2.t2_int < UInt32(10)UInt32(10)t2.t2_int\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t1\"), field: Field { name: \"t1_id\", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t1\"), field: Field { name: \"t1_name\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t1\"), field: Field { name: \"t1_int\", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t2\"), field: Field { name: \"t2_id\", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t2\"), field: Field { name: \"t2_name\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, 
DFField { qualifier: Some(\"t2\"), field: Field { name: \"t2_int\", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {} }")

original:
"Filter: t2.t2_int < Int64(10) OR t1.t1_int > Int64(2) AND t2.t2_name != Utf8("w")
  Left Join: t1.t1_id = t2.t2_id
    TableScan: t1
    TableScan: t2"

after:
"Filter: (t2.t2_int < UInt32(10)UInt32(10)t2.t2_int AS t2.t2_int < UInt32(10) OR t1.t1_int > UInt32(2)) AND (t2.t2_int < UInt32(10)UInt32(10)t2.t2_int AS t2.t2_int < UInt32(10) OR t2.t2_name != Utf8("w"))
  Projection: t2.t2_int < UInt32(10) AS t2.t2_int < UInt32(10)UInt32(10)t2.t2_int, t1.t1_id, t1.t1_name, t1.t1_int, t2.t2_id, t2.t2_name, t2.t2_int
    Inner Join: t1.t1_id = t2.t2_id
      TableScan: t1 projection=[t1_id, t1_name, t1_int]
      TableScan: t2 projection=[t2_id, t2_name, t2_int]"

To Reproduce
Steps to reproduce the behavior:

Expected behavior
A clear and concise description of what you expected to happen.

Additional context
Add any other context about the problem here.

@jackwener
Copy link
Member Author

multiple_or_predicates() generate nasty result, look like this bug cause it. #4487 (comment) it's a related comment

@jackwener
Copy link
Member Author

Current this method is depend on the Self::desc_expr(expr)

let desc = Self::desc_expr(expr);

It will cause some strange expr.

id_array:

[
(15, "(t2.t2_int < UInt32(10) OR t1.t1_int > UInt32(2)) AND (t2.t2_int < UInt32(10) OR t2.t2_name != Utf8(\"w\"))t2.t2_int < UInt32(10) OR t2.t2_name != Utf8(\"w\")t2.t2_name != Utf8(\"w\")Utf8(\"w\")t2.t2_namet2.t2_int < UInt32(10)UInt32(10)t2.t2_intt2.t2_int < UInt32(10) OR t1.t1_int > UInt32(2)t1.t1_int > UInt32(2)UInt32(2)t1.t1_intt2.t2_int < UInt32(10)UInt32(10)t2.t2_int")
, (7, "t2.t2_int < UInt32(10) OR t1.t1_int > UInt32(2)t1.t1_int > UInt32(2)UInt32(2)t1.t1_intt2.t2_int < UInt32(10)UInt32(10)t2.t2_int")
, (3, "t2.t2_int < UInt32(10)UInt32(10)t2.t2_int")
, (1, "")
, (2, "")
, (6, "t1.t1_int > UInt32(2)UInt32(2)t1.t1_int")
, (4, "")
, (5, "")
, (14, "t2.t2_int < UInt32(10) OR t2.t2_name != Utf8(\"w\")t2.t2_name != Utf8(\"w\")Utf8(\"w\")t2.t2_namet2.t2_int < UInt32(10)UInt32(10)t2.t2_int")
, (10, "t2.t2_int < UInt32(10)UInt32(10)t2.t2_int")
, (8, "")
, (9, "")
, (13, "t2.t2_name != Utf8(\"w\")Utf8(\"w\")t2.t2_name")
, (11, "")
, (12, "")
]

id_array/arrays_list -> affected_id -> project_exprs/predicate

it will cause strange column/expr

@jackwener
Copy link
Member Author

problem in #3635 look like exist multiple bug.

cc @liukun4515 @andygrove @alamb @alex-natzka

@alamb
Copy link
Contributor

alamb commented Dec 10, 2022

cc @waynexia whom I think contributed the optimization originally, and perhaps might have some suggestions

@alex-spies
Copy link
Contributor

Hi, I'm afraid I won't have time to look into this. I did go over the error message though, here's what I think is happening:

  • I think some other optimization rule than common_sub_expression_eliminate transforms the filter
    t2.t2_int < Int64(10) OR t1.t1_int > Int64(2) AND t2.t2_name != Utf8("w")
    into
    (t2.t2_int < UInt32(10) OR t1.t1_int > UInt32(2)) AND (t2.t2_int < UInt32(10) OR t2.t2_name != Utf8("w"))
    which is weird but technically not wrong.
  • Then common_sub_expression_eliminate sees that t2.t2_int < UInt32(10) occurs twice, so it adds a projection before the filter where this expression gets computed. (This is correct and expected.) The projection creates the additional column t2.t2_int < UInt32(10)UInt32(10)t2.t2_int as alias for t2.t2_int < UInt32(10).
  • This changes the output schema, because we don't project out the column again.

The last point is the problem IMO. I guess common_sub_expression_eliminate would need to add another projection after the filter that gets rid of the extra column.

@waynexia
Copy link
Member

Appreciate the investigation @alex-natzka ❤️

  • This changes the output schema, because we don't project out the column again.

The last point is the problem IMO. I guess common_sub_expression_eliminate would need to add another projection after the filter that gets rid of the extra column.

This point sounds reasonable to me. I revisited the unit tests in common_sub_expression_eliminate, and looks like I only used the plans that have their own output schema like Projection or Aggregator. Which will ignore the additional column by itself. I plan to look into it this week (and sorry for the late reply 🥲).

@waynexia
Copy link
Member

Hi, I submit #4733 to fix this. Please check it out 🥰

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants