Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Expr::InList to Substrait::RexType #6604

Merged
merged 7 commits into from
Jun 20, 2023

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jun 9, 2023

Which issue does this PR close?

Ref #6410

Closes #.

Rationale for this change

Expr::InList to Substrait::RexType is not supported yet.

What changes are included in this PR?

Transform InList to OrList (AndList) then convert them to Rex with make_binary_op_scalar_func, since there is no other equivalent RexType for InList

Are these changes tested?

Unit test in datafusion/substrait/tests/roundtrip_logical_plan.rs

Are there any user-facing changes?

@github-actions github-actions bot added the substrait Changes to the substrait crate label Jun 9, 2023

let init_val = make_binary_op_scalar_func(
&substrait_expr,
&substrait_list[0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

substrait_list[0] -> .first()? This will return Option so better to check if option is defined and return Err if not

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- it is awesome to see our substrait support coming along so well

I looked at the substrait docs a little bit and it seems as if it may have native support for IN list:

https://substrait.io/expressions/specialized_record_expressions/#or-list-equality-expression

A specialized structure that is often used is a large list of possible values. In SQL, these are typically large IN lists. They can be composed from one or more fields. There are two common patterns, single value and multi value. In pseudocode they are represented as:

So perhaps ExprInList could be serialized as SingularOrList:

https://docs.rs/substrait/0.11.0/substrait/proto/expression/struct.SingularOrList.html

I wonder if @waynexia and @nseekhao have reaction or comment

@waynexia
Copy link
Member

@alamb's suggestion makes sense to me. For NOT IN there is no direct corresponding expression, maybe we can wrap an Expr::Not over SingularOrList.

Interestingly, I inspected the official java implementation of substrait via its cli tool Isthmus, it seems like they don't use SingularOrList as well. Don't know the reason. But through the doc it's applicable to use SingularOrList here.

Here is the result I get from SELECT a FROM t where a IN ('a', 'b') and SELECT a FROM t where a not IN ('a', 'b'):

`SELECT a FROM t where a IN ('a', 'b')`
{
"extensionUris": [{
  "extensionUriAnchor": 1,
  "uri": "/functions_boolean.yaml"
}, {
  "extensionUriAnchor": 2,
  "uri": "/functions_comparison.yaml"
}],
"extensions": [{
  "extensionFunction": {
    "extensionUriReference": 1,
    "functionAnchor": 0,
    "name": "or:bool"
  }
}, {
  "extensionFunction": {
    "extensionUriReference": 2,
    "functionAnchor": 1,
    "name": "equal:any_any"
  }
}],
"relations": [{
  "root": {
    "input": {
      "project": {
        "common": {
          "emit": {
            "outputMapping": [3]
          }
        },
        "input": {
          "filter": {
            "common": {
              "direct": {
              }
            },
            "input": {
              "read": {
                "common": {
                  "direct": {
                  }
                },
                "baseSchema": {
                  "names": ["A", "B", "C"],
                  "struct": {
                    "types": [{
                      "string": {
                        "typeVariationReference": 0,
                        "nullability": "NULLABILITY_NULLABLE"
                      }
                    }, {
                      "string": {
                        "typeVariationReference": 0,
                        "nullability": "NULLABILITY_NULLABLE"
                      }
                    }, {
                      "string": {
                        "typeVariationReference": 0,
                        "nullability": "NULLABILITY_NULLABLE"
                      }
                    }],
                    "typeVariationReference": 0,
                    "nullability": "NULLABILITY_REQUIRED"
                  }
                },
                "namedTable": {
                  "names": ["T"]
                }
              }
            },
            "condition": {
              "scalarFunction": {
                "functionReference": 0,
                "args": [],
                "outputType": {
                  "bool": {
                    "typeVariationReference": 0,
                    "nullability": "NULLABILITY_NULLABLE"
                  }
                },
                "arguments": [{
                  "value": {
                    "scalarFunction": {
                      "functionReference": 1,
                      "args": [],
                      "outputType": {
                        "bool": {
                          "typeVariationReference": 0,
                          "nullability": "NULLABILITY_NULLABLE"
                        }
                      },
                      "arguments": [{
                        "value": {
                          "selection": {
                            "directReference": {
                              "structField": {
                                "field": 0
                              }
                            },
                            "rootReference": {
                            }
                          }
                        }
                      }, {
                        "value": {
                          "cast": {
                            "type": {
                              "string": {
                                "typeVariationReference": 0,
                                "nullability": "NULLABILITY_NULLABLE"
                              }
                            },
                            "input": {
                              "literal": {
                                "fixedChar": "a",
                                "nullable": false,
                                "typeVariationReference": 0
                              }
                            },
                            "failureBehavior": "FAILURE_BEHAVIOR_UNSPECIFIED"
                          }
                        }
                      }],
                      "options": []
                    }
                  }
                }, {
                  "value": {
                    "scalarFunction": {
                      "functionReference": 1,
                      "args": [],
                      "outputType": {
                        "bool": {
                          "typeVariationReference": 0,
                          "nullability": "NULLABILITY_NULLABLE"
                        }
                      },
                      "arguments": [{
                        "value": {
                          "selection": {
                            "directReference": {
                              "structField": {
                                "field": 0
                              }
                            },
                            "rootReference": {
                            }
                          }
                        }
                      }, {
                        "value": {
                          "cast": {
                            "type": {
                              "string": {
                                "typeVariationReference": 0,
                                "nullability": "NULLABILITY_NULLABLE"
                              }
                            },
                            "input": {
                              "literal": {
                                "fixedChar": "b",
                                "nullable": false,
                                "typeVariationReference": 0
                              }
                            },
                            "failureBehavior": "FAILURE_BEHAVIOR_UNSPECIFIED"
                          }
                        }
                      }],
                      "options": []
                    }
                  }
                }],
                "options": []
              }
            }
          }
        },
        "expressions": [{
          "selection": {
            "directReference": {
              "structField": {
                "field": 0
              }
            },
            "rootReference": {
            }
          }
        }]
      }
    },
    "names": ["A"]
  }
}],
"expectedTypeUrls": []
}
`SELECT a FROM t where a not IN ('a', 'b')`
{
"extensionUris": [{
  "extensionUriAnchor": 1,
  "uri": "/functions_boolean.yaml"
}, {
  "extensionUriAnchor": 2,
  "uri": "/functions_comparison.yaml"
}],
"extensions": [{
  "extensionFunction": {
    "extensionUriReference": 1,
    "functionAnchor": 0,
    "name": "not:bool"
  }
}, {
  "extensionFunction": {
    "extensionUriReference": 1,
    "functionAnchor": 1,
    "name": "or:bool"
  }
}, {
  "extensionFunction": {
    "extensionUriReference": 2,
    "functionAnchor": 2,
    "name": "equal:any_any"
  }
}],
"relations": [{
  "root": {
    "input": {
      "project": {
        "common": {
          "emit": {
            "outputMapping": [3]
          }
        },
        "input": {
          "filter": {
            "common": {
              "direct": {
              }
            },
            "input": {
              "read": {
                "common": {
                  "direct": {
                  }
                },
                "baseSchema": {
                  "names": ["A", "B", "C"],
                  "struct": {
                    "types": [{
                      "string": {
                        "typeVariationReference": 0,
                        "nullability": "NULLABILITY_NULLABLE"
                      }
                    }, {
                      "string": {
                        "typeVariationReference": 0,
                        "nullability": "NULLABILITY_NULLABLE"
                      }
                    }, {
                      "string": {
                        "typeVariationReference": 0,
                        "nullability": "NULLABILITY_NULLABLE"
                      }
                    }],
                    "typeVariationReference": 0,
                    "nullability": "NULLABILITY_REQUIRED"
                  }
                },
                "namedTable": {
                  "names": ["T"]
                }
              }
            },
            "condition": {
              "scalarFunction": {
                "functionReference": 0,
                "args": [],
                "outputType": {
                  "bool": {
                    "typeVariationReference": 0,
                    "nullability": "NULLABILITY_NULLABLE"
                  }
                },
                "arguments": [{
                  "value": {
                    "scalarFunction": {
                      "functionReference": 1,
                      "args": [],
                      "outputType": {
                        "bool": {
                          "typeVariationReference": 0,
                          "nullability": "NULLABILITY_NULLABLE"
                        }
                      },
                      "arguments": [{
                        "value": {
                          "scalarFunction": {
                            "functionReference": 2,
                            "args": [],
                            "outputType": {
                              "bool": {
                                "typeVariationReference": 0,
                                "nullability": "NULLABILITY_NULLABLE"
                              }
                            },
                            "arguments": [{
                              "value": {
                                "selection": {
                                  "directReference": {
                                    "structField": {
                                      "field": 0
                                    }
                                  },
                                  "rootReference": {
                                  }
                                }
                              }
                            }, {
                              "value": {
                                "cast": {
                                  "type": {
                                    "string": {
                                      "typeVariationReference": 0,
                                      "nullability": "NULLABILITY_NULLABLE"
                                    }
                                  },
                                  "input": {
                                    "literal": {
                                      "fixedChar": "a",
                                      "nullable": false,
                                      "typeVariationReference": 0
                                    }
                                  },
                                  "failureBehavior": "FAILURE_BEHAVIOR_UNSPECIFIED"
                                }
                              }
                            }],
                            "options": []
                          }
                        }
                      }, {
                        "value": {
                          "scalarFunction": {
                            "functionReference": 2,
                            "args": [],
                            "outputType": {
                              "bool": {
                                "typeVariationReference": 0,
                                "nullability": "NULLABILITY_NULLABLE"
                              }
                            },
                            "arguments": [{
                              "value": {
                                "selection": {
                                  "directReference": {
                                    "structField": {
                                      "field": 0
                                    }
                                  },
                                  "rootReference": {
                                  }
                                }
                              }
                            }, {
                              "value": {
                                "cast": {
                                  "type": {
                                    "string": {
                                      "typeVariationReference": 0,
                                      "nullability": "NULLABILITY_NULLABLE"
                                    }
                                  },
                                  "input": {
                                    "literal": {
                                      "fixedChar": "b",
                                      "nullable": false,
                                      "typeVariationReference": 0
                                    }
                                  },
                                  "failureBehavior": "FAILURE_BEHAVIOR_UNSPECIFIED"
                                }
                              }
                            }],
                            "options": []
                          }
                        }
                      }],
                      "options": []
                    }
                  }
                }],
                "options": []
              }
            }
          }
        },
        "expressions": [{
          "selection": {
            "directReference": {
              "structField": {
                "field": 0
              }
            },
            "rootReference": {
            }
          }
        }]
      }
    },
    "names": ["A"]
  }
}],
"expectedTypeUrls": []
}

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jun 12, 2023

@waynexia To wrap OrList with Expr::Not, I think we need to transform Expr::Not to one of the RexType, how could we do this? Should we need Operator::Not and wrap it with RexType::ScalarFunction?

@jayzhan211 jayzhan211 marked this pull request as draft June 12, 2023 11:51
@nseekhao
Copy link
Contributor

@jayzhan211 I believe NOT IN will be parsed into Expr::InList with negated = true.
So I think we can implement the producer as you suggested.

  1. If matched to Expr::InList, turn the DF InList expression into RexType::SingularOrList
  2. If expr.negated is true
    i) Register "not" function: let function_anchor = _register_function("not", extension_info);
    ii) Wrap RexType::SingularOrList in RexType::ScalarFunction with function_reference: function_anchor
  3. Return the parsed expression

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jun 13, 2023

It seems Operator are all for two arguments, so instead of introducing Operator::Not, I add Not in ScalarFunctionType.

Producer: Transform to Substrait::ScalarFunction for NOT, e.g. ScalarFunction('not', InList)
Consumer: Wrap Expr::Not to InList. e.g. Expr::Not(InList). (If InList is the only expected expr, we could change the negated flag inside Expr::InList).

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211 jayzhan211 marked this pull request as ready for review June 13, 2023 10:56
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added the optimizer Optimizer rules label Jun 14, 2023
@jayzhan211 jayzhan211 requested review from alamb and nseekhao June 16, 2023 01:00
Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Thanks @jayzhan211 👍

@alamb
Copy link
Contributor

alamb commented Jun 16, 2023

Thank you @jayzhan211 and @waynexia -- I took the liberty of merging this PR with the latest master to resolve (that I caused with #6686)

@alamb
Copy link
Contributor

alamb commented Jun 16, 2023

I plan to merge this PR when the tests pass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants