-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs #8032
Conversation
For built-in functions like For example, I would like to be able to write: df.select(vec![col("foo"), sqrt(col("bar"))])? |
This PR does not support this, as it threats every function (built-in or not) equally. To include that case, IMO this PR needs to add a new enum in the logical
I.e. at the physical level, built-in and UDFs are indistinguishable, but at the logical plan, one only knows its name (built-in), the other also knows its physical representation |
For anyone else reading along, the associated document I think is https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?usp=sharing |
@alamb thank you very much for your comments, I will now work on addressing them now. I still learning the Arc/Box/Ref, so thank you a lot for also teaching me. @andygrove , I agree with you that built-in functions should not require access to the registry. Unfortunately, doing so required some re-work, which is the reason I retracted #7967 back to draft to focus on this one first. I pushed a new commit to this PR to address this point. Specifically, that commit adds:
I am pretty happy with this PR, as IMO has the flexibility we need to expand DataFusion's pool of built-in functions to multiple input and return types. The main features of this PR:
I have not completed the valid return types of built-in math functions as this PR was already too long. Overall, I think that this has not been a pleasant experience for you @andygrove and @alamb, as I constantly open and close PRs around functions/UDFs, and for that I am really sorry. I've been hitting some design challenge after another, which requires me to go back and forth. I am still in pursuit of my original quests:
I have code for some of this, I... just... need... to... finish... the... scalar... stuff... first... 😃 |
…nctions. Deprecates "Field" as argument to the UDF declaration, since we are only using its type. This is a spin-off of #8032 with a much smaller scope, as the other one is getting to large to handle. Closes #8045 from jorgecarleitao/clean_args Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
Good luck -- at this stage of a project (when architecture is changing a
bunch) know it is hard to make small / easy to review PRs. I hope the
comments are helpful and I am sorry I don't have more time to devote to
reviews.
…On Tue, Aug 25, 2020 at 4:10 PM Jorge Leitao ***@***.***> wrote:
@alamb <https://github.com/alamb> thank you very much for your comments,
I will now work on addressing them now. I still learning the Arc/Box/Ref,
so thank you a lot for also teaching me.
@andygrove <https://github.com/andygrove> , I agree with you that
built-in functions should not require access to the registry.
Unfortunately, doing so required some re-work, which is the reason I
retracted #7967 <#7967> back to draft
to focus on this one first.
I pushed a new commit to this PR to address this point. Specifically, that
commit adds:
- a new enum with all built-in functions
- functionally gluing the logical plan with the physical plan so that
the function's return types are invariant.
- made type coercion on built-in functions to be on the physical
plane, to preserve schema invariance during planning.
I am pretty happy with this PR, as IMO has the flexibility we need to
expand DataFusion's pool of built-in functions to multiple input and return
types. The main features of this PR:
- users no longer have to pass the return type of the UDF when calling
them (the proposal)
- planning built-in functions continue to not need access to the
registry ***@***.*** <https://github.com/andygrove> 's point)
- built-in functions now support multiple input types (e.g. sqrt(f32),
sqrt(f64))
- built-in functions now support multiple return types (e.g. sqrt(f32)
-> f32, sqrt(f64) -> f64)
- coercion rules are no longer applied in the sql planning or physical
planning to built-in functions, to avoid breaking schema invariance during
planning
I have not completed the valid return types of built-in math functions as
this PR was already too long.
Overall, I think that this has not been a pleasant experience for you
@andygrove <https://github.com/andygrove> and @alamb
<https://github.com/alamb>, as I constantly open and close PRs around
functions/UDFs, and for that I am really sorry. I've been hitting some
design challenge after another, which requires me to go back and forth.
I am still in pursuit of my original quests:
- built-in aggregate functions whose logical types are known from the
physical expressions
- type coercion on aggregate functions
- built-in aggregate functions whose return types (e.g. min(f32) -> f32,
min(f64) -> f64) are directly derived from the physical plan (there is
an old fixme/todo in the code around that)
- aggregate udfs
- udfs with multiple incoming and return types, to bring them to the
same level of functionality of built-ins
- planning a udf without registering it (a-la spark) in the DF's API.
I have code for some of this, I... just... need... to... finish... the...
scalar... stuff... first... 😃
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8032 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADXZMLPUU3TGXRLTCKWWKLSCQLEFANCNFSM4QIW6KXQ>
.
|
FYI, this is what a PR to add support for f32 to mathematical expressions (keeping the return type f64) looks like: I.e. IMO with this PR we can support almost any built-in function: fixed type, variable return type, multiple input types, etc on I split built-ins from the UDFs because built-ins type is known without access to the registry, which is currently required to allow users to use them outside the |
FYI, this is what a PR would look like for the concatenate function: https://github.com/jorgecarleitao/arrow/pull/2/files using this API. I am not advocating that we follow this design (option 3 in this comment) would also be fine. My point is that regardless of which option we pick, we will need to have the functionality in this PR:
|
And finally, this is how we would add the To summarize:
The API that I am proposing here addresses all these cases out of the box. The 3 PRs in my repo,
add support to each of them at the physical and logical level. This PR also includes all the required coercion rules for this to work. E.g. |
…s for built-in functions @alamb and @andygrove , I was able to split #8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in #7967. It offers a solution that covers the three main cases: * single return type, such as `sqrt -> f64` * finite set of return types, such as `concat` (utf8 and largeUTF8) * potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type) I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large. --------------- This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in #8032 and #7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below). Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function. The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure). This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao: 1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk)) 2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](#7967 (comment))) 3. we would like users to plan built-in functions without accessing the registry (see [here](#8032 (comment)) and mailing list) 4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`). 5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized. These points are incompatible because: * 1. and 2. requires access to built-in function's return type function during planning * 4. and 5. requires access the registry to know the built-in's return type * 3. forbids us from accessing the registry during planning This PR solves this incompatibility by leveraging the following: * builtin functions have a well declared return type during planning, since they are part of the source code * builtin functions do not need to be in our function's registry The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry. The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs: 1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files 2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files 3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files Closes #8080 from jorgecarleitao/functions Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
After a long digression through the realm of built-ins, this has now been simplified and rebased against master. @andygrove and @alamb , ready for a re-review. Again, the core goal here is to allow users to use UDFs without having to worry about their return type. I've incorporated all points from @alamb and @andygrove so far:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this code -- I think the UDF usage / definition is clearer now and the code looks better. Really nice work @jorgecarleitao
} else { | ||
expr.clone() | ||
}; | ||
let mut projected_expr = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to verify my understanding -- this is a code cleanup that is not directly required for UDFs, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indirectly, it is related: expr.contains(&Expr::Wildcard)
requires PartialEq
, which was dropped in this line due to the addition of an Arc<ScalarFunction>
in this line.
However, since if expr.contains(&Expr::Wildcard)
thankfully was entirely optional, I dropped it ^_^
/// # Ok(()) | ||
/// # } | ||
/// ``` | ||
fn registry(&self) -> &dyn FunctionRegistry; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this registry specific to scalar functions or will it also be used for aggregate functions? Perhaps we should name the method either function_registry
or scalar_function_registry
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and this is a nice improvement! My only comment is that maybe some of the naming could be made more consistent, or documentation made clearer .. the things that weren't immediately clear to me (although I didn't spend much time on the review) were:
- Is the registry just for user-defined scalar functions, or does it include the built-in scalar functions as well?
- The registry has methods like
udf
but should we consider namingscalar_udf
? I wasn't sure what the plan was for user-defined aggregate functions (UDAFs)
I agree with you, @andygrove .
I have no strong opinions about naming nor UX here: I will implement whatever you agree upon :-). My main concern was to fix the data type thing :P |
@jorgecarleitao @alamb I'm catching up on the PRs today. It looks like this one is ready to merge? |
I think so, @andygrove . There is probably some renaming once we have UDAFs. For now, I think it is fine. |
…pes per argument This PR aligns UDFs registration and declaration to be consistent with our built-in functions, so that we can leverage coercion rules on their arguments. For ease of use, this PR introduces a function `create_udf` that simplifies the creation of UDFs with a fixed signature and fixed return type, so that users have a simple interface to declare them. However, underneath, the UDFs have the same capabilities as built-in functions, in that they can be as generic as built-in functions (arbitrary types, etc.). Specific achievements of this PR: * Added example (120 LOC) of how to declare and register a UDF * Deprecated the type coercer optimizer, since it was causing logical schemas to become misaligned and cause our end-to-end tests to faail when implicit casting was required, and replaced it by what we already do for built-ins * Made UDFs use the same interfaces as built-in functions Note that this PR is built on top of #8032. Closes #7967 from jorgecarleitao/clean Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
…nctions. Deprecates "Field" as argument to the UDF declaration, since we are only using its type. This is a spin-off of apache#8032 with a much smaller scope, as the other one is getting to large to handle. Closes apache#8045 from jorgecarleitao/clean_args Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
…s for built-in functions @alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases: * single return type, such as `sqrt -> f64` * finite set of return types, such as `concat` (utf8 and largeUTF8) * potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type) I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large. --------------- This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below). Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function. The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure). This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao: 1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk)) 2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment))) 3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list) 4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`). 5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized. These points are incompatible because: * 1. and 2. requires access to built-in function's return type function during planning * 4. and 5. requires access the registry to know the built-in's return type * 3. forbids us from accessing the registry during planning This PR solves this incompatibility by leveraging the following: * builtin functions have a well declared return type during planning, since they are part of the source code * builtin functions do not need to be in our function's registry The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry. The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs: 1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files 2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files 3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files Closes apache#8080 from jorgecarleitao/functions Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
See associated issue and document for details. The gist is that currently, users call UDFs through ``` df.select(scalar_functions(“my_sqrt”, vec![col(“a”)], DataType::Float64)) ``` and this PR proposes a change to ``` let functions = df.registry()?; df.select(functions.udf(“my_sqrt”, vec![col(“a”)])?) ``` so that they do not have to remember the UDFs return type when using it (and a whole lot other things for us internally). Closes apache#8032 from jorgecarleitao/registry Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
…pes per argument This PR aligns UDFs registration and declaration to be consistent with our built-in functions, so that we can leverage coercion rules on their arguments. For ease of use, this PR introduces a function `create_udf` that simplifies the creation of UDFs with a fixed signature and fixed return type, so that users have a simple interface to declare them. However, underneath, the UDFs have the same capabilities as built-in functions, in that they can be as generic as built-in functions (arbitrary types, etc.). Specific achievements of this PR: * Added example (120 LOC) of how to declare and register a UDF * Deprecated the type coercer optimizer, since it was causing logical schemas to become misaligned and cause our end-to-end tests to faail when implicit casting was required, and replaced it by what we already do for built-ins * Made UDFs use the same interfaces as built-in functions Note that this PR is built on top of apache#8032. Closes apache#7967 from jorgecarleitao/clean Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
…nctions. Deprecates "Field" as argument to the UDF declaration, since we are only using its type. This is a spin-off of apache#8032 with a much smaller scope, as the other one is getting to large to handle. Closes apache#8045 from jorgecarleitao/clean_args Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
…s for built-in functions @alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases: * single return type, such as `sqrt -> f64` * finite set of return types, such as `concat` (utf8 and largeUTF8) * potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type) I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large. --------------- This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below). Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function. The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure). This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao: 1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk)) 2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment))) 3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list) 4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`). 5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized. These points are incompatible because: * 1. and 2. requires access to built-in function's return type function during planning * 4. and 5. requires access the registry to know the built-in's return type * 3. forbids us from accessing the registry during planning This PR solves this incompatibility by leveraging the following: * builtin functions have a well declared return type during planning, since they are part of the source code * builtin functions do not need to be in our function's registry The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry. The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs: 1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files 2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files 3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files Closes apache#8080 from jorgecarleitao/functions Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
See associated issue and document for details. The gist is that currently, users call UDFs through ``` df.select(scalar_functions(“my_sqrt”, vec![col(“a”)], DataType::Float64)) ``` and this PR proposes a change to ``` let functions = df.registry()?; df.select(functions.udf(“my_sqrt”, vec![col(“a”)])?) ``` so that they do not have to remember the UDFs return type when using it (and a whole lot other things for us internally). Closes apache#8032 from jorgecarleitao/registry Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
…pes per argument This PR aligns UDFs registration and declaration to be consistent with our built-in functions, so that we can leverage coercion rules on their arguments. For ease of use, this PR introduces a function `create_udf` that simplifies the creation of UDFs with a fixed signature and fixed return type, so that users have a simple interface to declare them. However, underneath, the UDFs have the same capabilities as built-in functions, in that they can be as generic as built-in functions (arbitrary types, etc.). Specific achievements of this PR: * Added example (120 LOC) of how to declare and register a UDF * Deprecated the type coercer optimizer, since it was causing logical schemas to become misaligned and cause our end-to-end tests to faail when implicit casting was required, and replaced it by what we already do for built-ins * Made UDFs use the same interfaces as built-in functions Note that this PR is built on top of apache#8032. Closes apache#7967 from jorgecarleitao/clean Authored-by: Jorge C. Leitao <[email protected]> Signed-off-by: Andy Grove <[email protected]>
See associated issue and document for details.
The gist is that currently, users call UDFs through
and this PR proposes a change to
so that they do not have to remember the UDFs return type when using it (and a whole lot other things for us internally).