Change ReturnTypeInfo to return a Field rather than DataType #14247
Comments
FYI @jayzhan211 and @kylebarron |
I poked around with this a little bit and I think it may be more invasive -- we probably have to clean up some other plumbing as well. I am not sure I have time to push on this particular issue at this time, but I wanted to file the issue |
I think
/// Return metadata for this function.
///
/// See [`ScalarUDFImpl::return_type_from_args`] for more information
#[derive(Debug)]
pub struct ReturnInfo {
    return_type: DataType,
    nullable: bool,
} |
Just checking if we're on the same page, @alamb. It would be great if this could support use cases like https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse |
Given we have
pub struct ReturnTypeArgs<'a> {
    /// The schema fields of the arguments. Fields include DataType, nullability and other information.
    pub arg_fields: Fields,
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
}
We also need to add |
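To make the Field-based argument idea concrete, here is a minimal sketch of how a UDF could derive its return field once it receives argument fields rather than bare data types. `DataType` and `Field` below are simplified stand-ins written for this example, not the real arrow types, and `geo_area_return_field` and the extension name `example.geometry` are hypothetical. The only detail taken from Arrow itself is the reserved metadata key `ARROW:extension:name`.

```rust
use std::collections::HashMap;

/// Arrow's reserved metadata key for the extension type name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Simplified stand-in for arrow's `DataType` (assumption, not the real enum).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Binary,
    Float64,
}

/// Simplified stand-in for arrow's `Field` (assumption, not the real struct).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
    pub metadata: HashMap<String, String>,
}

impl Field {
    pub fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Field {
            name: name.to_string(),
            data_type,
            nullable,
            metadata: HashMap::new(),
        }
    }

    /// Tag this field with an extension type name in its metadata.
    pub fn with_extension(mut self, ext_name: &str) -> Self {
        self.metadata
            .insert(EXTENSION_NAME_KEY.to_string(), ext_name.to_string());
        self
    }

    /// Extension type name recorded in the metadata, if any.
    pub fn extension_name(&self) -> Option<&str> {
        self.metadata.get(EXTENSION_NAME_KEY).map(|s| s.as_str())
    }
}

/// Hypothetical return-field logic for a `geo_area(geometry) -> Float64` UDF.
/// It can only verify the extension type because it sees the whole `Field`.
pub fn geo_area_return_field(arg_fields: &[Field]) -> Result<Field, String> {
    if arg_fields.len() != 1 {
        return Err(format!("expected 1 argument, got {}", arg_fields.len()));
    }
    let arg = &arg_fields[0];
    if arg.data_type != DataType::Binary
        || arg.extension_name() != Some("example.geometry")
    {
        return Err(format!(
            "argument '{}' is not an example.geometry column",
            arg.name
        ));
    }
    // The result is nullable exactly when the input is nullable.
    Ok(Field::new("geo_area", DataType::Float64, arg.nullable))
}
```

With only a `DataType` available, the check on `extension_name()` could not be written at all, since a plain `Binary` column and a geometry column would look identical.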
I agree that is a good idea
Yes @milenkovicm -- that is what I hope would be possible. Note this particular ticket describes only part of what would be needed, I think several other APIs need to be updated to use Field rather than
|
Just a note that this is all quite a lot of work to go through to avoid adding an |
I guess we have no option and at some point need to do it the Arrow way, with the field metadata. We can go about this in two ways
(I am aware natural gravity always leans towards small increments, irrespective of end-game results) |
Because the |
Evolving incrementally is the way to go. Consider the
// function input should be logical concept
pub struct ReturnTypeArgs<'a> {
    /// The data types of the arguments to the function
    pub arg_types: &'a [Arc<dyn LogicalType>],
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
}
// return information should be physical concept
pub struct ReturnInfo {
    // Physical Type concept
    field: Field
}
However, given |
Using Given we will have this "the type object" anyway, we should define it upfront and just use in |
I agree -- but at least it makes it possible to get the info
If the goal is to have implementation of extension types in the core of datafusion, I agree with this statement and that is an excellent point. However, in my mind the goal is actually to have almost no specific extension type support in DataFusion itself, but instead have all the implementations live elsewhere (e.g. all geospatial packages live in some other repo, and there is no geospatial knowledge in the core of DataFusion) |
I am fine with DF not shipping extension types (ie no extension types until we add them explicitly in #12644). So the bare minimum is -- given a Field, DF core needs to understand whether this is a type it knows about or a type it doesn't know about.... So we're almost back to explicit extension types. |
Also a note that using a Field would require a serialization/deserialization every time the extension type is used (whereas some core "datatype" based on the
Even if it doesn't know about extension types, it could always calculate equality for dispatch on |
If we check & interpret the datafusion/datafusion/common/src/types/logical.rs Lines 33 to 36 in 6686e03
|
I agree that the code as written today would compare the columns as binary rather than the user defined type.
I also agree with this.
Here are some possible ways we could support
Option 1: User defined operators
In this case, we would let users override
I think this might get tricky when multiple extension types were used (it might be hard to hook json and geometry without a bunch of glue code)
Option 2: Custom analyzer rules
In this case the extension could add a custom analyzer rule that walked over all plan
This might not be ideal as there would likely be a lot of replicated code in extensions (like matching on equality) |
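As a rough illustration of the custom analyzer rule option, the sketch below rewrites equality between two columns of the same extension type into a call to a user supplied function. The `Expr` enum is a heavily simplified stand-in invented for this example, not DataFusion's `Expr`, and `rewrite_extension_eq`, `example.json` and `json_equals` are hypothetical names.

```rust
/// Hypothetical, heavily simplified expression tree (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// A column reference, carrying the extension type name (if any) of its field.
    Column { name: String, extension: Option<String> },
    /// Built-in equality operator.
    Eq(Box<Expr>, Box<Expr>),
    /// Logical AND.
    And(Box<Expr>, Box<Expr>),
    /// Call to a named user defined function.
    Call { func: String, args: Vec<Expr> },
}

/// Extension type name of an expression, known only for column references here.
fn extension_of(expr: &Expr) -> Option<&str> {
    match expr {
        Expr::Column { extension, .. } => extension.as_deref(),
        _ => None,
    }
}

/// Rewrite `a = b` into `<equals_udf>(a, b)` when both sides are columns of
/// the extension type `ext_name`; everything else is left unchanged.
pub fn rewrite_extension_eq(expr: Expr, ext_name: &str, equals_udf: &str) -> Expr {
    match expr {
        Expr::Eq(l, r) => {
            let l = rewrite_extension_eq(*l, ext_name, equals_udf);
            let r = rewrite_extension_eq(*r, ext_name, equals_udf);
            if extension_of(&l) == Some(ext_name) && extension_of(&r) == Some(ext_name) {
                Expr::Call {
                    func: equals_udf.to_string(),
                    args: vec![l, r],
                }
            } else {
                Expr::Eq(Box::new(l), Box::new(r))
            }
        }
        Expr::And(l, r) => Expr::And(
            Box::new(rewrite_extension_eq(*l, ext_name, equals_udf)),
            Box::new(rewrite_extension_eq(*r, ext_name, equals_udf)),
        ),
        Expr::Call { func, args } => Expr::Call {
            func,
            args: args
                .into_iter()
                .map(|a| rewrite_extension_eq(a, ext_name, equals_udf))
                .collect(),
        },
        // Column references have nothing to rewrite.
        other => other,
    }
}
```

Even in this toy form, each extension would need its own copy of the tree walk and the match on equality, which is the replicated code the comment above is worried about.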
I think we could avoid this by doing a |
Yes, once plan is lowered into "container" arrow types (like assembly), we no longer need to remember what were the logical/extension types. Before the lowering happens, the functions and operators need to be resolved. This doesn't happen at the Expr construction time, though, so it IMO calls for a strict separation of phases:
Extensible coercion rules is a tricky thing indeed. Maybe we can live without them (for now). But there are simpler things to solve as well, like casts: If "my JSON" type uses DataType::Binary as its container type, it still wants to define its own family of casts to various other types (numbers, text, etc.). So the Cast Expr would need to resolve to some UDF, when source type or target type are not native types.
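One way to picture the cast resolution described here is a lookup keyed by source and target type. The sketch below is an assumption made for illustration only: `TypeKey`, `CastPlan`, `CastRegistry`, `example.json` and `json_to_int64` are hypothetical names that do not exist in DataFusion, and it simplifies by treating every native-to-native cast as supported.

```rust
use std::collections::HashMap;

/// Hypothetical type identity: a native type or a named extension type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum TypeKey {
    Native(&'static str),
    Extension(String),
}

/// How a cast expression would be executed after resolution.
#[derive(Debug, Clone, PartialEq)]
pub enum CastPlan {
    /// Use the built-in cast kernel.
    Native,
    /// Call the named user defined function instead.
    Udf(String),
}

/// Hypothetical registry of casts that involve extension types.
#[derive(Default)]
pub struct CastRegistry {
    casts: HashMap<(TypeKey, TypeKey), String>,
}

impl CastRegistry {
    /// Register `udf` as the implementation of the cast `from -> to`.
    pub fn register(&mut self, from: TypeKey, to: TypeKey, udf: &str) {
        self.casts.insert((from, to), udf.to_string());
    }

    /// Resolve a cast: native-to-native casts use the built-in path (assumed
    /// to always exist in this sketch); anything involving an extension type
    /// must have a registered UDF, otherwise the cast is rejected with `None`.
    pub fn resolve(&self, from: &TypeKey, to: &TypeKey) -> Option<CastPlan> {
        match (from, to) {
            (TypeKey::Native(_), TypeKey::Native(_)) => Some(CastPlan::Native),
            _ => self
                .casts
                .get(&(from.clone(), to.clone()))
                .cloned()
                .map(CastPlan::Udf),
        }
    }
}
```

The point of the design is that an unregistered cast from an extension type fails at planning time rather than silently falling back to a cast of the underlying Binary container.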
That sounds easy because we don't have to write this logic even once. |
I agree with this (and your other points, mostly). I do think it would likely be wise to sort out what that logic looks like outside the core crate (as a way to ensure we know what the APIs should look like). Once we have some examples, putting the APIs in the core makes a lot of sense |
Is your feature request related to a problem or challenge?
It seems the design of Arrow extension types is nearing consensus and will arrive soon:
ExtensionType trait and CanonicalExtensionType enum arrow-rs#5822
The extension type information is encoded in an Arrow Field (doc link) (which has both a DataType and the metadata information).
In this world, supporting a user function for a user defined type (e.g. a geometry type) I think would look like
DataType::Binary
return_type_from_args function which would then try to get the user defined type information from the Binary column and verify it was correct
However, since the ReturnTypeInfo only provides DataType, the Field information will not be present and thus UDF writers will not be able to access extension type information
datafusion/datafusion/expr/src/udf.rs Line 359 in 274e535
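The information loss described above can be shown with a small sketch. `DataType` and `Field` here are simplified stand-ins written for this example, not the real arrow types, and the extension names are hypothetical; only the metadata key `ARROW:extension:name` comes from Arrow. Two columns with different extension types share the same `DataType`, so an API that passes only `DataType` cannot tell them apart, while one that passes the `Field` can.

```rust
use std::collections::HashMap;

/// Arrow's reserved metadata key for the extension type name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Simplified stand-in for arrow's `DataType` (assumption, not the real enum).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Binary,
    Utf8,
}

/// Simplified stand-in for arrow's `Field` (assumption, not the real struct).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub metadata: HashMap<String, String>,
}

/// Build a Binary-backed field tagged with an extension type name.
pub fn extension_field(name: &str, ext_name: &str) -> Field {
    let mut metadata = HashMap::new();
    metadata.insert(EXTENSION_NAME_KEY.to_string(), ext_name.to_string());
    Field {
        name: name.to_string(),
        data_type: DataType::Binary,
        metadata,
    }
}

/// What an API that only passes `DataType` can observe about a column.
pub fn seen_by_data_type_api(field: &Field) -> &DataType {
    &field.data_type
}

/// What an API that passes the whole `Field` can observe about a column.
pub fn seen_by_field_api(field: &Field) -> (&DataType, Option<&str>) {
    let ext = field.metadata.get(EXTENSION_NAME_KEY).map(|s| s.as_str());
    (&field.data_type, ext)
}
```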
Describe the solution you'd like
Since we have not released return_type_from_args yet (it will be released in DataFusion 45) I would like to try and change the API before release to support user defined types
Describe alternatives you've considered
Specifically, I would like to pass in Field instead of DataType in ReturnTypeArgs
So instead of
I think it would be better to be
Additional context
This was inspired by a comment from @milenkovicm on the DataFusion sync up call yesterday