
Change ReturnTypeInfo to return a Field rather than DataType #14247

Open
alamb opened this issue Jan 23, 2025 · 19 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Jan 23, 2025

Is your feature request related to a problem or challenge?

It seems the design of Arrow extension types is nearing consensus and will arrive soon

The extension type information is encoded in an Arrow Field (which has both a DataType and the metadata information)

In this world, I think supporting a user-defined function for a user-defined type (e.g. a geometry type) would look like:

  1. Creating a user-defined function and declaring in the signature that it takes DataType::Binary
  2. Implementing the return_type_from_args function, which would then try to get the user-defined type information from the Binary column and verify it was correct

However, since the ReturnTypeInfo only provides DataType, the Field information will not be present and thus UDF writers will not be able to access extension type information:

pub struct ReturnTypeArgs<'a> {
    // ...
}

Describe the solution you'd like

Since we have not released return_type_from_args yet (it will be released in DataFusion 45), I would like to try to change the API before release to support user-defined types

Describe alternatives you've considered

Specifically, I would like to pass in Field instead of DataType in ReturnTypeArgs

So instead of

pub struct ReturnTypeArgs<'a> {
    /// The data types of the arguments to the function
    pub arg_types: &'a [DataType],
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
    /// Can argument `i` (ever) be null?
    pub nullables: &'a [bool],
}

I think it would be better to be

pub struct ReturnTypeArgs<'a> {
    /// The schema fields of the arguments. Fields include DataType, nullability and other information.
    pub arg_fields: &'a [Field],
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
}
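
For illustration, here is a minimal sketch (not an existing DataFusion API; the helper name and the geoarrow.wkb extension name are assumptions) of what passing Field would let a UDF author do, namely inspect the argument's metadata for the standard Arrow extension keys:

use arrow::datatypes::{DataType, Field};

/// Hypothetical helper a UDF author could write once `arg_fields` is available:
/// returns true if the field carries the extension type this UDF cares about.
/// Extension type information lives in the field metadata under the standard
/// Arrow keys "ARROW:extension:name" and "ARROW:extension:metadata".
fn is_geometry(field: &Field) -> bool {
    field.data_type() == &DataType::Binary
        && field
            .metadata()
            .get("ARROW:extension:name")
            .map(|name| name.as_str() == "geoarrow.wkb") // assumed extension name
            .unwrap_or(false)
}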

Additional context

This was inspired by a comment from @milenkovicm on the DataFusion sync up call yesterday

@alamb alamb added the enhancement New feature or request label Jan 23, 2025
@alamb
Contributor Author

alamb commented Jan 23, 2025

FYI @jayzhan211 and @kylebarron

@alamb
Contributor Author

alamb commented Jan 23, 2025

I poked around with this a little bit and I think it may be more invasive -- we probably have to clean up some other plumbing like ExprSchemable

I am not sure I have time to push on this particular issue at this time, but I wanted to file the issue

@alamb
Contributor Author

alamb commented Jan 23, 2025

I think ReturnInfo would also need to be updated to return a Field rather than a subset:

/// Return metadata for this function.
///
/// See [`ScalarUDFImpl::return_type_from_args`] for more information
#[derive(Debug)]
pub struct ReturnInfo {
    return_type: DataType,
    nullable: bool,
}

@milenkovicm
Contributor

Just checking if we're on the same page, @alamb

It would be great if this could support a use case like https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse

@jayzhan211
Contributor

jayzhan211 commented Jan 23, 2025

Given we have to_field already, maybe Fields is easier to implement

pub struct ReturnTypeArgs<'a> {
    /// The schema fields of the arguments. Fields include DataType, nullability and other information.
    pub arg_fields: Fields,
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
}

We also need to add a field method for PhysicalExpr.
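
For reference, PhysicalExpr has no such method today. Here is a minimal sketch (the free function field_of is a made-up name) of what it could be assembled from using the existing data_type and nullable methods, which also shows why those alone are not enough:

use arrow::datatypes::{Field, Schema};
use datafusion_common::Result;
use datafusion_physical_expr::PhysicalExpr;

/// Sketch only: derive a Field for a physical expression from the pieces that
/// PhysicalExpr already exposes. Note this cannot recover extension metadata,
/// which is exactly the gap a real `field` method would need to close.
fn field_of(expr: &dyn PhysicalExpr, input_schema: &Schema, name: &str) -> Result<Field> {
    Ok(Field::new(
        name,
        expr.data_type(input_schema)?,
        expr.nullable(input_schema)?,
    ))
}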

@alamb
Contributor Author

alamb commented Jan 23, 2025

Given we have to_field already, maybe Fields is easier to implement

I agree that is a good idea

It would be great if this could support a use case like https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse

Yes @milenkovicm -- that is what I hope would be possible.

Note that this particular ticket describes only part of what would be needed; I think several other APIs need to be updated to use Field rather than DataType as well, such as what @jayzhan211 identifies:

We also need to add a field method for PhysicalExpr.

@paleolimbot
Member

Just a note that this is all quite a lot of work to go through to avoid adding an Extension member to the DataType enum. It seems like you could avoid a breaking change there by adding field.data_type_maybe_extension()?

@findepi
Member

findepi commented Jan 23, 2025

However, since the ReturnTypeInfo only provides DataType, the Field information will not be present and thus UDF writers will not be able to access extension type information.

I guess we have no option and at some point need to do it the Arrow way, with the field metadata.
That implies that any type-dealing logic needs to take a (DataType, metadata) pair as the input.
For example, geometry can use Utf8 as its container type, but it won't use Utf8's functions, casts or coercion rules. Once a field is marked to be "an extension type", it cannot be construed as its container type anymore!

We can go about this in two ways

(I am aware natural gravity always leans towards small increments, irrespective of end-game results)

@kylebarron
Contributor

Just a note that this is all quite a lot of work to go through to avoid adding an Extension member to the DataType enum.

Because the DataType enum is currently exhaustive, adding an Extension variant in itself would be a breaking change.
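
A toy illustration of the semver problem (plain Rust, not Arrow code): because the enum is not marked #[non_exhaustive], downstream crates may match on it without a wildcard arm, and any new variant breaks their build.

// Library crate, version 1.0: an exhaustive enum (no #[non_exhaustive]).
pub enum Shape {
    Point,
    Line,
}

// Downstream crate: an exhaustive match the compiler accepts today. If the
// library later adds `Shape::Polygon`, this stops compiling -- the same would
// happen to every exhaustive match on DataType if an `Extension` variant were
// added.
pub fn dims(shape: &Shape) -> u8 {
    match shape {
        Shape::Point => 0,
        Shape::Line => 1,
    }
}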

@jayzhan211
Contributor

jayzhan211 commented Jan 24, 2025

try to evolve incrementally, keep using DataType in many places, but add metadata to more and more places incrementally. At the end of this, the code is not beautiful (rather hard to reason about), and in the meantime it's not correct

Evolving incrementally is the way to go. logical-types is an example of how much of a challenge it is for contributors and reviewers to keep such an effort going until it lands.

Considering logical type support: if we switch to the LogicalType trait in the future, we can implement LogicalType for those extension types, in which case we don't need Field.

// function input should be logical concept
pub struct ReturnTypeArgs<'a> {
    /// The data types of the arguments to the function
    pub arg_types: &'a [Arc<dyn LogicalType>],
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
}

// return information should be physical concept
pub struct ReturnInfo {
    // Physical Type concept
    field: Field,
}

However, given that logical types are still a work in progress, and it is not clear whether the assumptions and the design make sense at all, I think switching to Field and evolving incrementally would be a better choice

@findepi
Member

findepi commented Jan 24, 2025

Using Field directly doesn't yet answer the question "how to interpret a given field".
It's possible to poke at field.data_type() ignoring metadata, or check for specific metadata keys that may override how the field should be interpreted. Doing this in every place separately is unwieldy. So we are going to need a helper function that takes a Field and returns... "the type object" representing the type of that field -- either an arrow DataType or an extension.

Given that we will have this "type object" anyway, we should define it upfront and just use it in ReturnTypeArgs.
Maybe we can use the LogicalType trait we already have on main for that purpose.
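
A minimal sketch of what such a helper could look like (TypeClass and classify are made-up names for illustration; the lookup keys are the standard Arrow extension metadata keys):

use arrow::datatypes::{DataType, Field};

/// Illustrative "type object" a helper could return for a Field.
#[derive(Debug, Clone, PartialEq)]
pub enum TypeClass {
    /// A plain Arrow type with no extension annotation
    Native(DataType),
    /// An extension type layered on top of a container DataType
    Extension {
        name: String,
        metadata: Option<String>,
        container: DataType,
    },
}

/// Decide how a field should be interpreted, in one place.
pub fn classify(field: &Field) -> TypeClass {
    match field.metadata().get("ARROW:extension:name") {
        Some(name) => TypeClass::Extension {
            name: name.clone(),
            metadata: field.metadata().get("ARROW:extension:metadata").cloned(),
            container: field.data_type().clone(),
        },
        None => TypeClass::Native(field.data_type().clone()),
    }
}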

@alamb
Contributor Author

alamb commented Jan 24, 2025

Using Field directly doesn't yet answer the question "how to interpret given field".

I agree -- but at least it makes it possible to get the info

So we gonna have a helper function that takes a Field and returns... "the type object" representing the type of that field -- either arrow DataType or extension.

If the goal is to have implementations of extension types in the core of DataFusion, I agree with this statement and that is an excellent point.

However, in my mind the goal is actually to have almost no specific extension type support in DataFusion itself, but instead have all the implementations live elsewhere (e.g. all geospatial packages live in some other repo, and there is no geospatial knowledge in the core of DataFusion)

@findepi
Member

findepi commented Jan 24, 2025

I am fine with DF not shipping extension types (ie no extension types until we add them explicitly in #12644).
Let's look at the example. I have Field information for a and b columns, both are DataType::Binary.
What should be the behavior of a = b according to DF core's logic?
The naive answer would be that, given the same DataType, the expression is valid and Arrow's comparison function should be used.
However, if one (or both) of them happen to be a custom-provided extension type (e.g. JSON, VARIANT, ST_Geometry), this logic should not be used at all, not even as a fallback.

So the bare minimum is -- given a Field, DF core needs to understand whether this is a type it knows about or a type it doesn't know about.... So we're almost back to explicit extension types.

@paleolimbot
Member

Also a note that using a Field would require a serialization/deserialization every time the extension type is used (whereas some core "datatype" based on the ExtensionType is parsed once on the way in and serialized once on the way out).

What should be the behavior of a = b according to DF core's logic?

Even if it doesn't know about extension types, it could always dispatch on extension.name, and otherwise determine type equality based on extension.name and extension.metadata being byte-for-byte equal. (An ExtensionType implementation, perhaps registered with the session, could provide smarter logic.)
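
A sketch of that check (the function name is made up; the keys are the standard Arrow extension metadata keys):

use arrow::datatypes::Field;

/// Returns Some(true) if both fields carry the same extension type, i.e. their
/// extension names match and their extension metadata is byte-for-byte equal;
/// Some(false) if both are extensions but differ; None if either field is not
/// an extension type at all.
fn same_extension_type(a: &Field, b: &Field) -> Option<bool> {
    let name_a = a.metadata().get("ARROW:extension:name")?;
    let name_b = b.metadata().get("ARROW:extension:name")?;
    Some(
        name_a == name_b
            && a.metadata().get("ARROW:extension:metadata")
                == b.metadata().get("ARROW:extension:metadata"),
    )
}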

@findepi
Member

findepi commented Jan 24, 2025

If we check & interpret the extension.name, then we're at home:

Extension {
    name: &'a str,
    parameters: &'a [TypeParameter<'a>],
},

@alamb
Contributor Author

alamb commented Jan 24, 2025

I am fine with DF not shipping extension types (ie no extension types until we add them explicitly in #12644). Let's look at the example. I have Field information for a and b columns, both are DataType::Binary. What should be the behavior of a = b according to DF core's logic? The naive answer would be that, given the same DataType, the expression is valid and Arrow's comparison function should be used.

I agree that the code as written today would compare the columns as binary rather than the user defined type.

However, if one (or both) of them happen to be a custom-provided extension type (e.g. JSON, VARIANT, ST_Geometry), this logic should not be used at all, not even as a fallback.

I also agree with this.

So the bare minimum is -- given a Field, DF core needs to understand whether this is a type it knows about or a type it doesn't know about.... So we're almost back to explicit extension types.

Here are some possible ways we could support = on a user defined type

Option 1: User defined operators

In this case, we would let users override = similarly to a UDF, implementing support for extension types. The user's implementation could then fall back to the built-in = implementation when it didn't have special rules

I think this might get tricky when multiple extension types were used (it might be hard to hook json and geometry without a bunch of glue code)

Option 2: Custom analyzer rules

In this case, the extension could add a custom Analyzer rule that walked over all plan Exprs, finding any that were relevant to the user-defined type and rewriting the expressions to use a function (e.g. rewrite geo_col1 = geo_col2 into udf_compare_geos(geo_col1, geo_col2))

This might not be ideal as there would likely be a lot of replicated code in extensions (like matching on equality)
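
Very roughly, and only as a sketch of Option 2's rewrite step (udf_compare_geos and the is_geometry_expr predicate are hypothetical; a real implementation would be packaged as an analyzer rule that walks the whole plan):

use std::sync::Arc;

use datafusion_expr::expr::BinaryExpr;
use datafusion_expr::{Expr, Operator, ScalarUDF};

/// Rewrite `geo_col1 = geo_col2` into `udf_compare_geos(geo_col1, geo_col2)`
/// when both sides are the geometry extension type; leave other expressions
/// unchanged. `is_geometry_expr` stands in for whatever schema/metadata lookup
/// the rule would actually use.
fn rewrite_geometry_eq(
    expr: Expr,
    udf_compare_geos: &Arc<ScalarUDF>,
    is_geometry_expr: &dyn Fn(&Expr) -> bool,
) -> Expr {
    match expr {
        Expr::BinaryExpr(BinaryExpr { left, op: Operator::Eq, right })
            if is_geometry_expr(&left) && is_geometry_expr(&right) =>
        {
            udf_compare_geos.call(vec![*left, *right])
        }
        other => other,
    }
}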

@alamb
Contributor Author

alamb commented Jan 24, 2025

Also a note that using a Field would require a serialization/deserialization every time the extension type is used (whereas some core "datatype" based on the ExtensionType is parsed once on the way in and serialized once on the way out).

I think we could avoid this by doing an Expr rewrite early on to "resolve" the extension types into function calls, and from then on no additional deserialization would be needed 🤔 Though I may be misunderstanding something

@findepi
Member

findepi commented Jan 24, 2025

Yes, once the plan is lowered into "container" Arrow types (like assembly), we no longer need to remember what the logical/extension types were. Before the lowering happens, the functions and operators need to be resolved. This doesn't happen at Expr construction time, though, so IMO it calls for a strict separation of phases:

  1. Exprs are syntactical (e.g. created directly from SQL syntax or the dataframe API).
  2. Then the analyzer needs to "resolve" operators. This needs to be type aware. E.g. for builtin types it can use the = operator, but for others it needs to be told how to handle comparisons (and such), so a UDF gets inserted into the plan.
    • After this phase, the plan is "resolved" and doesn't need to remember the original types (except maybe for output fields metadata).

I think this might get tricky when multiple extension types were used (it might be hard to hook json and geometry without a bunch of glue code)

Extensible coercion rules are a tricky thing indeed. Maybe we can live without them (for now).

But there are simpler things to solve as well, like casts: if "my JSON" type uses DataType::Binary as its container type, it still wants to define its own family of casts to various other types (numbers, text, etc.). So the Cast Expr would need to resolve to some UDF when the source type or target type is not a native type.

finding any that were relevant to the user-defined type and rewriting the expressions to use a function (e.g. rewrite geo_col1 = geo_col2 into udf_compare_geos(geo_col1, geo_col2))

That sounds easy because we don't have to write this logic even once.
But once such logic is written somewhere, there is no reason for it not to be part of the DataFusion project, for the benefit of all consumers. I think such logic should belong in DataFusion.

@alamb
Contributor Author

alamb commented Jan 25, 2025

But once such logic is written somewhere, there is no reason for it not to be part of the DataFusion project, for the benefit of all consumers. I think such logic should belong in DataFusion.

I agree with this (and your other points, mostly). I do think it would likely be wise to sort out what that logic looks like outside the core crate (as a way to ensure we know what the APIs should look like). Once we have some examples, putting the APIs in the core makes a lot of sense.
