
Change ReturnTypeInfo to return a Field rather than DataType #14247

Open
alamb opened this issue Jan 23, 2025 · 19 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Jan 23, 2025

Is your feature request related to a problem or challenge?

It seems the design of Arrow extension types is nearing consensus and will arrive soon

The extension type information is encoded in an Arrow Field (which has both a DataType and the metadata information)

In this world, I think supporting a user-defined function for a user-defined type (e.g. a geometry type) would look like:

  1. Creating a user-defined function and declaring in the signature that it takes DataType::Binary
  2. Implementing the return_type_from_args function, which would then try to get the user-defined type information from the Binary column and verify it was correct

However, since the ReturnTypeInfo only provides DataType, the Field information will not be present and thus UDF writers will not be able to access extension type information:

pub struct ReturnTypeArgs<'a> {
    // ...
}

Describe the solution you'd like

Since we have not released return_type_from_args yet (it will be released in DataFusion 45), I would like to try to change the API before release to support user-defined types

Describe alternatives you've considered

Specifically, I would like to pass in Field instead of DataType in ReturnTypeArgs

So instead of

pub struct ReturnTypeArgs<'a> {
    /// The data types of the arguments to the function
    pub arg_types: &'a [DataType],
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
    /// Can argument `i` (ever) be null?
    pub nullables: &'a [bool],
}

I think it would be better to be

pub struct ReturnTypeArgs<'a> {
    /// The schema fields of the arguments. Fields include DataType, nullability and other information.
    pub arg_fields: &'a [Field],
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
}
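
For illustration, here is a minimal sketch (not an existing DataFusion API; the helper name and the geoarrow.wkb extension name are assumptions) of what passing Field would let a UDF author do, namely inspect the argument's metadata for the standard Arrow extension keys:

use arrow::datatypes::{DataType, Field};

/// Hypothetical helper a UDF author could write once `arg_fields` is available:
/// returns true if the field carries the extension type this UDF cares about.
/// Extension type information lives in the field metadata under the standard
/// Arrow keys "ARROW:extension:name" and "ARROW:extension:metadata".
fn is_geometry(field: &Field) -> bool {
    field.data_type() == &DataType::Binary
        && field
            .metadata()
            .get("ARROW:extension:name")
            .map(|name| name.as_str() == "geoarrow.wkb") // assumed extension name
            .unwrap_or(false)
}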

Additional context

This was inspired by a comment from @milenkovicm on the DataFusion sync up call yesterday

@alamb alamb added the enhancement New feature or request label Jan 23, 2025
@alamb
Contributor Author

alamb commented Jan 23, 2025

FYI @jayzhan211 and @kylebarron

@alamb
Contributor Author

alamb commented Jan 23, 2025

I poked around with this a little bit and I think it may be more invasive -- we probably have to clean up some other plumbing like ExprSchemable

I am not sure I have time to push on this particular issue at this time, but I wanted to file the issue

@alamb
Contributor Author

alamb commented Jan 23, 2025

I think ReturnInfo would also need to be updated to return a Field rather than a subset:

/// Return metadata for this function.
///
/// See [`ScalarUDFImpl::return_type_from_args`] for more information
#[derive(Debug)]
pub struct ReturnInfo {
    return_type: DataType,
    nullable: bool,
}

@milenkovicm
Contributor

Just checking if we're on the same page, @alamb

It would be great if this could support a use case like https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse

@jayzhan211
Contributor

jayzhan211 commented Jan 23, 2025

Given we have to_field already, maybe Fields is easier to implement

pub struct ReturnTypeArgs<'a> {
    /// The schema fields of the arguments. Fields include DataType, nullability and other information.
    pub arg_fields: Fields,
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
}

We also need to add a field method for PhysicalExpr.
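
For reference, PhysicalExpr has no such method today. Here is a minimal sketch (the free function field_of is a made-up name) of what it could be assembled from using the existing data_type and nullable methods, which also shows why those alone are not enough:

use arrow::datatypes::{Field, Schema};
use datafusion_common::Result;
use datafusion_physical_expr::PhysicalExpr;

/// Sketch only: derive a Field for a physical expression from the pieces that
/// PhysicalExpr already exposes. Note this cannot recover extension metadata,
/// which is exactly the gap a real `field` method would need to close.
fn field_of(expr: &dyn PhysicalExpr, input_schema: &Schema, name: &str) -> Result<Field> {
    Ok(Field::new(
        name,
        expr.data_type(input_schema)?,
        expr.nullable(input_schema)?,
    ))
}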

@alamb
Contributor Author

alamb commented Jan 23, 2025

Given we have to_field already, maybe Fields is easier to implement

I agree that is a good idea

It would be great if this could support a use case like https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse

Yes @milenkovicm -- that is what I hope would be possible.

Note that this particular ticket describes only part of what would be needed; I think several other APIs need to be updated to use Field rather than DataType as well, such as what @jayzhan211 identifies:

We also need to add a field method for PhysicalExpr.

@paleolimbot
Member

Just a note that this is all quite a lot of work to go through to avoid adding an Extension member to the DataType enum. It seems like you could avoid a breaking change there by adding field.data_type_maybe_extension()?

@findepi
Member

findepi commented Jan 23, 2025

However, since the ReturnTypeInfo only provides DataType, the Field information will not be present and thus UDF writers will not be able to access extension type information.

I guess we have no option and at some point need to do it the Arrow way, with the field metadata.
That implies that any type-dealing logic needs to take a (DataType, metadata) pair as the input.
For example, geometry can use Utf8 as its container type, but it won't use Utf8's functions, casts or coercion rules. Once a field is marked to be "an extension type", it cannot be construed as its container type anymore!

We can go about this in two ways

(I am aware natural gravity always leans towards small increments, irrespective of end-game results)

@kylebarron
Contributor

Just a note that this is all quite a lot of work to go through to avoid adding an Extension member to the DataType enum.

Because the DataType enum is currently exhaustive, adding an Extension variant in itself would be a breaking change.
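
A toy illustration of the semver problem (plain Rust, not Arrow code): because the enum is not marked #[non_exhaustive], downstream crates may match on it without a wildcard arm, and any new variant breaks their build.

// Library crate, version 1.0: an exhaustive enum (no #[non_exhaustive]).
pub enum Shape {
    Point,
    Line,
}

// Downstream crate: an exhaustive match the compiler accepts today. If the
// library later adds `Shape::Polygon`, this stops compiling -- the same would
// happen to every exhaustive match on DataType if an `Extension` variant were
// added.
pub fn dims(shape: &Shape) -> u8 {
    match shape {
        Shape::Point => 0,
        Shape::Line => 1,
    }
}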

@jayzhan211
Contributor

jayzhan211 commented Jan 24, 2025

try to evolve incrementally, keep using DataType in many places, but add metadata to more and more places incrementally. At the end of this, the code is not beautiful (rather hard to reason about), and in the meantime it's not correct

Evolving incrementally is the way to go. logical-types is an example of how much of a challenge it is for contributors and reviewers to keep such an effort going until it lands.

Considering logical type support: if we switch to the LogicalType trait in the future, we can implement LogicalType for those extension types, in which case we don't need Field.

// function input should be logical concept
pub struct ReturnTypeArgs<'a> {
    /// The data types of the arguments to the function
    pub arg_types: &'a [Arc<dyn LogicalType>],
    /// ...
    pub scalar_arguments: &'a [Option<&'a ScalarValue>],
}

// return information should be physical concept
pub struct ReturnInfo {
    // Physical Type concept
    field: Field,
}

However, given that logical types are still a work in progress, and it is not clear whether the assumptions and the design make sense at all, I think switching to Field and evolving incrementally would be a better choice

@findepi
Member

findepi commented Jan 24, 2025

Using Field directly doesn't yet answer the question "how to interpret a given field".
It's possible to poke at field.data_type() ignoring metadata, or check for specific metadata keys that may override how the field should be interpreted. Doing this in every place separately is unwieldy. So we are going to need a helper function that takes a Field and returns... "the type object" representing the type of that field -- either an arrow DataType or an extension.

Given that we will have this "type object" anyway, we should define it upfront and just use it in ReturnTypeArgs.
Maybe we can use the LogicalType trait we already have on main for that purpose.
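
A minimal sketch of what such a helper could look like (TypeClass and classify are made-up names for illustration; the lookup keys are the standard Arrow extension metadata keys):

use arrow::datatypes::{DataType, Field};

/// Illustrative "type object" a helper could return for a Field.
#[derive(Debug, Clone, PartialEq)]
pub enum TypeClass {
    /// A plain Arrow type with no extension annotation
    Native(DataType),
    /// An extension type layered on top of a container DataType
    Extension {
        name: String,
        metadata: Option<String>,
        container: DataType,
    },
}

/// Decide how a field should be interpreted, in one place.
pub fn classify(field: &Field) -> TypeClass {
    match field.metadata().get("ARROW:extension:name") {
        Some(name) => TypeClass::Extension {
            name: name.clone(),
            metadata: field.metadata().get("ARROW:extension:metadata").cloned(),
            container: field.data_type().clone(),
        },
        None => TypeClass::Native(field.data_type().clone()),
    }
}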

@alamb
Contributor Author

alamb commented Jan 24, 2025

Using Field directly doesn't yet answer the question "how to interpret given field".

I agree -- but at least it makes it possible to get the info

So we gonna have a helper function that takes a Field and returns... "the type object" representing the type of that field -- either arrow DataType or extension.

If the goal is to have implementations of extension types in the core of DataFusion, I agree with this statement and that is an excellent point.

However, in my mind the goal is actually to have almost no specific extension type support in DataFusion itself, but instead have all the implementations live elsewhere (e.g. all geospatial packages live in some other repo, and there is no geospatial knowledge in the core of DataFusion)

@findepi
Member

findepi commented Jan 24, 2025

I am fine with DF not shipping extension types (ie no extension types until we add them explicitly in #12644).
Let's look at the example. I have Field information for a and b columns, both are DataType::Binary.
What should be the behavior of a = b according to DF core's logic?
The naive answer would be that, given the same DataType, the expression is valid and Arrow's comparison function should be used.
However, if one (or both) of them happen to be a custom-provided extension type (e.g. JSON, VARIANT, ST_Geometry), this logic should not be used at all, not even as a fallback.

So the bare minimum is -- given a Field, DF core needs to understand whether this is a type it knows about or a type it doesn't know about.... So we're almost back to explicit extension types.

@paleolimbot
Member

Also a note that using a Field would require a serialization/deserialization every time the extension type is used (whereas some core "datatype" based on the ExtensionType is parsed once on the way in and serialized once on the way out).

What should be the behavior of a = b according to DF core's logic?

Even if it doesn't know about extension types, it could always dispatch on extension.name, and otherwise determine type equality based on extension.name and extension.metadata being byte-for-byte equal. (An ExtensionType implementation, perhaps registered with the session, could provide smarter logic.)
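
A sketch of that check (the function name is made up; the keys are the standard Arrow extension metadata keys):

use arrow::datatypes::Field;

/// Returns Some(true) if both fields carry the same extension type, i.e. their
/// extension names match and their extension metadata is byte-for-byte equal;
/// Some(false) if both are extensions but differ; None if either field is not
/// an extension type at all.
fn same_extension_type(a: &Field, b: &Field) -> Option<bool> {
    let name_a = a.metadata().get("ARROW:extension:name")?;
    let name_b = b.metadata().get("ARROW:extension:name")?;
    Some(
        name_a == name_b
            && a.metadata().get("ARROW:extension:metadata")
                == b.metadata().get("ARROW:extension:metadata"),
    )
}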

@findepi
Member

findepi commented Jan 24, 2025

If we check & interpret the extension.name, then we're at home:

Extension {
    name: &'a str,
    parameters: &'a [TypeParameter<'a>],
},

@alamb
Contributor Author

alamb commented Jan 24, 2025

I am fine with DF not shipping extension types (ie no extension types until we add them explicitly in #12644). Let's look at the example. I have Field information for a and b columns, both are DataType::Binary. What should be the behavior of a = b according to DF core's logic? The naive answer would be that, given the same DataType, the expression is valid and Arrow's comparison function should be used.

I agree that the code as written today would compare the columns as binary rather than the user defined type.

However, if one (or both) of them happen to be a custom-provided extension type (e.g. JSON, VARIANT, ST_Geometry), this logic should not be used at all, not even as a fallback.

I also agree with this.

So the bare minimum is -- given a Field, DF core needs to understand whether this is a type it knows about or a type it doesn't know about.... So we're almost back to explicit extension types.

Here are some possible ways we could support = on a user defined type

Option 1: User defined operators

In this case, we would let users override = similarly to a UDF, implementing support for extension types. The user's implementation could then fall back to the built-in = implementation when it didn't have special rules

I think this might get tricky when multiple extension types were used (it might be hard to hook json and geometry without a bunch of glue code)

Option 2: Custom analyzer rules

In this case, the extension could add a custom Analyzer rule that walked over all plan Exprs, finding any that were relevant to the user-defined type and rewriting the expressions to use a function (e.g. rewrite geo_col1 = geo_col2 into udf_compare_geos(geo_col1, geo_col2))

This might not be ideal as there would likely be a lot of replicated code in extensions (like matching on equality)
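
Very roughly, and only as a sketch of Option 2's rewrite step (udf_compare_geos and the is_geometry_expr predicate are hypothetical; a real implementation would be packaged as an analyzer rule that walks the whole plan):

use std::sync::Arc;

use datafusion_expr::expr::BinaryExpr;
use datafusion_expr::{Expr, Operator, ScalarUDF};

/// Rewrite `geo_col1 = geo_col2` into `udf_compare_geos(geo_col1, geo_col2)`
/// when both sides are the geometry extension type; leave other expressions
/// unchanged. `is_geometry_expr` stands in for whatever schema/metadata lookup
/// the rule would actually use.
fn rewrite_geometry_eq(
    expr: Expr,
    udf_compare_geos: &Arc<ScalarUDF>,
    is_geometry_expr: &dyn Fn(&Expr) -> bool,
) -> Expr {
    match expr {
        Expr::BinaryExpr(BinaryExpr { left, op: Operator::Eq, right })
            if is_geometry_expr(&left) && is_geometry_expr(&right) =>
        {
            udf_compare_geos.call(vec![*left, *right])
        }
        other => other,
    }
}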

@alamb
Contributor Author

alamb commented Jan 24, 2025

Also a note that using a Field would require a serialization/deserialization every time the extension type is used (whereas some core "datatype" based on the ExtensionType is parsed once on the way in and serialized once on the way out).

I think we could avoid this by doing an Expr rewrite early on to "resolve" the extension types into function calls, and from then on no additional deserialization would be needed 🤔 Though I may be misunderstanding something

@findepi
Member

findepi commented Jan 24, 2025

Yes, once the plan is lowered into "container" Arrow types (like assembly), we no longer need to remember what the logical/extension types were. Before the lowering happens, the functions and operators need to be resolved. This doesn't happen at Expr construction time, though, so IMO it calls for a strict separation of phases:

  1. Exprs are syntactical (e.g. created directly from SQL syntax or the dataframe API).
  2. Then the analyzer needs to "resolve" operators. This needs to be type aware. E.g. for builtin types it can use the = operator, but for others it needs to be told how to handle comparisons (and such), so a UDF gets inserted into the plan.
    • After this phase, the plan is "resolved" and doesn't need to remember the original types (except maybe for output fields metadata).

I think this might get tricky when multiple extension types were used (it might be hard to hook json and geometry without a bunch of glue code)

Extensible coercion rules are a tricky thing indeed. Maybe we can live without them (for now).

But there are simpler things to solve as well, like casts: if "my JSON" type uses DataType::Binary as its container type, it still wants to define its own family of casts to various other types (numbers, text, etc.). So the Cast Expr would need to resolve to some UDF when the source type or target type is not a native type.

finding any that were relevant to the user-defined type and rewriting the expressions to use a function (e.g. rewrite geo_col1 = geo_col2 into udf_compare_geos(geo_col1, geo_col2))

That sounds easy because we don't have to write this logic even once.
But once such logic is written somewhere, there is no reason for it not to be part of the DataFusion project, for the benefit of all consumers. I think such logic should belong in DataFusion.

@alamb
Contributor Author

alamb commented Jan 25, 2025

But once such logic is written somewhere, there is no reason for it not to be part of the DataFusion project, for the benefit of all consumers. I think such logic should belong in DataFusion.

I agree with this (and your other points, mostly). I do think it would likely be wise to sort out what that logic looks like outside the core crate (as a way to ensure we know what the APIs should look like). Once we have some examples, putting the APIs in the core makes a lot of sense.
