Commenting here for myself to continue looking further, but I suspect this is happening in exprlist_to_fields in datafusion/expr/src/utils.rs, where we look towards the input fields of aggregates. The likely culprit is that this type of aggregate does not produce the same type as its input field.
Quickly took a look at another aggregate function that can return a different type. By calling count_distinct instead of array_agg and updating the input file to have strings instead of int64 values I verified it is also outputting the incorrect schema after doing the select operation.
timsaucer changed the title from "Schema incorrect after select over aggregate of expr" to "Schema incorrect after select over aggregate function that returns a different type than the input" on May 2, 2024
There are two columns named 'c', one from the aggregated input and the other from the output. exprlist_to_fields_aggregate forcibly uses the column 'c' from the input, which is of type Int64.
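The shadowing described above can be illustrated with a small self-contained toy model. This is only a sketch of name-based field resolution, not DataFusion's code: the `DataType`, `Field`, `resolve_input_first` and `resolve_output_only` names are all hypothetical, and the Int64 types for columns a and b are assumed (only c is stated to be Int64). It shows how preferring the aggregate's input schema returns the pre-aggregation type for c, while consulting only the aggregate's output returns the list type.

```rust
// Toy model only: every type and function here is hypothetical and is
// NOT part of DataFusion's API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    List(Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Look up a field by name within a single schema.
fn find<'a>(schema: &'a [Field], name: &str) -> Option<&'a Field> {
    schema.iter().find(|f| f.name == name)
}

/// Problematic resolution: the aggregate's input schema is consulted first,
/// so an output column that reuses an input name is shadowed.
fn resolve_input_first(input: &[Field], output: &[Field], name: &str) -> Option<Field> {
    find(input, name).or_else(|| find(output, name)).cloned()
}

/// Expected resolution: only the aggregate's own output schema is consulted.
fn resolve_output_only(output: &[Field], name: &str) -> Option<Field> {
    find(output, name).cloned()
}

/// Schemas before and after a modeled `array_agg(col("c")).alias("c")`.
/// The Int64 types of "a" and "b" are assumptions made for illustration.
fn example_schemas() -> (Vec<Field>, Vec<Field>) {
    let input = vec![
        field("a", DataType::Int64),
        field("b", DataType::Int64),
        field("c", DataType::Int64),
    ];
    let output = vec![
        field("a", DataType::Int64),
        field("b", DataType::Int64),
        field("c", DataType::List(Box::new(DataType::Int64))),
    ];
    (input, output)
}

fn main() {
    let (input, output) = example_schemas();
    // Input-first lookup reports the pre-aggregation type for "c".
    println!("{:?}", resolve_input_first(&input, &output, "c"));
    // Output-only lookup reports the aggregated list type for "c".
    println!("{:?}", resolve_output_only(&output, "c"));
}
```

In this model the two lookups disagree only when an output column reuses an input column's name and changes its type, which matches the condition reported in this issue.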
Describe the bug
When you have a column that is created using array_agg and the column name is the same as the column name of the original data, calling schema() produces a result that is based on the original data rather than the aggregate data.
To be more clear, if you call something like .aggregate(vec![col("a"), col("b")], vec![array_agg(col("c")).alias("c")]) and you print the schema, followed by a select_columns("c"), you will get a different schema displayed. This only happens if the column c above is a computed value, even if it is as simple as an expression col("c").alias("c").
The minimal example below shows this.
To Reproduce
The following code demonstrates that selecting the column changes the displayed schema.
The input csv is a simple table:
Produces the following output:
You can see the final table shown is correct, but the displayed schema is not.
Expected behavior
Schema should remain invariant under a trivial select of col or select_columns.
Additional context
No response