Add support for collect_list Spark aggregate function #9231
Conversation
Force-pushed from 059061f to c428066.
@mbasmanova Could you help review?
@liujiayi771 Would you add documentation for this function?
c428066
to
c4dd224
Compare
@liujiayi771 Looks good overall % comments for the tests.
    {"c0", "array_sort(a0)"},
    "SELECT c0, array_sort(array_agg(a)"
    "filter (where a is not null)) FROM tmp GROUP BY c0");
testAggregationsWithCompanion(
Is there any particular reason companion function testing is not included as part of testAggregations? The testAggregationsWithCompanion calls appear too verbose and repetitive.
CC: @kagamiori
@liujiayi771 Would you take a look at this comment?
@mbasmanova I think the possible reason is that some aggregate functions have not registered companion functions due to certain restrictions, such as when isResultTypeResolvableGivenIntermediateType is false.
What I don't understand is why we need to pass `[](auto& /*builder*/) {}` and `{{BIGINT()}}` to testAggregationsWithCompanion, and why we need to call both testAggregations and testAggregationsWithCompanion.
Why can't we just call
testAggregationsWithCompanion(
batches,
{"c0"},
{"spark_collect_list(c1)"},
{"c0", "array_sort(a0)"},
"SELECT c0, array_sort(array_agg(c1)"
"filter (where c1 is not null)) FROM tmp GROUP BY c0");
and have it test both regular functions as well as companion functions.
CC: @kagamiori
Yes, this is better, and the config parameter is also not necessary. Right now, many tests call testAggregations followed by testAggregationsWithCompanion. We need to combine these two test functions.
Let's do this refactoring in a follow-up.
velox/functions/sparksql/aggregates/tests/CollectListAggregateTest.cpp (outdated; resolved)
    {},
    {"spark_collect_list(c0)"},
    {"array_sort(a0)"},
    "SELECT array_sort(array_agg(c0)"
For simple cases like this, it might be better to provide expected results:
auto expected = makeRowVector({
    makeArrayVectorFromJson<int32_t>({"[1, 2, 4, 5]"}),
});
I still think this would be more readable.
@mbasmanova Will the result become unstable if we do not use array_sort?
Failed
Expected 1, got 1
1 extra rows, 1 missing rows
1 of extra rows:
[4,1,5,2]
1 of missing rows:
[1,2,4,5]
Or can I assume that the output will remain stable as [4,1,5,2]?
I think we can still use array_sort to ensure results are stable:
testAggregations(
{data},
{},
{"spark_collect_list(c0)"},
{"array_sort(a0)"},
{expected});
velox/functions/sparksql/aggregates/tests/CollectListAggregateTest.cpp (outdated; resolved)
velox/functions/sparksql/aggregates/tests/CollectListAggregateTest.cpp (outdated; resolved)
Force-pushed from c8cd7a2 to d0a9b2f.
@@ -88,6 +88,9 @@ int main(int argc, char** argv) {
   // coefficient. Meanwhile, DuckDB employs the sample kurtosis calculation
   // formula. The results from the two methods are completely different.
   "kurtosis",
   // When all data in a group are null, Spark returns an empty array while
   // DuckDB returns null.
   "collect_list",
@mbasmanova I think this function should not be compared with DuckDB. If the fuzzer generates a group where all the data is null, DuckDB's result will be null, while Spark will return an empty array.
I agree. We need to change Fuzzer to verify results against Spark, not DuckDB: #9270
@mbasmanova Addressed the comments for the tests.
Force-pushed from d0a9b2f to 36a7595.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include <random>
Are all these includes needed? Looks like some can be removed.
I have cleaned up the "include" and "using namespace" statements.
velox/functions/sparksql/aggregates/tests/CollectListAggregateTest.cpp (outdated; resolved)
    {},
    {"spark_collect_list(c0)"},
    {"array_sort(a0)"},
    "SELECT array_sort(array_agg(c0)"
I still think this would be more readable.
Thanks.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@@ -34,6 +35,7 @@ target_link_libraries(
   velox_functions_aggregates_test_lib
   velox_functions_spark_aggregates
   velox_hive_connector
   velox_vector_fuzzer
@mbasmanova I noticed a leftover here that should have been removed. I have removed it. Please help re-import.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mbasmanova merged this pull request in 1ba16a9.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
…tor#9231) Summary: The semantics of Spark's `collect_list` and Presto's `array_agg` are generally consistent, but there are inconsistencies in the handling of null values. Spark always ignores null values in the input, whereas Presto has a parameter that controls whether to retain them. Moreover, Presto returns null when all inputs are null, while Spark returns an empty array. Because of these differences, we need to re-implement the `array_agg` function for Spark. Pull Request resolved: facebookincubator#9231 Reviewed By: xiaoxmeng Differential Revision: D55639676 Pulled By: mbasmanova fbshipit-source-id: 958471779a1fa66dba27569a6c12538ad5489f46
#9361) Summary: In #9231, `collect_list` is added to the disable list of `duckQueryRunner`. However, this is unnecessary because DuckDB does not have an aggregate function named `collect_list`, hence it would not be compared against DuckDB. This setting is redundant. Other than this, the results verification of `collect_list` has been set to `nullptr`, so its results are not verified. But we can use a custom array verifier used by Presto's `array_agg` to check the results of itself. Pull Request resolved: #9361 Reviewed By: xiaoxmeng Differential Revision: D55744044 Pulled By: mbasmanova fbshipit-source-id: a1a94c58b2a01463261775d8b6e08b65fd986d29