
feat: Support Spark ArraySort with lambda function #10138

Open
wants to merge 2 commits into main from array_sort

Conversation

boneanxs (Contributor) commented Jun 11, 2024

Support Spark array_sort with a lambda function for sorting elements.

Since Spark's comparison implementation differs from Presto's (see #5569), we can't directly reuse Presto's array_sort logic, which rewrites the comparator lambda into a simpler call where possible.

This PR:

  1. Moves Presto's array_sort to velox/functions/lib so that both Presto and Spark can use it.
  2. Adds a new option, nullsFirst, to allow nulls to be placed at the start of the array (needed to support Spark's sort_array function; see the sketch after this list).
  3. Extracts the common logic of SimpleComparisonMatcher into velox/functions/lib, and creates separate SimpleComparisonCheckers for Spark and Presto to match comparison functions (e.g., = is eq in Presto but equalto in Spark).
  4. Adds tests to cover the Spark rewrite logic.
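
As a rough illustration of the nullsFirst semantics, here is a standalone sketch with made-up names, not the actual Velox API; the Spark behaviors noted in the comments follow Spark's documentation for array_sort and sort_array.

#include <algorithm>
#include <optional>
#include <vector>

// Sorts ascending; nullsFirst controls whether NULL (nullopt) elements go to
// the front (Spark's sort_array in ascending order) or to the back (array_sort).
void sortNullable(std::vector<std::optional<int>>& arr, bool nullsFirst) {
  std::sort(
      arr.begin(),
      arr.end(),
      [nullsFirst](const std::optional<int>& a, const std::optional<int>& b) {
        if (a.has_value() != b.has_value()) {
          // Exactly one side is NULL; nullsFirst decides which side wins.
          return nullsFirst ? !a.has_value() : !b.has_value();
        }
        return a.has_value() && *a < *b;
      });
}

// sortNullable({2, 1, NULL}, /*nullsFirst=*/false) -> {1, 2, NULL}
// sortNullable({2, 1, NULL}, /*nullsFirst=*/true)  -> {NULL, 1, 2}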

facebook-github-bot added the CLA Signed label on Jun 11, 2024
netlify bot commented Jun 11, 2024

Deploy Preview for meta-velox canceled.

Latest commit: f800969
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67a48ee131dd6000080c8692

return rewritten;
}

VELOX_USER_FAIL(kNotSupported, lambda->toString())
boneanxs (Contributor, Author) commented Jun 11, 2024

Do we need to throw an error if the rewrite is not possible for Spark? I followed Presto's logic here, but I'm not sure it's necessary for Spark.

boneanxs (Contributor, Author) commented

@PHILO-HE @rui-mo Hey, could you please review this? We need this refactor to support the Spark array_sort function.

rui-mo (Collaborator) left a comment

Since Spark's comparison implementation differs from Presto's

Could we add a section to the PR description describing the semantic difference?

rui-mo (Collaborator) left a comment

Thanks!

velox/functions/lib/ArraySort.h (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (outdated; resolved)
velox/functions/sparksql/tests/SortArrayTest.cpp (outdated; resolved)
boneanxs (Contributor, Author) commented

Since Spark's comparison implementation differs from Presto's

Could we add a section to the PR description describing the semantic difference?

@rui-mo Done, I've addressed the comments. Please take another look.

rui-mo (Collaborator) commented Jun 14, 2024

Since Spark's comparison implementation differs from Presto's (see #5569)

@boneanxs I wonder if we are tackling the difference in NaN semantics in this PR. There is a plan in Velox to adjust its semantics, and some PRs have already been merged. Perhaps we can fix the Presto function directly; see #7237.

boneanxs (Contributor, Author) commented Jun 14, 2024

Since Spark's comparison implementation differs from Presto's (see #5569)

@boneanxs I wonder if we are tackling the difference in NaN semantics in this PR. There is a plan in Velox to adjust its semantics, and some PRs have already been merged. Perhaps we can fix the Presto function directly; see #7237.

@rui-mo We might still need Spark-specific array_sort rewrite logic even after the NaN semantics difference is fixed, because:

  1. The comparison implementations differ today: Gluten maps to Spark's comparison functions instead of Presto's, so Presto's rewriteArraySortCall can throw errors because it doesn't recognize Spark's comparison function names.

For example, for the expression array_sort(array(), (left, right) -> if (left > right, 1, if(left < right, -1, 0))), Gluten transforms it into array_sort(array(), lambda ROW<left:INTEGER,right:INTEGER> -> if(greaterthan(left, right), 1, if(lessthan(left, right), -1, 0))), while the Presto side can only recognize array_sort(array(), lambda ROW<left:INTEGER,right:INTEGER> -> if(gt(left, right), 1, if(lt(left, right), -1, 0))) (the function names differ; see the sketch after this list).

Do we have a plan to align all Spark comparison functions after the NaN semantics are unified? Other comparison functions could also have semantic differences beyond NaN, and I suspect addressing all of them would be a long-term effort.

  2. Spark could have other functions that differ from Presto's and need special handling in rewriteArraySort. Since many functions have different semantics, and some exist only in Spark, we may be able to do optimizations on the Spark side that Presto doesn't support, or vice versa. Keeping a separate rewriteArraySort gives us that flexibility.
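
A minimal standalone sketch (not Velox code; the table contents are only the name examples mentioned above) of why a shared matcher needs per-dialect comparison names:

#include <map>
#include <string>

// Hypothetical lookup tables: the same SQL operator lowers to differently
// named scalar functions in the Presto and Spark function registries.
static const std::map<std::string, std::string> kPrestoNames = {
    {"<", "lt"}, {">", "gt"}, {"=", "eq"}};
static const std::map<std::string, std::string> kSparkNames = {
    {"<", "lessthan"}, {">", "greaterthan"}, {"=", "equalto"}};

// A matcher that only knows the Presto names cannot recognize the calls
// Gluten produces, e.g. greaterthan(left, right), so the rewrite fails.
bool isComparison(
    const std::map<std::string, std::string>& names,
    const std::string& callName) {
  for (const auto& entry : names) {
    if (entry.second == callName) {
      return true;
    }
  }
  return false;
}

// isComparison(kPrestoNames, "greaterthan") -> false
// isComparison(kSparkNames, "greaterthan") -> true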

rui-mo (Collaborator) left a comment

Thanks.

velox/functions/sparksql/SimpleComparisonChecker.h (outdated; resolved)
velox/functions/sparksql/SimpleComparisonChecker.h (outdated; resolved)
velox/functions/sparksql/ArraySort.cpp (outdated; resolved)
boneanxs force-pushed the array_sort branch 2 times, most recently from b209dd1 to 901aae4, on June 20, 2024 02:30
boneanxs requested a review from rui-mo on June 20, 2024 02:31
rui-mo (Collaborator) left a comment

Thanks. Added several comments.

velox/functions/lib/ArraySort.cpp (resolved)
velox/functions/lib/ArraySort.cpp (resolved)
velox/functions/lib/ArraySort.cpp (resolved)
velox/functions/lib/ArraySort.cpp (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (resolved)
velox/functions/sparksql/ArraySort.h (outdated; resolved)
velox/functions/sparksql/SimpleComparisonMatcher.h (outdated; resolved)
velox/functions/sparksql/SimpleComparisonMatcher.h (outdated; resolved)
boneanxs changed the title from "Add rewriteArraySort for Spark" to "Support ArraySort for Spark" on Jul 2, 2024
boneanxs requested a review from rui-mo on July 2, 2024 09:02
rui-mo (Collaborator) left a comment

Thanks. Added several questions.

velox/docs/functions/spark/array.rst (resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (resolved)
prefix + "array_sort", arraySortSignatures(), makeArraySort);
prefix + "array_sort", arraySortSignatures(true), makeArraySortAsc);
exec::registerStatefulVectorFunction(
prefix + "array_sort_desc", arraySortDescSignatures(), makeArraySortDesc);
Collaborator commented:

Do we have a corresponding function for array_sort_desc in Spark?

Contributor (author) commented:

We don't have array_sort_desc in Spark, but it is required because rewriteArraySort needs it:

: prefix + "array_sort_desc";

velox/docs/functions/spark/array.rst (resolved)
boneanxs force-pushed the array_sort branch 3 times, most recently from 5c0e1f4 to 80fa6c3, on July 19, 2024 09:52
boneanxs changed the title from "Support ArraySort for Spark" to "Support Spark ArraySort with lambda function" on Jul 19, 2024
boneanxs requested a review from rui-mo on July 23, 2024 01:35
velox/docs/functions/spark/array.rst (resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/functions/lib/ArraySort.cpp (outdated; resolved)
velox/functions/lib/ArraySort.cpp (outdated; resolved)
velox/functions/lib/ArraySort.h (outdated; resolved)
velox/functions/lib/ArraySort.h (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (resolved)
velox/functions/lib/SimpleComparisonMatcher.h (outdated; resolved)
velox/functions/sparksql/ArraySort.cpp (outdated; resolved)
velox/functions/sparksql/ArraySort.cpp (outdated; resolved)
stale bot commented Oct 24, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

stale bot added the stale label on Oct 24, 2024
rui-mo (Collaborator) commented Oct 25, 2024

@boneanxs Would you like to update this PR? Thanks.

stale bot removed the stale label on Oct 25, 2024
boneanxs (Contributor, Author) commented

@boneanxs Would you like to update this PR? Thanks.

Oh, I forgot about it. Sure, I'll update it soon.

rui-mo (Collaborator) left a comment

Some comments on the documentation.

:noindex:

Returns the array sorted by values computed using specified lambda in ascending
order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end. ::
Collaborator commented:

NULL or NaN for floating type?

Contributor (author) commented:

Currently, NaN is not handled for values returned by the lambda function. Do we need to handle NaN, given that it shouldn't be returned by the lambda function?

Collaborator commented:

Do we need to handle NaN, given that it shouldn't be returned by the lambda function?

Hi @boneanxs, could you provide more details on why NaN shouldn't be returned?

Contributor (author) commented:

Oh, sorry, I overlooked this before. For array_sort with lambda functions, sorting with NaN is supported in SimpleVector.comparePrimitiveAsc and follows the rule that NaN is placed before NULL. I also added a test to cover this.

Contributor (author) commented:

Let me explain a bit more. After looking into the Presto and Spark implementations, they both say:

It returns -1, 0, or 1 as the first nullable element is less than, equal to, or greater than the second nullable element. If the comparator function returns other values (including NULL), the query will fail and raise an error

see the Presto and Spark docs. (Though Spark says it doesn't support returning null values, it doesn't throw errors for a query like SELECT array_sort(ARRAY ('bc', 'ab', 'dc'), (x, y) -> IF(x < y, 1, IF(x = y, 0, null))) in Spark 3.2, which might be a bug.)

So null values and NaN shouldn't be returned by the lambda function(T,T, int), and in SimpleComparisonMatcher we require that the return value is an integer for the match.

SimpleComparisonMatcher can optimize function(T,T, int) into function(T, U), where U is orderable (not limited to int), so it may produce float values. For example, function(float, float, int): IF(x > y, 1, IF(x < y, -1, 0)) would be optimized to function(float, float): x -> x. At that point, the results should still be the same, since both go through SimpleVector.compare to do the comparison (except that NULLs are filtered out in advance in ArraySort.sortElements to respect the nullsFirst flag). And inside SimpleVector.compare, NaN is smaller than NULL.
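
A standalone sketch (not Velox code; the names are illustrative) of the resulting order for floating-point values, with regular values ascending, then NaN, then NULL, matching the documented example array_sort(array(4.0, NULL, float('nan'), 3.0)) -> [3.0, 4.0, NaN, NULL]:

#include <algorithm>
#include <cmath>
#include <optional>
#include <vector>

// Strict weak ordering: regular floats ascending, then NaN, then NULL last.
bool ascNaNBeforeNull(
    const std::optional<float>& a,
    const std::optional<float>& b) {
  if (!a.has_value()) {
    return false; // NULL sorts after everything, including NaN.
  }
  if (!b.has_value()) {
    return true; // Any non-NULL value sorts before NULL.
  }
  const bool aNaN = std::isnan(*a);
  const bool bNaN = std::isnan(*b);
  if (aNaN || bNaN) {
    return !aNaN && bNaN; // NaN sorts after all regular values.
  }
  return *a < *b;
}

int main() {
  // {4.0, NULL, NaN, 3.0} sorts to {3.0, 4.0, NaN, NULL}.
  std::vector<std::optional<float>> v = {
      4.0f, std::nullopt, std::nanf(""), 3.0f};
  std::sort(v.begin(), v.end(), ascNaNBeforeNull);
  return 0;
}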

Collaborator commented:

I tried SELECT array_sort(ARRAY ('bc', 'ab', 'dc'), (x, y) -> IF(x < y, 1, IF(x = y, 0, null))) in Spark 3.5 and got the exception below. Would you like to add a unit test for this case to make sure an exception is thrown?

Caused by: org.apache.spark.SparkException: [COMPARATOR_RETURNS_NULL] The comparator has returned a NULL for a comparison between dc and dc. It should return a positive integer for "greater than", 0 for "equal" and a negative integer for "less than". To revert to deprecated behavior where NULL is treated as 0 (equal), you must set "spark.sql.legacy.allowNullComparisonResultInArraySort" to "true".

I also notice Spark requires that the function return an integer type; would you like to confirm?
https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L412-L421

velox/docs/functions/spark/array.rst (resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/functions/lib/ArraySort.h (outdated; resolved)
velox/functions/lib/ArraySort.h (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (outdated; resolved)
velox/functions/lib/SimpleComparisonMatcher.h (outdated; resolved)
rui-mo (Collaborator) left a comment

Thanks for iterating. A few minor comments; the rest looks good!

velox/functions/sparksql/SimpleComparisonMatcher.h (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
:noindex:

Returns the array sorted by values computed using specified lambda in ascending
order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end. ::
Collaborator commented:

Do we need to handle NaN, given that it shouldn't be returned by the lambda function?

Hi @boneanxs, could you provide more details on why NaN shouldn't be returned?

velox/docs/functions/spark/array.rst (outdated; resolved)
velox/docs/functions/spark/array.rst (outdated; resolved)
velox/functions/sparksql/tests/ArraySortTest.cpp (outdated; resolved)
velox/functions/sparksql/tests/SortArrayTest.cpp (outdated; resolved)
velox/functions/sparksql/tests/ArraySortTest.cpp (outdated; resolved)
rui-mo (Collaborator) left a comment

@boneanxs Thanks for iterating. Would you also rebase this PR?

:noindex:

Returns the array sorted by values computed using specified lambda in ascending order. ``U`` must be an orderable type.
Null/NaN elements returned by the lambda function will be placed at the end of the returned array, with NaN elements appearing before Null elements. This functions is not supported in Spark and is only used inside velox. ::
Collaborator commented:

Perhaps clarify the purpose in the document.

used inside velox for rewring :spark:func:`xxx` as :spark:func:`xxx`.

@@ -35,6 +36,24 @@ class ArraySortTest : public SparkFunctionBaseTest {
assertEqualVectors(expected, result);
}

void testArraySort(
const std::string& lamdaExpr,
const bool asc,
Collaborator commented:

nit: drop const when passing by value.

Returns the array sorted by values computed using specified lambda in ascending
order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end.
The function attempts to analyze the lambda function and rewrite it into a simpler call that
specifies the sort-by expression (like :spark:func:`array_sort(array(T), function(T,U)) -> array(T)`). For example, ``(left, right) -> if(length(left) > length(right), 1, if(length(left) < length(right), -1, 0))`` will be rewritten to ``x -> length(x)``. ::
Collaborator commented:

Perhaps clarify the behavior when rewrite is not possible.

@@ -140,5 +163,50 @@ TEST_F(ArraySortTest, constant) {
expected = makeConstantArray<int64_t>(size, {6, 6, 6, 6});
assertEqualVectors(expected, result);
}

TEST_F(ArraySortTest, lambda) {
Collaborator commented:

Would you like to add a test for the case when rewriting is not possible?

:noindex:

Returns the array sorted by values computed using specified lambda in ascending
order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end. ::
Collaborator commented:

I tried SELECT array_sort(ARRAY ('bc', 'ab', 'dc'), (x, y) -> IF(x < y, 1, IF(x = y, 0, null))) in Spark 3.5 and got the exception below. Would you like to add a unit test for this case to make sure an exception is thrown?

Caused by: org.apache.spark.SparkException: [COMPARATOR_RETURNS_NULL] The comparator has returned a NULL for a comparison between dc and dc. It should return a positive integer for "greater than", 0 for "equal" and a negative integer for "less than". To revert to deprecated behavior where NULL is treated as 0 (equal), you must set "spark.sql.legacy.allowNullComparisonResultInArraySort" to "true".

I also notice Spark requires that the function return an integer type; would you like to confirm?
https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L412-L421

boneanxs (Contributor, Author) commented

@rui-mo See apache/incubator-gluten#8526; the added tests have passed.

Also, the query will fall back if the lambda can't be rewritten:

2025-01-14T04:03:49.5216547Z - array_sort with lambda functions
2025-01-14T04:03:49.5478095Z 04:03:49.547 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project, due to: Native validation failed:
2025-01-14T04:03:49.5481453Z Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1380 function:validate, thrown from file:ArraySort.cpp line:573 function:rewriteArraySortCall, reason:array_sort with comparator lambda that cannot be rewritten into a transform is not supported: lambda ROW<x_12:MAP<VARCHAR,INTEGER>,y_13:MAP<VARCHAR,INTEGER>> -> subtract(size("x_12",true),size("y_13",true)).
2025-01-14T04:03:49.5542142Z 04:03:49.553 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project, due to: Native validation failed:
2025-01-14T04:03:49.5545407Z Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1380 function:validate, thrown from file:ArraySort.cpp line:573 function:rewriteArraySortCall, reason:array_sort with comparator lambda that cannot be rewritten into a transform is not supported: lambda ROW<x_12:MAP<VARCHAR,INTEGER>,y_13:MAP<VARCHAR,INTEGER>> -> subtract(size("x_12",true),size("y_13",true)).
2025-01-14T04:03:49.6574296Z 04:03:49.656 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project, due to: Native validation failed:
2025-01-14T04:03:49.6577446Z Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1380 function:validate, thrown from file:ArraySort.cpp line:573 function:rewriteArraySortCall, reason:array_sort with comparator lambda that cannot be rewritten into a transform is not supported: lambda ROW<x:MAP<VARCHAR,INTEGER>,y:MAP<VARCHAR,INTEGER>> -> subtract(size("x",true),size("y",true)).
2025-01-14T04:03:49.6660157Z 04:03:49.665 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project, due to: Native validation failed:
2025-01-14T04:03:49.6663312Z Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1380 function:validate, thrown from file:ArraySort.cpp line:573 function:rewriteArraySortCall, reason:array_sort with comparator lambda that cannot be rewritten into a transform is not supported: lambda ROW<x:MAP<VARCHAR,INTEGER>,y:MAP<VARCHAR,INTEGER>> -> subtract(size("x",true),size("y",true)).

boneanxs requested a review from rui-mo on January 14, 2025 06:50
Returns the array sorted by values computed using specified lambda in ascending
order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end.
The function attempts to analyze the lambda function and rewrite it into a simpler call that
specifies the sort-by expression (like :spark:func:`array_sort(array(T), function(T,U)) -> array(T)`). For example, ``(left, right) -> if(length(left) > length(right), 1, if(length(left) < length(right), -1, 0))`` will be rewritten to ``x -> length(x)``. If rewrite is not possible, a user error will be thrown ::
Collaborator commented:

@boneanxs Thanks for your continued work! Would you document the limitation of the current rewriting with respect to null handling?

Contributor (author) commented:

done

rui-mo changed the title from "Support Spark ArraySort with lambda function" to "feat: Support Spark ArraySort with lambda function" on Jan 24, 2025
boneanxs (Contributor, Author) commented

Hey @rui-mo, any more comments on this?

PHILO-HE (Contributor) left a comment

Just reviewed the added doc. Could you compile the rst file to check that the added content is displayed well (including hyperlinks) in the generated doc? Thanks!


SELECT array_sort(array(1, 2, 3)); -- [1, 2, 3]
SELECT array_sort(array(3, 2, 1)); -- [1, 2, 3]
SELECT array_sort(array(2, 1, NULL); -- [1, 2, NULL]
SELECT array_sort(array(NULL, 1, NULL)); -- [1, NULL, NULL]
SELECT array_sort(array(NULL, 2, 1)); -- [1, 2, NULL]
SELECT array_sort(array(4.0, NULL, float('nan'), 3.0)); -- [3.0, 4.0, NaN, NULL]
SELECT array_sort(array(array(), array(1, 3, NULL), array(NULL, 6), NULL, array(2,1))); -- [[], [NULL, 6], [1, 3, NULL], [2, 1], NULL]
Contributor commented:

Nit: code style, insert a space after , in array(2,1).

:noindex:

Returns the array sorted by values computed using specified lambda in ascending order. ``U`` must be an orderable type.
Null/NaN elements returned by the lambda function will be placed at the end of the returned array, with NaN elements appearing before Null elements. This functions is not supported in Spark and is only used inside velox for rewring :spark:func: array_sort(array(T), function(T,T,U)) as :spark:func: array_sort(array(T), function(T,U)). ::
Contributor commented:

typo: rewring.

Please break this long line into several lines. Ditto for other applicable places.

This functions is not supported in Spark and is only used inside velox for rewring :spark:func: array_sort(array(T), function(T,T,U)) as :spark:func: array_sort(array(T), function(T,U)).

Typo: functions

:noindex:

Returns the array sorted by values computed using specified lambda in ascending
order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end.
Contributor commented:

Suggestion (if my understanding is right):

If the value from the lambda function is NULL, the element will be placed at the end.
->
If the lambda function returns NULL, the corresponding element will be placed at the end.

Please note that due to this rewrite optimization, there is a difference in null handling logic between Spark and Velox. In Velox, null elements are always placed at the end of the returned array, whereas in Spark, Java comparison logic is used to sort nulls with other elements. ::

SELECT array_sort(array('cat', 'leopard', 'mouse'), (left, right) -> if(length(left) > length(right), 1, if(length(left) < length(right), -1, 0))); -- ['cat', 'mouse', 'leopard']
select array_sort(array("abcd123", "abcd", null, "abc"), (left, right) -> if(length(left)>length(right), 1, if(length(left)<length(right), -1, 0))); -- ["abc", "abcd", "abcd123", null]
Contributor commented:

code style: leave space before/after > or <.


SELECT array_sort(array('cat', 'leopard', 'mouse'), (left, right) -> if(length(left) > length(right), 1, if(length(left) < length(right), -1, 0))); -- ['cat', 'mouse', 'leopard']
select array_sort(array("abcd123", "abcd", null, "abc"), (left, right) -> if(length(left)>length(right), 1, if(length(left)<length(right), -1, 0))); -- ["abc", "abcd", "abcd123", null]
select array_sort(array("abcd123", "abcd", null, "abc"), (left, right) -> if(length(left)>length(right), 1, if(length(left)=length(right), 0, -1))); -- ["abc", "abcd", "abcd123", null] different with Spark: ["abc", null, "abcd", "abcd123"]
Contributor commented:

Ditto.

order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end.
The function attempts to analyze the lambda function and rewrite it into a simpler call that
specifies the sort-by expression (like :spark:func:`array_sort(array(T), function(T,U)) -> array(T)`). For example, ``(left, right) -> if(length(left) > length(right), 1, if(length(left) < length(right), -1, 0))`` will be rewritten to ``x -> length(x)``. If rewrite is not possible, a user error will be thrown.
Please note that due to this rewrite optimization, there is a difference in null handling logic between Spark and Velox. In Velox, null elements are always placed at the end of the returned array, whereas in Spark, Java comparison logic is used to sort nulls with other elements. ::
Contributor commented:

Suggestion:

whereas in Spark, it depends on the comparison logic to compare null with other elements.

rui-mo (Collaborator) left a comment

Hi @boneanxs, I added some nits. And do we have feedback for #10138 (comment)?

@@ -134,13 +134,34 @@ Array Functions
.. spark:function:: array_sort(array(E)) -> array(E)

Returns an array which has the sorted order of the input array(E). The elements of array(E) must
be orderable. Null elements will be placed at the end of the returned array. ::
be orderable. Null/NaN elements will be placed at the end of the returned array, with NaN elements appearing before Null elements for float types. ::
Collaborator commented:

nit:
Null -> NULL
Null/NaN -> NULL and NaN
float types -> floating-point types

Ditto for the others.

@@ -134,13 +134,34 @@ Array Functions
.. spark:function:: array_sort(array(E)) -> array(E)

Returns an array which has the sorted order of the input array(E). The elements of array(E) must
be orderable. Null elements will be placed at the end of the returned array. ::
be orderable. Null/NaN elements will be placed at the end of the returned array, with NaN elements appearing before Null elements for float types. ::

SELECT array_sort(array(1, 2, 3)); -- [1, 2, 3]
SELECT array_sort(array(3, 2, 1)); -- [1, 2, 3]
SELECT array_sort(array(2, 1, NULL); -- [1, 2, NULL]
Collaborator commented:

One ) is missing in this example. Not your change, but could you help fix it? Thanks.

:noindex:

Returns the array sorted by values computed using specified lambda in ascending order. ``U`` must be an orderable type.
Null/NaN elements returned by the lambda function will be placed at the end of the returned array, with NaN elements appearing before Null elements. This functions is not supported in Spark and is only used inside velox for rewring :spark:func: array_sort(array(T), function(T,T,U)) as :spark:func: array_sort(array(T), function(T,U)). ::
Collaborator commented:

for rewring :spark:func: array_sort(array(T), function(T,T,U)) as :spark:func: array_sort(array(T), function(T,U))

The two references do not work well. Perhaps refer to the style used for array_sort on L159.

Labels: CLA Signed
4 participants