[SPARK-45171][SQL] Initialize non-deterministic expressions in `GenerateExec` #42933

bersprockets · 2023-09-14T23:02:16Z

What changes were proposed in this pull request?

Before evaluating the generator function in GenerateExec, initialize non-deterministic expressions.

Why are the changes needed?

The following query fails:

select *
from explode(
  transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22)
);

23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497)
	at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495)
	at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35)
	at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543)
	at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
	at org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062)
	at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275)
	at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274)
	at org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308)
	at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
...

However, this query succeeds:

select *
from explode(
  sequence(0, cast(rand()*1000 as int) + 1)
);

0
1
2
3
...
801
802
803

The difference is that transform turns off whole-stage codegen, so the query falls back to the interpreted path of GenerateExec. That path has a bug: the non-deterministic expression passed to the generator function is not initialized before it is evaluated.

This PR fixes the bug in GenerateExec.
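
As a rough illustration of the failure and of the shape of the fix, here is a minimal, Spark-free sketch. Every type in it (`Expr`, `NondeterministicExpr`, `RandExpr`, `SequenceUpTo`, `GenerateSketch`) is a hypothetical stand-in written for this example, not Catalyst's real classes. It only mirrors the contract visible in the stack trace above: a non-deterministic expression must have `initialize(partitionIndex)` called before `eval`. The actual change in GenerateExec may differ in detail.

```scala
import scala.util.{Random, Try}

// Toy expression tree; a stand-in for Catalyst expressions (assumption).
trait Expr {
  def children: Seq[Expr]
  def eval(): Any
  // Visit this node and all descendants.
  def foreach(f: Expr => Unit): Unit = {
    f(this)
    children.foreach(_.foreach(f))
  }
}

// Mirrors the "initialize before eval" contract from the error message.
trait NondeterministicExpr extends Expr {
  private var initialized = false
  protected def initializeInternal(partitionIndex: Int): Unit
  protected def evalInternal(): Any

  final def initialize(partitionIndex: Int): Unit = {
    initializeInternal(partitionIndex)
    initialized = true
  }

  final def eval(): Any = {
    require(initialized,
      s"Nondeterministic expression ${getClass.getSimpleName} should be initialized before eval.")
    evalInternal()
  }
}

// Hypothetical rand(): seeds its generator per partition.
final class RandExpr(seed: Long) extends NondeterministicExpr {
  private var rng: Random = null
  val children: Seq[Expr] = Nil
  protected def initializeInternal(partitionIndex: Int): Unit = {
    rng = new Random(seed + partitionIndex)
  }
  protected def evalInternal(): Any = rng.nextDouble()
}

// Hypothetical sequence(0, cast(child * 1000 as int) + 1).
final class SequenceUpTo(child: Expr) extends Expr {
  val children: Seq[Expr] = Seq(child)
  def eval(): Any = {
    val upper = (child.eval().asInstanceOf[Double] * 1000).toInt + 1
    (0 to upper).toVector
  }
}

object GenerateSketch {
  // Interpreted evaluation with no initialization step: fails on RandExpr.
  def evalWithoutInit(generator: Expr): Try[Any] = Try(generator.eval())

  // Initialize every non-deterministic node for this partition, then evaluate.
  def evalWithInit(generator: Expr, partitionIndex: Int): Any = {
    generator.foreach {
      case n: NondeterministicExpr => n.initialize(partitionIndex)
      case _ => ()
    }
    generator.eval()
  }
}
```

In this sketch, `GenerateSketch.evalWithoutInit(new SequenceUpTo(new RandExpr(42L)))` returns a `Failure` wrapping an `IllegalArgumentException`, while `evalWithInit` on the same tree with partition index 0 returns a vector starting at 0. Walking the whole tree matters because the non-deterministic node sits below deterministic parents, just as `rand()` sits inside `sequence` and `transform` in the failing query.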

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala

HyukjinKwon

I know this initialize(index) thing has many holes. I am fine with this as a band-aid fix for now; I did the same thing before in EvalPythonUDTFExec.

HyukjinKwon · 2023-09-15T04:22:30Z

Merged to master and branch-3.5.

Closes #42933 from bersprockets/nondeterm_issue.

Lead-authored-by: Bruce Robbins
Co-authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon
(cherry picked from commit e097f91)

bersprockets added 3 commits September 14, 2023 09:03

testing

35be770

update

cb41e62

Update test name

71c9583

github-actions bot added the SQL label Sep 14, 2023

HyukjinKwon reviewed Sep 14, 2023

Update sql/core/src/main/scala/org/apache/spark/sql/execution/Generat…

fdc4afa

…eExec.scala

HyukjinKwon approved these changes Sep 15, 2023

HyukjinKwon closed this in e097f91 Sep 15, 2023

bersprockets deleted the nondeterm_issue branch October 13, 2023 18:47
