Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-45171][SQL] Initialize non-deterministic expressions in GenerateExec #42933

Closed
wants to merge 4 commits into from

Conversation

bersprockets
Copy link
Contributor

What changes were proposed in this pull request?

Before evaluating the generator function in GenerateExec, initialize non-deterministic expressions.

Why are the changes needed?

The following query fails:

select *
from explode(
  transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22)
);

23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497)
	at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495)
	at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35)
	at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543)
	at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
	at org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062)
	at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275)
	at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274)
	at org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308)
	at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
...        

However, this query succeeds:

select *
from explode(
  sequence(0, cast(rand()*1000 as int) + 1)
);

0
1
2
3
...
801
802
803

The difference is that transform turns off whole-stage codegen, which exposes a bug in GenerateExec in which the non-deterministic expression passed to the generator function is not initialized before being used.

This PR fixes the bug in GenerateExec.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Sep 14, 2023
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this initialize(index) thing has many wholes. I am fine with this as a bandaid fix for now - I did it before too in EvalPythonUDTFExec.

@HyukjinKwon
Copy link
Member

Merged to master and branch-3.5.

HyukjinKwon added a commit that referenced this pull request Sep 15, 2023
…ateExec`

### What changes were proposed in this pull request?

Before evaluating the generator function in `GenerateExec`, initialize non-deterministic expressions.

### Why are the changes needed?

The following query fails:
```
select *
from explode(
  transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22)
);

23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497)
	at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495)
	at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35)
	at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543)
	at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
	at org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062)
	at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275)
	at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274)
	at org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308)
	at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
...
```
However, this query succeeds:
```
select *
from explode(
  sequence(0, cast(rand()*1000 as int) + 1)
);

0
1
2
3
...
801
802
803
```
The difference is that `transform` turns off whole-stage codegen, which exposes a bug in `GenerateExec` in which the non-deterministic expression passed to the generator function is not initialized before being used.

This PR fixes the bug in `GenerateExec`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42933 from bersprockets/nondeterm_issue.

Lead-authored-by: Bruce Robbins <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit e097f91)
Signed-off-by: Hyukjin Kwon <[email protected]>
@bersprockets bersprockets deleted the nondeterm_issue branch October 13, 2023 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants