[SPARK-37915][SQL] Combine unions if there is a project between them
### What changes were proposed in this pull request?
This PR makes `CombineUnions` combine adjacent unions even when there is a `Project` between them. For example:
```scala
spark.range(1).selectExpr("CAST(id AS decimal(18, 1)) AS id").write.saveAsTable("t1")
spark.range(2).selectExpr("CAST(id AS decimal(18, 2)) AS id").write.saveAsTable("t2")
spark.range(3).selectExpr("CAST(id AS decimal(18, 3)) AS id").write.saveAsTable("t3")
spark.range(4).selectExpr("CAST(id AS decimal(18, 4)) AS id").write.saveAsTable("t4")
spark.range(5).selectExpr("CAST(id AS decimal(18, 5)) AS id").write.saveAsTable("t5")
spark.sql("SELECT id FROM t1 UNION SELECT id FROM t2 UNION SELECT id FROM t3 UNION SELECT id FROM t4 UNION SELECT id FROM t5").explain(true)
```
Before this PR:
```
== Optimized Logical Plan ==
Aggregate [id#36], [id#36]
+- Union false, false
   :- Aggregate [id#34], [cast(id#34 as decimal(22,5)) AS id#36]
   :  +- Union false, false
   :     :- Aggregate [id#32], [cast(id#32 as decimal(21,4)) AS id#34]
   :     :  +- Union false, false
   :     :     :- Aggregate [id#30], [cast(id#30 as decimal(20,3)) AS id#32]
   :     :     :  +- Union false, false
   :     :     :     :- Project [cast(id#25 as decimal(19,2)) AS id#30]
   :     :     :     :  +- Relation default.t1[id#25] parquet
   :     :     :     +- Project [cast(id#26 as decimal(19,2)) AS id#31]
   :     :     :        +- Relation default.t2[id#26] parquet
   :     :     +- Project [cast(id#27 as decimal(20,3)) AS id#33]
   :     :        +- Relation default.t3[id#27] parquet
   :     +- Project [cast(id#28 as decimal(21,4)) AS id#35]
   :        +- Relation default.t4[id#28] parquet
   +- Project [cast(id#29 as decimal(22,5)) AS id#37]
      +- Relation default.t5[id#29] parquet
```
After this PR:
```
== Optimized Logical Plan ==
Aggregate [id#36], [id#36]
+- Union false, false
   :- Project [cast(id#25 as decimal(22,5)) AS id#36]
   :  +- Relation default.t1[id#25] parquet
   :- Project [cast(id#26 as decimal(22,5)) AS id#46]
   :  +- Relation default.t2[id#26] parquet
   :- Project [cast(id#27 as decimal(22,5)) AS id#45]
   :  +- Relation default.t3[id#27] parquet
   :- Project [cast(id#28 as decimal(22,5)) AS id#44]
   :  +- Relation default.t4[id#28] parquet
   +- Project [cast(id#29 as decimal(22,5)) AS id#37]
      +- Relation default.t5[id#29] parquet
```
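The flattening idea can be illustrated with a toy model. This is only a sketch and not the code in `CombineUnions`: `Plan`, `Relation`, `Project`, `Union`, and `CombineUnionsSketch` are hypothetical stand-ins for Catalyst's classes, a `Project` is reduced to a single cast target, and the `Distinct`/`Aggregate` handling from the example is left out. It assumes only that a projection over a union can be applied to each union child instead.

```scala
// Toy logical plan; these are NOT Spark's Catalyst classes.
sealed trait Plan
case class Relation(name: String) extends Plan
// Models `Project [cast(col AS castTo)]` over a child plan.
case class Project(castTo: String, child: Plan) extends Plan
case class Union(children: Seq[Plan]) extends Plan

object CombineUnionsSketch {
  // Flatten nested unions, including unions hidden directly under a Project.
  def flatten(plan: Plan): Plan = plan match {
    case u: Union          => Union(flattenChild(u))
    case Project(t, child) => Project(t, flatten(child))
    case r: Relation       => r
  }

  // Expand one union child into the list of leaves it contributes.
  private def flattenChild(child: Plan): Seq[Plan] = child match {
    // Directly nested union: splice its children into the parent.
    case Union(grandchildren) =>
      grandchildren.flatMap(flattenChild)
    // Project over a union: apply the Project to each child, then splice.
    case Project(target, Union(grandchildren)) =>
      grandchildren.flatMap(flattenChild).map(Project(target, _))
    // Anything else stays as a single child.
    case other => Seq(other)
  }
}
```

In this sketch the pushed-down `Project` simply wraps each child, so a child that already had a cast ends up with two stacked `Project` nodes. Merging those into the single cast seen in the plan above is a separate step that the sketch does not model.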
### Why are the changes needed?
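Improve query performance by reducing shuffles: in the example above, the four `Aggregate` nodes, each of which typically needs its own shuffle, collapse into one.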
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #35214 from wangyum/SPARK-37915.
Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>