
[SPARK-37483][SQL] Support push down top N to JDBC data source V2 #34738

Closed
wants to merge 16 commits into from

Conversation

beliefer
Contributor

What changes were proposed in this pull request?

Currently, Spark supports pushing down LIMIT to a data source.
In practice, however, a LIMIT usually comes with the premise of an ORDER BY, because LIMIT and ORDER BY are far more valuable together.

On the other hand, pushing down top N (the same as ORDER BY ... LIMIT N) lets the data source return data that already has a basic order, so the sort on the Spark side may see some performance improvement.
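As a hedged illustration (plain Java on a local list, not Spark code), "top N" means applying the sort and the limit as a single operation:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class TopNDemo {
    // "Top N" = ORDER BY ... LIMIT n as one operation: sort, then keep only
    // the first n rows. This is the operation the PR asks the database to
    // perform, instead of Spark fetching all rows and sorting them itself.
    static List<Double> topN(List<Double> salaries, int n) {
        return salaries.stream()
                .sorted(Comparator.<Double>reverseOrder()) // ORDER BY salary DESC
                .limit(n)                                  // LIMIT n
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topN(Arrays.asList(9000.0, 12000.0, 10000.0), 2));
    }
}
```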

Why are the changes needed?

  1. Pushing down top N is very useful in user scenarios.
  2. Pushing down top N can improve the performance of the sort.

Does this PR introduce any user-facing change?

No. This only changes the physical execution.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Nov 29, 2021
@SparkQA

SparkQA commented Nov 29, 2021

Test build #145706 has finished for PR 34738 at commit 4872dcc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50176/

@SparkQA

SparkQA commented Nov 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50182/

@SparkQA

SparkQA commented Nov 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50176/

@SparkQA

SparkQA commented Nov 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50182/

@SparkQA

SparkQA commented Nov 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50184/

case _ =>
child transform {
case sort @ Sort(order, _, ScanOperation(_, filter, sHolder: ScanBuilderHolder))
if filter.length == 0 =>
Contributor

How about using filter.isEmpty?

/**
* Pushes down top N to the data source.
*/
boolean pushTopN(SortValue[] orders, int limit);
Contributor

It is a little strange that pushTopN returns a Boolean. How about adding two methods, pushTopN and pushedTopN? That way, the responsibilities of each method are cleaner. FYI

Contributor

or probably we should have a new interface SupportsPushDownTopN, as Spark can either push down limit, or top n, but not both.
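A minimal sketch (plain Java, with a String[] stand-in for the real sort-order type) of what such a dedicated mix-in could look like:

```java
// Sketch of the suggested dedicated interface. The orders parameter is a
// String[] stand-in here; a real API would use a proper sort-order type.
interface SupportsPushDownTopN {
    // Returns true only if the source accepts the complete top N
    // (both the sort orders and the limit), never just one of them.
    boolean pushTopN(String[] orders, int limit);
}

// A toy scan builder that records what was pushed, for illustration only.
class RecordingScanBuilder implements SupportsPushDownTopN {
    String[] pushedOrders;
    Integer pushedLimit;

    @Override
    public boolean pushTopN(String[] orders, int limit) {
        this.pushedOrders = orders;
        this.pushedLimit = limit;
        return true;
    }
}
```

Because a scan builder either mixes this in or not, Spark can push down a limit, or a top N, but never both at once, which matches the concern above.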

sHolder.pushedLimit = Some(limitValue)
sHolder.sortValues = orders
}
sort
Contributor

The limit with sort has been pushed down. Can the sort node be removed?

Contributor Author

Some databases support partition sort.

Contributor

Yea I think we can remove the sort node from the query plan.

@SparkQA

SparkQA commented Nov 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50184/

@SparkQA

SparkQA commented Nov 29, 2021

Test build #145711 has finished for PR 34738 at commit 90e0250.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2021

Test build #145714 has finished for PR 34738 at commit 63d9168.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21
Contributor

c21 commented Nov 29, 2021

cc @huaxingao FYI.

@huaxingao
Contributor

cc @cloud-fan

@beliefer beliefer changed the title [SPARK-37483][SQL] Support pushdown down top N to JDBC data source V2 [SPARK-37483][SQL] Support push down top N to JDBC data source V2 Nov 30, 2021
@@ -18,6 +18,7 @@
package org.apache.spark.sql.connector.read;

import org.apache.spark.annotation.Evolving;
import org.apache.spark.sql.connector.expressions.SortValue;
Contributor

SortValue is private[sql], we shouldn't expose it in the public APIs. We should use SortOrder instead

@SparkQA

SparkQA commented Nov 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50230/

@SparkQA

SparkQA commented Nov 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50230/

@SparkQA

SparkQA commented Nov 30, 2021

Test build #145758 has finished for PR 34738 at commit 52213ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Dec 1, 2021

ping @cloud-fan

@@ -149,6 +149,7 @@ case class RowDataSourceScanExec(
Map("PushedAggregates" -> seqToString(v.aggregateExpressions),
"PushedGroupByColumns" -> seqToString(v.groupByColumns))} ++
pushedDownOperators.limit.map(value => "PushedLimit" -> s"LIMIT $value") ++
Map("PushedSortOrders" -> seqToString(pushedDownOperators.sortValues)) ++
Contributor

can we use pushedTopN and display both sort orders and limit n here?

@@ -25,4 +26,5 @@ import org.apache.spark.sql.connector.expressions.aggregate.Aggregation
case class PushedDownOperators(
aggregation: Option[Aggregation],
sample: Option[TableSampleInfo],
limit: Option[Int])
limit: Option[Int],
sortValues: Seq[SortOrder])
Contributor

let's add an assert that sortValues can only be present if limit is present
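A plain-Java sketch of that invariant (field names follow the diff above; the real class is a Scala case class, so List&lt;String&gt; stands in for Seq[SortOrder]):

```java
import java.util.List;

// Sketch of the requested invariant: sort orders imply a pushed limit,
// since top N is only meaningful when both were pushed down together.
class PushedDownOperatorsSketch {
    final Integer limit;           // null = no pushed limit
    final List<String> sortValues;

    PushedDownOperatorsSketch(Integer limit, List<String> sortValues) {
        if (!sortValues.isEmpty() && limit == null) {
            throw new IllegalArgumentException(
                "sortValues can only be present if limit is present");
        }
        this.limit = limit;
        this.sortValues = sortValues;
    }
}
```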

val limitPushed = PushDownUtils.pushLimit(sHolder.builder, limitValue)
if (limitPushed) {
sHolder.pushedLimit = Some(limitValue)
}
globalLimit
case _ => globalLimit
case _ =>
child transform {
Contributor

this looks scary, as we don't know what's between limit and sort, and it's dangerous to push down top N.

I think we need to do an exact match: case Sort(..., ScanOperation...)

@@ -370,6 +370,8 @@ abstract class JdbcDialect extends Serializable with Logging{
*/
def supportsLimit(): Boolean = true

def supportsTopN(): Boolean = true
Contributor

is there any database that does not support limit and sort?

Contributor Author

I guess no such database exists.
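For illustration, a dialect could opt out through the hook in the diff roughly like this (class names here are a simplified sketch of the JdbcDialect pattern, not the real hierarchy):

```java
// Simplified sketch of the JdbcDialect capability flags from the diff:
// both default to true, and a dialect overrides what it cannot support.
abstract class DialectSketch {
    boolean supportsLimit() { return true; }
    boolean supportsTopN() { return true; }
}

// A hypothetical dialect that cannot push down ORDER BY ... LIMIT.
class NoTopNDialect extends DialectSketch {
    @Override
    boolean supportsTopN() { return false; }
}
```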

@SparkQA

SparkQA commented Dec 13, 2021

Test build #146132 has finished for PR 34738 at commit 082ded9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
val newChild = pushDownLimit(child, limitValue)
val newLocalLimit = globalLimit.child.asInstanceOf[LocalLimit].withNewChildren(Seq(newChild))
globalLimit
Contributor

Suggested change
globalLimit
globalLimit.copy(child = newLocalLimit)

Contributor

Shall we also update the test to make sure the Sort operator is removed from the query plan?

Contributor Author

OK

@@ -43,6 +44,7 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
.set("spark.sql.catalog.h2.driver", "org.h2.Driver")
.set("spark.sql.catalog.h2.pushDownAggregate", "true")
.set("spark.sql.catalog.h2.pushDownLimit", "true")
.set("spark.sql.catalog.h2.pushDownTopN", "true")
Contributor

this is not needed now.

checkAnswer(df5, Seq(Row(1, "cathy", 9000.00, 1200.0), Row(1, "amy", 10000.00, 1000.0)))

val df6 = spark.read.table("h2.test.employee")
.where($"dept" === 1).limit(1)
Contributor

This has been covered by the limit pushdown tests already.

@SparkQA

SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50682/

@SparkQA

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50682/

checkAnswer(df7, Seq(Row(10000.00, 1000.0, "amy")))
}

private def checkSortRemoved(df: DataFrame, removed: Boolean = true): Unit = {
Contributor

Can we do this in checkPushedLimit? If sortValues is nonEmpty, we check if the sort has been removed from the query plan.

Contributor Author

OK

@SparkQA

SparkQA commented Dec 15, 2021

Test build #146208 has finished for PR 34738 at commit 4f69ee8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50693/

@SparkQA

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50693/

@SparkQA

SparkQA commented Dec 15, 2021

Test build #146219 has finished for PR 34738 at commit eb32095.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50705/

@SparkQA

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50705/

@SparkQA

SparkQA commented Dec 15, 2021

Test build #146231 has finished for PR 34738 at commit eb32095.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a5a0e82 Dec 15, 2021
@@ -726,6 +726,21 @@ object DataSourceStrategy
}
}

protected[sql] def translateSortOrders(sortOrders: Seq[SortOrder]): Seq[SortOrderV2] = {
sortOrders.map {
Member

Hi, All. This broke Scala 2.13 compilation.

[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala:730:20: match may not be exhaustive.
[error] It would fail on the following input: SortOrder(_, _, _, _)
[error]     sortOrders.map {
[error]                    ^
[warn] 24 warnings found
[error] one error found
[error] (sql / Compile / compileIncremental) Compilation failed
[error] Total time: 267 s (04:27), completed Dec 15, 2021 5:57:25 AM

@dongjoon-hyun
Member

Could you make a follow-up to recover Scala 2.13, @beliefer?

@@ -726,6 +726,21 @@ object DataSourceStrategy
}
}

protected[sql] def translateSortOrders(sortOrders: Seq[SortOrder]): Seq[SortOrderV2] = {
sortOrders.map {
case SortOrder(PushableColumnWithoutNestedColumn(name), directionV1, nullOrderingV1, _) =>
Member

@dongjoon-hyun commented Dec 15, 2021

In addition, this will cause scala.MatchError in Scala 2.12. We need a new test case that does not match this case, @beliefer.
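The fix direction can be sketched in plain Java: make the translation total by returning an empty result for inputs the happy path does not handle, instead of failing at runtime (the column check below is only a stand-in for the PushableColumnWithoutNestedColumn extractor in the real code):

```java
import java.util.Optional;

public class SortOrderTranslation {
    // Total translation: every input gets a result. The nested-column
    // check is a stand-in for the real extractor; returning empty for
    // unsupported inputs replaces the MatchError / non-exhaustive match.
    static Optional<String> translate(String column, String direction) {
        if (column != null && !column.contains(".")) {
            return Optional.of(column + " " + direction);
        }
        return Optional.empty();
    }
}
```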

@HyukjinKwon
Member

I'm gonna revert this because it did not pass the tests.

@dongjoon-hyun
Member

Thank you for the decision, @HyukjinKwon !

@beliefer
Contributor Author

I'm gonna revert this because it did not pass the tests.

Could I open another PR to finish this issue? @dongjoon-hyun @HyukjinKwon cc @cloud-fan

@HyukjinKwon
Member

Yeah, sure. Please go ahead. Thanks @beliefer .

@beliefer
Contributor Author

Yeah, sure. Please go ahead. Thanks @beliefer .

Thank you.

@cloud-fan
Contributor

cloud-fan commented Dec 16, 2021

@beliefer your GitHub Actions environment seems broken; I can hardly see a successful run of your GitHub Actions across various PRs, and I have to rely on the Jenkins results. Maybe let's take this chance to fix your GitHub Actions setup. Please ping us in your re-submitted PR, and we can help you investigate if the GitHub Actions consistently fail for unknown reasons.

@beliefer
Contributor Author

@beliefer your GitHub Actions environment seems broken; I can hardly see a successful run of your GitHub Actions across various PRs, and I have to rely on the Jenkins results. Maybe let's take this chance to fix your GitHub Actions setup. Please ping us in your re-submitted PR, and we can help you investigate if the GitHub Actions consistently fail for unknown reasons.

@cloud-fan Thank you for helping me fix the GitHub Actions environment issue.
