
fix: Sort on single struct should fallback to Spark #811

Merged: 1 commit into apache:main on Aug 12, 2024

Conversation

@viirya (Member) commented Aug 11, 2024

Which issue does this PR close?

Closes #807.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.80%. Comparing base (4fe43ad) to head (44855de).

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #811      +/-   ##
============================================
- Coverage     33.94%   33.80%   -0.14%     
+ Complexity      874      870       -4     
============================================
  Files           112      112              
  Lines         42916    42914       -2     
  Branches       9464     9452      -12     
============================================
- Hits          14567    14507      -60     
- Misses        25379    25428      +49     
- Partials       2970     2979       +9     


| Config | Description | Default Value |
|--------|-------------|---------------|
| spark.comet.scan.enabled | Whether to enable Comet scan. When this is turned on, Spark will use Comet to read Parquet data source. Note that to enable native vectorized execution, both this config and 'spark.comet.exec.enabled' need to be enabled. By default, this config is true. | true |
| spark.comet.scan.preFetch.enabled | Whether to enable pre-fetching feature of CometScan. By default is disabled. | false |
| spark.comet.scan.preFetch.threadNum | The number of threads running pre-fetching for CometScan. Effective if spark.comet.scan.preFetch.enabled is enabled. By default it is 2. Note that more pre-fetching threads means more memory requirement to store pre-fetched row groups. | 2 |
| spark.comet.shuffle.preferDictionary.ratio | The ratio of total values to distinct values in a string column to decide whether to prefer dictionary encoding when shuffling the column. If the ratio is higher than this config, dictionary encoding will be used on shuffling string column. This config is effective if it is higher than 1.0. By default, this config is 10.0. Note that this config is only used when `spark.comet.exec.shuffle.mode` is `jvm`. | 10.0 |
| spark.comet.sparkToColumnar.supportedOperatorList | A comma-separated list of operators that will be converted to Comet columnar format when 'spark.comet.sparkToColumnar.enabled' is true | Range,InMemoryTableScan |
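
For context, a minimal sketch (not part of this PR, and assuming Comet is already on the application classpath) of how the configuration entries above might be set on a Spark session:

```scala
// Illustrative only (not part of this PR): how the configuration entries documented
// above might be set. The session setup is generic Spark code; the config keys come
// from the table above.
import org.apache.spark.sql.SparkSession

object CometConfigExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("comet-config-example")
      .master("local[*]")
      // Read Parquet sources through Comet scan.
      .config("spark.comet.scan.enabled", "true")
      // Optional pre-fetching for CometScan, with two pre-fetch threads.
      .config("spark.comet.scan.preFetch.enabled", "true")
      .config("spark.comet.scan.preFetch.threadNum", "2")
      // Prefer dictionary encoding when shuffling string columns whose
      // total/distinct value ratio exceeds 10.0 (JVM shuffle mode only).
      .config("spark.comet.shuffle.preferDictionary.ratio", "10.0")
      .getOrCreate()

    spark.stop()
  }
}
```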
Contributor:

nit: Shall we use ` instead of ' in these config descriptions?

@viirya (Member, Author):

This is not changed by this PR. I think a previous PR changed it but did not update the document.

@viirya (Member, Author):

The document is updated automatically when running `make release` locally.

@viirya viirya merged commit 071c780 into apache:main Aug 12, 2024
75 checks passed
@viirya (Member, Author) commented Aug 12, 2024

Thanks @huaxingao

@viirya viirya deleted the fix_sort branch August 12, 2024 05:48
@@ -2501,6 +2501,13 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde with CometExprShim

    case SortExec(sortOrder, _, child, _)
        if isCometOperatorEnabled(op.conf, CometConf.OPERATOR_SORT) =>
      // TODO: Remove this constraint when we upgrade to new arrow-rs including
      // https://github.com/apache/arrow-rs/pull/6225
      if (child.output.length == 1 && child.output.head.dataType.isInstanceOf[StructType]) {
Member:

As we add support for other types, do we need to update this to make it recursive, so that we also check for Map or Array types containing a struct?

@viirya (Member, Author):

Let me add more data types here according to arrow-rs.
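
A hypothetical sketch of the recursive check discussed above (this is not the code merged in this PR; it only illustrates checking Array and Map types for nested structs, using standard Spark SQL types):

```scala
// Hypothetical sketch, not the merged change: a recursive check that would also
// catch Array and Map types that contain a StructType anywhere in their element
// or key/value types.
import org.apache.spark.sql.types._

object ContainsStructCheck {
  def containsStruct(dt: DataType): Boolean = dt match {
    case _: StructType                  => true
    case ArrayType(elementType, _)      => containsStruct(elementType)
    case MapType(keyType, valueType, _) => containsStruct(keyType) || containsStruct(valueType)
    case _                              => false
  }
}

// Usage mirroring the single-column condition in the diff above:
//   child.output.length == 1 && ContainsStructCheck.containsStruct(child.output.head.dataType)
```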

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
Labels: None yet

Projects: None yet

Development: Successfully merging this pull request may close these issues: "Fallback to Spark if sort on unsupported cases"

4 participants