perf: improve performance of update metrics #1329

wForget · 2025-01-23T03:43:52Z

Which issue does this PR close?

Closes #1328.

Rationale for this change

Improve the performance of metrics updates.

What changes are included in this PR?

Define a NativeMetricNode proto type that passes all metric nodes at once, avoiding iterative JNI calls.
Call update metrics when releasing the plan, to reduce the number of calls.
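
To illustrate the first change, here is a minimal sketch of a metric tree that can be built once on the native side and handed over as a single value, instead of making one boundary crossing per node. The `MetricNode` struct and its fields are hypothetical stand-ins for illustration, not the actual generated NativeMetricNode type.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a NativeMetricNode-style message: one node per
/// operator, holding that operator's metrics plus its child nodes.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct MetricNode {
    pub metrics: HashMap<String, i64>,
    pub children: Vec<MetricNode>,
}

impl MetricNode {
    /// Number of nodes in the tree, including this one. A per-node update
    /// scheme would need this many calls; a batched scheme needs one.
    pub fn node_count(&self) -> usize {
        1 + self.children.iter().map(MetricNode::node_count).sum::<usize>()
    }

    /// Sum one named metric over the whole tree (missing metrics count as 0).
    pub fn total(&self, name: &str) -> i64 {
        let own = self.metrics.get(name).copied().unwrap_or(0);
        own + self.children.iter().map(|c| c.total(name)).sum::<i64>()
    }
}
```

With a tree like this, the cost of crossing the JNI boundary no longer scales with `node_count()`; the receiving side walks the tree locally after a single transfer.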

How are these changes tested?

After this change:

SQL metrics are displayed correctly: (screenshot omitted)

CPU profile: (screenshot omitted)

codecov-commenter · 2025-01-23T05:12:30Z

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 33.92%. Comparing base (f09f8af) to head (958476b).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
...a/org/apache/spark/sql/comet/CometMetricNode.scala	83.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##               main    #1329       +/-   ##
=============================================
- Coverage     56.12%   33.92%   -22.20%     
- Complexity      976      983        +7     
=============================================
  Files           119      125        +6     
  Lines         11743    48515    +36772     
  Branches       2251    10628     +8377     
=============================================
+ Hits           6591    16460     +9869     
- Misses         4012    28725    +24713     
- Partials       1140     3330     +2190


wForget · 2025-01-23T06:07:52Z

Although the share of update metrics in the CPU profile has been greatly reduced, the TPC-DS/TPC-H benchmarks on a small data set have not improved.

andygrove · 2025-01-23T14:37:42Z

native/core/src/execution/metrics/utils.rs

-}
+    // add children
+    spark_plan.children().iter().for_each(|child_plan| {
+        let child_node = to_native_metric_node(child_plan).unwrap();


Can we avoid the unwrap here?

Suggested change

-        let child_node = to_native_metric_node(child_plan).unwrap();
+        let child_node = to_native_metric_node(child_plan)?;
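
One detail worth noting: `?` cannot propagate an error out of a `for_each` closure that returns `()`, so applying the suggestion likely means restructuring the iteration as well. The sketch below shows one way to do that with a plain `for` loop. `PlanNode`, `MetricNode`, `to_metric_node`, and the `String` error type are hypothetical simplifications, not the types used in utils.rs.

```rust
/// Hypothetical, simplified plan node (not the real SparkPlan type).
#[derive(Debug)]
pub struct PlanNode {
    pub name: String,
    pub rows: i64,
    pub children: Vec<PlanNode>,
}

/// Hypothetical, simplified metric node.
#[derive(Debug, PartialEq)]
pub struct MetricNode {
    pub name: String,
    pub rows: i64,
    pub children: Vec<MetricNode>,
}

/// Recursively convert a plan tree, returning the first error encountered
/// instead of panicking.
pub fn to_metric_node(plan: &PlanNode) -> Result<MetricNode, String> {
    if plan.rows < 0 {
        return Err(format!("negative row count in {}", plan.name));
    }
    let mut children = Vec::with_capacity(plan.children.len());
    // A `for` loop keeps `?` in the function body, so the error returns
    // from `to_metric_node` itself rather than from a closure.
    for child in &plan.children {
        children.push(to_metric_node(child)?);
    }
    Ok(MetricNode {
        name: plan.name.clone(),
        rows: plan.rows,
        children,
    })
}
```

An equivalent iterator form is `plan.children.iter().map(to_metric_node).collect::<Result<Vec<_>, _>>()?`, which also stops at the first error.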

andygrove · 2025-01-23T14:38:55Z

@mbutrovich may be interested in reviewing this as well

andygrove · 2025-01-23T14:39:49Z

native/core/src/execution/jni_api.rs

@@ -508,9 +505,6 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_executePlan(
            let next_item = exec_context.stream.as_mut().unwrap().next();
            let poll_output = exec_context.runtime.block_on(async { poll!(next_item) });

-            // Update metrics
-            update_metrics(&mut env, exec_context)?;


I wonder if we should add a config so that we can choose between frequent metrics updates vs just updating once the query completes. It can sometimes be helpful to see live metrics.

Per-batch is probably always overkill. For long-running jobs is there a period that makes sense? It looks like Spark History defaults to 10s.

I do like the idea of updating metrics every N seconds.

I think checking a coarse-grained clock (e.g., CLOCK_MONOTONIC_COARSE) to see whether N seconds have elapsed before producing updated metrics would be a reasonable compromise between performance impact and fresh metrics.
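
The time-based idea discussed in this thread could be sketched roughly as follows. This is only an illustration under stated assumptions: `MetricsThrottle` and `should_update` are hypothetical names, and `std::time::Instant` stands in for the coarse-grained clock, which the standard library does not expose directly.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that decides whether enough time has passed since
/// the last metrics push. Not part of the PR.
pub struct MetricsThrottle {
    interval: Duration,
    last_update: Instant,
}

impl MetricsThrottle {
    /// `start` is the time the plan began executing; the first push happens
    /// once `interval` has elapsed after it.
    pub fn new(interval: Duration, start: Instant) -> Self {
        Self { interval, last_update: start }
    }

    /// Returns true (and records `now`) if at least `interval` has elapsed
    /// since the last push; otherwise returns false and changes nothing.
    pub fn should_update(&mut self, now: Instant) -> bool {
        // saturating_duration_since yields zero if `now` is earlier than
        // `last_update`, so an out-of-order timestamp never panics.
        if now.saturating_duration_since(self.last_update) >= self.interval {
            self.last_update = now;
            true
        } else {
            false
        }
    }
}
```

In a per-batch loop, the caller would check `should_update(Instant::now())` after each batch and push metrics only when it returns true, with one unconditional push when the plan is released. Passing `now` in as a parameter keeps the helper easy to test with fixed timestamps.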

andygrove · 2025-01-23T14:55:46Z

Based on a single run of TPC-H @ 100GB, I see approximately a 2% improvement (325s on main vs 318s with this PR).

wForget changed the title ~~Improve performance of update metrics~~ perf: improve performance of update metrics Jan 23, 2025

wForget added 2 commits January 23, 2025 12:07

Improve performance of update metrics

e2c0178

fix style

958476b

wForget force-pushed the COMET-1328 branch from 590fb65 to 958476b Compare January 23, 2025 04:08

fix

8c5724d

wForget force-pushed the COMET-1328 branch from a5df4f1 to 8c5724d Compare January 23, 2025 05:34

fix

642c737

andygrove reviewed Jan 23, 2025
