
emit minion task generation time and error metrics #10026

Merged
merged 2 commits into apache:master on Dec 23, 2022

Conversation

zhtaoxiang
Contributor

Report two new metrics: (1) the time in ms since the last successful minion task generation, and (2) whether the last minion task generation encountered an error. They are helpful for tracking and debugging minion tasks.

@codecov-commenter

codecov-commenter commented Dec 23, 2022

Codecov Report

Merging #10026 (0b34a17) into master (36c82b6) will increase coverage by 37.82%.
The diff coverage is 62.50%.

@@              Coverage Diff              @@
##             master   #10026       +/-   ##
=============================================
+ Coverage     32.54%   70.37%   +37.82%     
- Complexity      200     5692     +5492     
=============================================
  Files          1992     1994        +2     
  Lines        107339   107475      +136     
  Branches      16318    16330       +12     
=============================================
+ Hits          34935    75633    +40698     
+ Misses        69355    26558    -42797     
- Partials       3049     5284     +2235     
Flag Coverage Δ
integration1 24.99% <62.50%> (?)
integration2 24.29% <62.50%> (+0.06%) ⬆️
unittests1 67.94% <0.00%> (?)
unittests2 13.60% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...controller/helix/core/minion/PinotTaskManager.java 67.61% <57.14%> (+11.30%) ⬆️
...g/apache/pinot/common/metrics/ControllerGauge.java 98.14% <100.00%> (+0.07%) ⬆️
...transform/function/IsNotNullTransformFunction.java 65.51% <0.00%> (-3.45%) ⬇️
...re/accounting/PerQueryCPUMemAccountantFactory.java 66.77% <0.00%> (-2.38%) ⬇️
...manager/realtime/HLRealtimeSegmentDataManager.java 77.98% <0.00%> (-2.30%) ⬇️
...not/broker/broker/helix/ClusterChangeMediator.java 75.26% <0.00%> (-2.16%) ⬇️
...pache/pinot/core/query/utils/idset/EmptyIdSet.java 25.00% <0.00%> (ø)
...ot/common/function/scalar/ComparisonFunctions.java 42.85% <0.00%> (ø)
...anager/realtime/SegmentBuildTimeLeaseExtender.java 63.23% <0.00%> (ø)
.../segment/local/customobject/PinotFourthMoment.java 96.49% <0.00%> (ø)
... and 1234 more


@@ -53,6 +53,8 @@ public enum ControllerGauge implements AbstractMetrics.Gauge {
DISABLED_TABLE_COUNT("TableCount", true),
PERIODIC_TASK_NUM_TABLES_PROCESSED("PeriodicTaskNumTablesProcessed", true),
TIME_MS_SINCE_LAST_MINION_TASK_METADATA_UPDATE("TimeMsSinceLastMinionTaskMetadataUpdate", false),
TIME_MS_SINCE_LAST_SUCCESSFUL_MINION_TASK_GENERATION("TimeMsSinceLastSuccessfulMinionTaskGeneration", false),
LAST_MINION_TASK_GENERATION_ENCOUNTERS_ERROR("LastMinionTaskGenerationEncountersError", false),
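The first of the two new gauges reports elapsed time, so its value grows between emissions. A minimal self-contained sketch of that semantic (this is an illustration, not Pinot's implementation; the class and method names here are hypothetical):

```java
// Sketch of what TimeMsSinceLastSuccessfulMinionTaskGeneration reports:
// milliseconds elapsed since the last successful generation run.
// All names below are hypothetical, chosen only for this illustration.
public class TimeSinceLastSuccess {
  private long lastSuccessfulGenerationTimestampMs;

  public TimeSinceLastSuccess(long initialTimestampMs) {
    lastSuccessfulGenerationTimestampMs = initialTimestampMs;
  }

  // Called whenever a generation run completes successfully.
  public void onSuccessfulGeneration(long nowMs) {
    lastSuccessfulGenerationTimestampMs = nowMs;
  }

  // The gauge value: grows until the next successful run resets the baseline.
  public long gaugeValue(long nowMs) {
    return nowMs - lastSuccessfulGenerationTimestampMs;
  }
}
```

A steadily growing value on this gauge is the signal that task generation has stopped succeeding.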
Contributor

What's the main goal of this event? Are we trying to indicate that at a certain time we had an issue with task generation? I think many of the existing error-related metrics (query errors, LLC commit errors, etc.) are emitted as meters. Have you considered both?

I think the main difference is:

  • A gauge will show 1 from the time of the error until the next successful run.
  • A meter will show one spike in the graph, and each spike indicates an unsuccessful run.
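The distinction above can be illustrated with a toy time series (assumptions only, not Pinot code): at each tick either a run happens ('S' success, 'F' failure) or nothing happens ('-'). The gauge holds its last value between runs, while the meter only spikes on the tick where a failure occurs:

```java
// Toy contrast of gauge vs. meter emission for error reporting.
// 'S' = successful run, 'F' = failed run, '-' = no run at that tick.
public class GaugeVsMeter {
  // Gauge: set to 1 on failure, 0 on success, and held between runs.
  static long[] gaugeSeries(String timeline) {
    long[] out = new long[timeline.length()];
    long gauge = 0;
    for (int i = 0; i < timeline.length(); i++) {
      char c = timeline.charAt(i);
      if (c == 'F') gauge = 1;
      else if (c == 'S') gauge = 0;
      out[i] = gauge; // on '-', the last value persists
    }
    return out;
  }

  // Meter (simplified): one tick per failed run, zero otherwise.
  static long[] meterSeries(String timeline) {
    long[] out = new long[timeline.length()];
    for (int i = 0; i < timeline.length(); i++) {
      out[i] = timeline.charAt(i) == 'F' ? 1 : 0;
    }
    return out;
  }
}
```

For the timeline "F--S", the gauge reads 1, 1, 1, 0 (the error stays visible until the success), while the meter reads 1, 0, 0, 0 (a single spike at the failure).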

Contributor Author

Yes, we are trying to indicate that we had issues with task generation.

We use a gauge here because:

  1. Normally, task generation is not a frequent event, so we don't need to care about the rate.
  2. We want to keep the error until the next successful run, so that we know the problem has not been fixed. A successful run clears the error to indicate the problem no longer exists.

@@ -541,20 +541,35 @@ private String scheduleTask(PinotTaskGenerator taskGenerator, List<TableConfig>
generateTasks() returns a list of TaskGeneratorMostRecentRunInfo for each table
*/
pinotTaskConfigs = taskGenerator.generateTasks(enabledTableConfigs);
long successRunTs = System.currentTimeMillis();
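The diff above captures the timestamp right after generateTasks() returns, so the value is only recorded on success. A self-contained sketch of that control flow (not Pinot's actual code; the interface and field names here are assumptions, and the error gauge is modeled as a plain long):

```java
import java.util.Collections;
import java.util.List;

// Sketch of the scheduling flow implied by the diff: record the success
// timestamp only when generation succeeds, and raise the error gauge when
// generation throws. All names except generateTasks() are hypothetical.
public class ScheduleSketch {
  interface Generator {
    List<String> generateTasks() throws Exception;
  }

  long lastSuccessTimestampMs = 0;
  long errorGauge = 0;

  List<String> schedule(Generator generator, long nowMs) {
    try {
      List<String> tasks = generator.generateTasks();
      lastSuccessTimestampMs = nowMs; // "successRunTimestamp" in the PR
      errorGauge = 0;                 // success clears the error gauge
      return tasks;
    } catch (Exception e) {
      errorGauge = 1;                 // stays 1 until the next success
      return Collections.emptyList();
    }
  }
}
```

Because the timestamp is taken after the call returns, a failed run leaves both the timestamp and the "time since last success" gauge reflecting the last run that actually succeeded.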
Contributor

What does Ts stand for? TaskSchedule? Let's just use the full name for clarity.

Contributor Author

Ts stands for timestamp. Let me change it to the full name.

Contributor

@snleee snleee left a comment


LGTM thanks!

@snleee
Contributor

snleee commented Dec 23, 2022

@zhtaoxiang Can you briefly take a look into Trivy? I haven't seen this before.

@snleee snleee merged commit 880a5c7 into apache:master Dec 23, 2022