
emit minion task generation time and error metrics #10026

Merged
merged 2 commits into apache:master on Dec 23, 2022

Conversation

zhtaoxiang
Contributor

Report two new metrics: (1) the time in ms since the last successful minion task generation, and (2) whether the last minion task generation encountered an error. They are helpful for tracking and debugging minion tasks.

@codecov-commenter

codecov-commenter commented Dec 23, 2022

Codecov Report

Merging #10026 (0b34a17) into master (36c82b6) will increase coverage by 37.82%.
The diff coverage is 62.50%.

@@              Coverage Diff              @@
##             master   #10026       +/-   ##
=============================================
+ Coverage     32.54%   70.37%   +37.82%     
- Complexity      200     5692     +5492     
=============================================
  Files          1992     1994        +2     
  Lines        107339   107475      +136     
  Branches      16318    16330       +12     
=============================================
+ Hits          34935    75633    +40698     
+ Misses        69355    26558    -42797     
- Partials       3049     5284     +2235     
Flag Coverage Δ
integration1 24.99% <62.50%> (?)
integration2 24.29% <62.50%> (+0.06%) ⬆️
unittests1 67.94% <0.00%> (?)
unittests2 13.60% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...controller/helix/core/minion/PinotTaskManager.java 67.61% <57.14%> (+11.30%) ⬆️
...g/apache/pinot/common/metrics/ControllerGauge.java 98.14% <100.00%> (+0.07%) ⬆️
...transform/function/IsNotNullTransformFunction.java 65.51% <0.00%> (-3.45%) ⬇️
...re/accounting/PerQueryCPUMemAccountantFactory.java 66.77% <0.00%> (-2.38%) ⬇️
...manager/realtime/HLRealtimeSegmentDataManager.java 77.98% <0.00%> (-2.30%) ⬇️
...not/broker/broker/helix/ClusterChangeMediator.java 75.26% <0.00%> (-2.16%) ⬇️
...pache/pinot/core/query/utils/idset/EmptyIdSet.java 25.00% <0.00%> (ø)
...ot/common/function/scalar/ComparisonFunctions.java 42.85% <0.00%> (ø)
...anager/realtime/SegmentBuildTimeLeaseExtender.java 63.23% <0.00%> (ø)
.../segment/local/customobject/PinotFourthMoment.java 96.49% <0.00%> (ø)
... and 1234 more


@@ -53,6 +53,8 @@ public enum ControllerGauge implements AbstractMetrics.Gauge {
DISABLED_TABLE_COUNT("TableCount", true),
PERIODIC_TASK_NUM_TABLES_PROCESSED("PeriodicTaskNumTablesProcessed", true),
TIME_MS_SINCE_LAST_MINION_TASK_METADATA_UPDATE("TimeMsSinceLastMinionTaskMetadataUpdate", false),
TIME_MS_SINCE_LAST_SUCCESSFUL_MINION_TASK_GENERATION("TimeMsSinceLastSuccessfulMinionTaskGeneration", false),
LAST_MINION_TASK_GENERATION_ENCOUNTERS_ERROR("LastMinionTaskGenerationEncountersError", false),
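The first of the two new gauges reports elapsed time, so its value grows between emissions. A minimal self-contained sketch of that semantic (this is an illustration, not Pinot's implementation; the class and method names here are hypothetical):

```java
// Sketch of what TimeMsSinceLastSuccessfulMinionTaskGeneration reports:
// milliseconds elapsed since the last successful generation run.
// All names below are hypothetical, chosen only for this illustration.
public class TimeSinceLastSuccess {
  private long lastSuccessfulGenerationTimestampMs;

  public TimeSinceLastSuccess(long initialTimestampMs) {
    lastSuccessfulGenerationTimestampMs = initialTimestampMs;
  }

  // Called whenever a generation run completes successfully.
  public void onSuccessfulGeneration(long nowMs) {
    lastSuccessfulGenerationTimestampMs = nowMs;
  }

  // The gauge value: grows until the next successful run resets the baseline.
  public long gaugeValue(long nowMs) {
    return nowMs - lastSuccessfulGenerationTimestampMs;
  }
}
```

A steadily growing value on this gauge is the signal that task generation has stopped succeeding.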
Contributor

What's the main goal of this event? Are we trying to indicate that at a certain time we had an issue with task generation? I think many of the existing error-related metrics (query errors, LLC commit errors, etc.) are emitted as meters. Have you considered both?

I think the main difference is:

  • A gauge will show 1 from the time of the error until the next successful run.
  • A meter will show one spike in the graph, and each spike indicates an unsuccessful run.
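The distinction above can be illustrated with a toy time series (assumptions only, not Pinot code): at each tick either a run happens ('S' success, 'F' failure) or nothing happens ('-'). The gauge holds its last value between runs, while the meter only spikes on the tick where a failure occurs:

```java
// Toy contrast of gauge vs. meter emission for error reporting.
// 'S' = successful run, 'F' = failed run, '-' = no run at that tick.
public class GaugeVsMeter {
  // Gauge: set to 1 on failure, 0 on success, and held between runs.
  static long[] gaugeSeries(String timeline) {
    long[] out = new long[timeline.length()];
    long gauge = 0;
    for (int i = 0; i < timeline.length(); i++) {
      char c = timeline.charAt(i);
      if (c == 'F') gauge = 1;
      else if (c == 'S') gauge = 0;
      out[i] = gauge; // on '-', the last value persists
    }
    return out;
  }

  // Meter (simplified): one tick per failed run, zero otherwise.
  static long[] meterSeries(String timeline) {
    long[] out = new long[timeline.length()];
    for (int i = 0; i < timeline.length(); i++) {
      out[i] = timeline.charAt(i) == 'F' ? 1 : 0;
    }
    return out;
  }
}
```

For the timeline "F--S", the gauge reads 1, 1, 1, 0 (the error stays visible until the success), while the meter reads 1, 0, 0, 0 (a single spike at the failure).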

Contributor Author

Yes, we are trying to indicate that we had issues with task generation.

We use a gauge here because:

  1. Normally, task generation is not a frequent event, so we don't need to care about the rate.
  2. We want to keep the error until the next successful run, so that we know the problem has not been fixed. A successful run clears the error to indicate the problem no longer exists.

@@ -541,20 +541,35 @@ private String scheduleTask(PinotTaskGenerator taskGenerator, List<TableConfig>
generateTasks() returns a list of TaskGeneratorMostRecentRunInfo for each table
*/
pinotTaskConfigs = taskGenerator.generateTasks(enabledTableConfigs);
long successRunTs = System.currentTimeMillis();
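The diff above captures the timestamp right after generateTasks() returns, so the value is only recorded on success. A self-contained sketch of that control flow (not Pinot's actual code; the interface and field names here are assumptions, and the error gauge is modeled as a plain long):

```java
import java.util.Collections;
import java.util.List;

// Sketch of the scheduling flow implied by the diff: record the success
// timestamp only when generation succeeds, and raise the error gauge when
// generation throws. All names except generateTasks() are hypothetical.
public class ScheduleSketch {
  interface Generator {
    List<String> generateTasks() throws Exception;
  }

  long lastSuccessTimestampMs = 0;
  long errorGauge = 0;

  List<String> schedule(Generator generator, long nowMs) {
    try {
      List<String> tasks = generator.generateTasks();
      lastSuccessTimestampMs = nowMs; // "successRunTimestamp" in the PR
      errorGauge = 0;                 // success clears the error gauge
      return tasks;
    } catch (Exception e) {
      errorGauge = 1;                 // stays 1 until the next success
      return Collections.emptyList();
    }
  }
}
```

Because the timestamp is taken after the call returns, a failed run leaves both the timestamp and the "time since last success" gauge reflecting the last run that actually succeeded.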
Contributor

What does Ts stand for? TaskSchedule? Let's just use the full name for clarity.

Contributor Author

Ts stands for timestamp. Let me change it to the full name.

Contributor

@snleee snleee left a comment


LGTM thanks!

@snleee
Contributor

snleee commented Dec 23, 2022

@zhtaoxiang Can you briefly take a look into Trivy? I haven't seen this before.

@snleee snleee merged commit 880a5c7 into apache:master Dec 23, 2022