-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
emit minion task generation time and error metrics #10026
Conversation
Codecov Report
@@ Coverage Diff @@
## master #10026 +/- ##
=============================================
+ Coverage 32.54% 70.37% +37.82%
- Complexity 200 5692 +5492
=============================================
Files 1992 1994 +2
Lines 107339 107475 +136
Branches 16318 16330 +12
=============================================
+ Hits 34935 75633 +40698
+ Misses 69355 26558 -42797
- Partials 3049 5284 +2235
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@@ -53,6 +53,8 @@ public enum ControllerGauge implements AbstractMetrics.Gauge { | |||
DISABLED_TABLE_COUNT("TableCount", true), | |||
PERIODIC_TASK_NUM_TABLES_PROCESSED("PeriodicTaskNumTablesProcessed", true), | |||
TIME_MS_SINCE_LAST_MINION_TASK_METADATA_UPDATE("TimeMsSinceLastMinionTaskMetadataUpdate", false), | |||
TIME_MS_SINCE_LAST_SUCCESSFUL_MINION_TASK_GENERATION("TimeMsSinceLastSuccessfulMinionTaskGeneration", false), | |||
LAST_MINION_TASK_GENERATION_ENCOUNTERS_ERROR("LastMinionTaskGenerationEncountersError", false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the main goal of this event? We are trying to indicate that at certain time we had the issue with the task generation? I think that many existing error related metrics (query error, llc commit error etc) are being emitted as meter. Have you considered both?
I think that the main difference is that
- Gauge will show 1 from the time that we had the error until the next successful run.
- Meter will show one spike in the graph and each spike will indicate the unsuccessful run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we are trying to indicate we had issues with task generation.
We use gauge here because
- Normally, task generation is not a frequent event, we don't need to care about rate
- We want to keep the error until the next successful run so that we know the problem has not been fixed. A successful run can clear out error history to indicate the problem does not exist any more.
@@ -541,20 +541,35 @@ private String scheduleTask(PinotTaskGenerator taskGenerator, List<TableConfig> | |||
generateTasks() return a list of TaskGeneratorMostRecentRunInfo for each table | |||
*/ | |||
pinotTaskConfigs = taskGenerator.generateTasks(enabledTableConfigs); | |||
long successRunTs = System.currentTimeMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does Ts
stands for? TaskSchedule
? Let's just use the full name for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ts
stands for timestamp
. let me change it to the full name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks!
@zhtaoxiang Can you briefly take a look into |
Report (1) time ms since last successful minion task generation (2) whether last minion task generation encounters error as metrics. They are helpful when we are going to track/debug minion tasks.