Adding consumer lag as a metric via a periodic task in the controller #9800
Conversation
Why do we need to emit this from the controller? We can aggregate the metrics that are being emitted from the servers, right?
cc: @sajjad-moradi
Codecov Report
@@              Coverage Diff              @@
##              master    #9800       +/-   ##
==============================================
+ Coverage      24.55%   70.17%   +45.61%
- Complexity        53     5000     +4947
==============================================
  Files           1952     1965       +13
  Lines         104676   105105      +429
  Branches       15856    15904       +48
==============================================
+ Hits           25700    73753    +48053
+ Misses         76347    26211    -50136
- Partials        2629     5141     +2512
We could emit metrics from the servers and then try to compute the aggregate in the monitoring layer. This is hard, because we would have to find the max lag among all replicas that ever existed for a given partition; I am not aware of a way to find the max value among only the current replica set for a given partition. Moreover, when a table is rebalanced, the consuming segments get moved around, which can lead to prolonged stale values that mostly cause noise. We can make this an opt-in periodic task. I added it as a separate task to have better control over its frequency.
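For illustration, here is a minimal sketch of the controller-side aggregation described above; ReplicaLagInfo is a simplified, hypothetical stand-in, not Pinot's actual response model:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified view of one replica's consuming status.
class ReplicaLagInfo {
  final int _partitionId;
  final long _recordsLag;

  ReplicaLagInfo(int partitionId, long recordsLag) {
    _partitionId = partitionId;
    _recordsLag = recordsLag;
  }
}

class LagAggregator {
  // The controller knows the current replica set for each partition, so it can
  // take the max over exactly those replicas -- something a generic monitoring
  // layer cannot easily do once replicas have been moved by a rebalance.
  static Map<Integer, Long> maxLagPerPartition(List<ReplicaLagInfo> currentReplicas) {
    Map<Integer, Long> maxLag = new HashMap<>();
    for (ReplicaLagInfo info : currentReplicas) {
      maxLag.merge(info._partitionId, info._recordsLag, Math::max);
    }
    return maxLag;
  }
}
```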
When a table is rebalanced, each server can get notified, and if it no longer serves a partition, it can remove the corresponding gauge metric.
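A minimal sketch of that cleanup path, assuming a hypothetical per-server gauge registry (Pinot's real server-metrics API is not shown here):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry of per-partition lag gauges on one server.
class PartitionLagGauges {
  private final ConcurrentMap<String, Long> _gauges = new ConcurrentHashMap<>();

  void setLag(String tableNameWithType, int partitionId, long lag) {
    _gauges.put(tableNameWithType + "." + partitionId, lag);
  }

  // Invoked from a (hypothetical) rebalance/state-transition hook when this
  // server stops serving a partition, so the gauge cannot go stale.
  void onPartitionRemoved(String tableNameWithType, int partitionId) {
    _gauges.remove(tableNameWithType + "." + partitionId);
  }
}
```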
This is a useful metric to have. Ideally we'd want the gauge metric to be updated fairly frequently, say once a minute. I'm not sure running the periodic task every minute or so is a good idea!
I believe we do invoke the code to remove a metric each time a partition completes consumption. |
Agreed. We should be emitting this metric every few minutes so as to detect lags quickly and act on them.
Minor comments. LGTM otherwise.
I see that this has already been approved for merge. We intend to submit a PR soon that will handle this at the server level, since we need alerting on lag sooner rather than later. If you choose to, you can wait for that PR before merging this.
Agreed here :) We will likely not set it to query every minute.
Agree that we can detect it sooner, but there doesn't seem to be a good way to aggregate it in the monitoring layer in the presence of rebalances (clean or unclean) or consuming-segment redistribution for any other reason. A much cleaner way would be to emit at the partition level from the connector plugin directly, or from the server (without involving the server tag, but with a stable replica id tag). I believe there are some dependency issues to be sorted out before getting there.
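As a sketch of the partition-level emission suggested here (the naming scheme is an assumption, not the PR's implementation), keying the gauge by table, partition, and a stable replica id keeps it meaningful across rebalances, because the key does not change when the consuming segment moves to another server:

```java
// Hypothetical gauge-name builder; the replica-id tag scheme is illustrative,
// not an existing Pinot convention.
static String lagGaugeName(String tableNameWithType, int partitionId, int replicaId) {
  // e.g. "maxRecordsLag.myTable_REALTIME.3.replica-1" -- unlike a server-host
  // tag, this key survives moving the consuming segment to another server.
  return String.format("maxRecordsLag.%s.%d.replica-%d", tableNameWithType, partitionId, replicaId);
}
```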
This works well in a stable state and with clean operations, but it doesn't cover cases of unclean shutdowns/crashes in production, and it has generally been observed to be not very reliable.
Just for my understanding, can you elaborate on what you mean by unclean shutdown?
Say the server crashed for whatever reason (maybe memory) while the user was rebalancing the table, which moved the consuming segments away to a different server.
Summarizing the discussion (or re-discussion) with @mayankshriv / @snleee / @npawar:
Here is the plan of action:
I would like to keep both options open for use in production so that we can observe how these metrics/handlers work out under various scenarios. @mcvsubbu, if LinkedIn is also working on the lag metrics, can you please share the design and the timeline for this? I want to make sure the design aligns and works with the existing OSS APIs.
[I thought all periodic tasks are opt-in. Just don't configure it at all, or set the time interval to 0 or something like that?] Anyway, yes, we are going to work on it; the timeline is the next few weeks. There is no design doc, but what we will be doing is:
I will let Juan add more details once he has them (or just a PR).
Description
This PR enables Pinot to publish consumer lag as a metric for realtime tables. The lag is emitted via a periodic task in the controller that periodically queries the
/consumingSegmentsInfo
API and records the max consuming lag among each partition's replicas. Currently, it publishes the following metrics:
MAX_RECORDS_LAG
MAX_AVAILABILITY_LAG_MS
The task can be configured to run at a given frequency so as not to overwhelm the server (and, through that, the data source).
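For readers skimming the PR, a condensed sketch of what the periodic task does; fetchConsumingSegmentsInfo and setGauge are hypothetical stubs standing in for the controller's REST client and metrics registry, and the real logic lives in the PR's RealtimeConsumerMonitor:

```java
import java.util.List;
import java.util.Map;

// Simplified stand-in for one replica's entry in the /consumingSegmentsInfo response.
class SegmentConsumerInfo {
  long _recordsLag;          // records behind the latest stream offset
  long _availabilityLagMs;   // ms since the latest consumed record was produced
}

class RealtimeConsumerMonitorSketch {
  void runOnce(String tableNameWithType) {
    // 1. Ask the servers for consuming-segment info, grouped by partition.
    Map<Integer, List<SegmentConsumerInfo>> infoByPartition =
        fetchConsumingSegmentsInfo(tableNameWithType);

    // 2. For each partition, record the max lag across its current replicas.
    for (Map.Entry<Integer, List<SegmentConsumerInfo>> entry : infoByPartition.entrySet()) {
      long maxRecordsLag =
          entry.getValue().stream().mapToLong(i -> i._recordsLag).max().orElse(0L);
      long maxAvailabilityLagMs =
          entry.getValue().stream().mapToLong(i -> i._availabilityLagMs).max().orElse(0L);
      setGauge("MAX_RECORDS_LAG", tableNameWithType, entry.getKey(), maxRecordsLag);
      setGauge("MAX_AVAILABILITY_LAG_MS", tableNameWithType, entry.getKey(), maxAvailabilityLagMs);
    }
  }

  // Stub: would call the /consumingSegmentsInfo REST endpoint.
  Map<Integer, List<SegmentConsumerInfo>> fetchConsumingSegmentsInfo(String table) {
    throw new UnsupportedOperationException("stub");
  }

  // Stub: would forward to the controller's metrics registry.
  void setGauge(String gauge, String table, int partitionId, long value) {
  }
}
```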
Labels:
observability
Release Notes
New gauge metrics: MAX_RECORDS_LAG and MAX_AVAILABILITY_LAG_MS