add pageserver SLO for startup performance: tenant load & time-to-active #4083
Comments
Does this metric depend on tenant size or anything else? For an SLO, it makes sense to remove the "noise" first.
Suspected bottlenecks right now:
Regardless, I think we should aspire to something like this: 5 seconds after restart, all tenants are "Active" or "Broken". I think this is achievable.
Obviously, we won't add alerts that we know we'll break.
It sounds relevant
Relevant to what? To your
or generally relevant?
Generally, to start with (about the SLO).
This is related to #4025.
I am eager to see the distribution of these activations; then I can comment more on whether that makes sense as an SLO.
Edited the description to include alerting on contiguous |
It's a really slow CI day and I am eager to test unrelated code in staging.
Can this be implemented in PromQL? Later remembered: tenant activations that happen as a result of creation will be instant, because there is no other load. I at least wouldn't want them on the same histogram, because then it would say "some large percentile is very fast" even if initial activations on restart took 130s.
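To illustrate the concern about mixing both kinds of activation in one histogram, here is a minimal sketch in plain Python. It is not the pageserver's metrics code: the `ActivationTimes` class and the `"create"` / `"startup"` cause labels are hypothetical names chosen for this example, and the sample counts are made up. Only the 130s figure comes from the comment above.

```python
from collections import defaultdict


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list; q is an integer in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q*n/100
    return ordered[rank - 1]


class ActivationTimes:
    """Hypothetical recorder that keeps activation durations per cause."""

    def __init__(self):
        self._by_cause = defaultdict(list)

    def observe(self, cause, seconds):
        self._by_cause[cause].append(seconds)

    def percentile(self, cause, q):
        # Percentile over one cause only, e.g. only startup activations.
        return percentile(self._by_cause[cause], q)

    def combined_percentile(self, q):
        # Percentile over all causes mixed together (the problematic view).
        mixed = [s for samples in self._by_cause.values() for s in samples]
        return percentile(mixed, q)


times = ActivationTimes()
for _ in range(90):
    times.observe("create", 0.01)    # activations after tenant creation: instant
for _ in range(10):
    times.observe("startup", 130.0)  # activations after a restart: slow

print(times.combined_percentile(90))    # dominated by the fast creations
print(times.percentile("startup", 90))  # the value an SLO on restarts cares about
```

With these made-up samples, the mixed 90th percentile looks fast even though every startup activation is slow, which is the reason for keeping the two populations apart (for example, via a separate metric or a label).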
Closed wrong issue.
I went ahead and added the metrics for this in #4893
We should have a pageserver-level SLO for the time it takes until all tenants of the pageserver have reached state "Active" or "Broken".
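As a rough illustration of what such a pageserver-level measurement could compute, the sketch below derives the time until every tenant has reached "Active" or "Broken" from a list of state transitions. The function name, the `(timestamp, tenant_id, state)` tuple layout, and the `"Loading"` state in the example data are assumptions for this sketch, not the pageserver's actual types.

```python
SETTLED_STATES = {"Active", "Broken"}


def time_to_all_settled(startup_time, transitions, tenant_ids):
    """Return seconds from startup until every tenant is Active or Broken.

    transitions: iterable of (timestamp, tenant_id, state) tuples (assumed shape).
    Returns None if at least one tenant has not settled yet.
    """
    settled_at = {}
    for ts, tenant, state in sorted(transitions):
        # Record only the first time a tenant reaches a settled state.
        if state in SETTLED_STATES and tenant not in settled_at:
            settled_at[tenant] = ts
    if any(t not in settled_at for t in tenant_ids):
        return None
    # With no tenants at all, the pageserver is settled immediately.
    last = max((settled_at[t] for t in tenant_ids), default=startup_time)
    return last - startup_time


example = [
    (100.5, "t1", "Loading"),
    (101.0, "t1", "Active"),
    (103.5, "t2", "Broken"),
    (102.0, "t3", "Active"),
]
print(time_to_all_settled(100.0, example, ["t1", "t2", "t3"]))        # 3.5
print(time_to_all_settled(100.0, example, ["t1", "t2", "t3", "t4"]))  # None
```

The slowest tenant determines the result, so a single stuck tenant keeps the value undefined until it becomes "Active" or "Broken".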
This can be broken down into two metrics:
What to do with the metrics:
1-time of the gauge not exceeding a threshold
Related: