add pageserver SLO for startup performance: tenant load & time-to-active #4083

Open
problame opened this issue Apr 26, 2023 · 12 comments
Labels
a/observability Area: related to observability c/storage/pageserver Component: storage: pageserver

Comments

@problame
Contributor

problame commented Apr 26, 2023

We should have a pageserver-level SLO for the time it takes until all tenants of the pageserver have reached state "Active" or "Broken".

This can be broken down into two metrics:

  1. one gauge metric that is 1 exactly while the tenant loads initiated by tenant::mgr::init are going on
  2. a global histogram that tracks time-to-active
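
A minimal sketch of the two metrics, written with plain in-process stand-ins rather than a real metrics library. The names `StartupMetrics`, `initial_load_in_progress`, and `run_initial_loads` are hypothetical, and measuring time-to-active from the start of the initial load is an assumption; the issue does not pin down either.

```python
import bisect
import time


class StartupMetrics:
    """Hypothetical stand-ins for one gauge and one histogram."""

    def __init__(self, bucket_bounds):
        # Gauge: 1 exactly while the initial tenant loads are going on.
        self.initial_load_in_progress = 0
        # Histogram bucket upper bounds, in seconds.
        self.bucket_bounds = sorted(bucket_bounds)
        # Per-bucket (non-cumulative) counts, plus one overflow bucket.
        self.bucket_counts = [0] * (len(self.bucket_bounds) + 1)

    def observe_time_to_active(self, seconds):
        # Index of the first bucket whose upper bound is >= seconds.
        idx = bisect.bisect_left(self.bucket_bounds, seconds)
        self.bucket_counts[idx] += 1


def run_initial_loads(metrics, tenants, load_tenant, clock=time.monotonic):
    """Load every tenant while holding the gauge at 1."""
    started = clock()
    metrics.initial_load_in_progress = 1
    try:
        for tenant in tenants:
            # Assumed to return once the tenant is "Active" or "Broken".
            load_tenant(tenant)
            metrics.observe_time_to_active(clock() - started)
    finally:
        # Reset the gauge even if a load raises.
        metrics.initial_load_in_progress = 0
```

The `try`/`finally` keeps the gauge from getting stuck at 1 if a load fails, which matters once an alert depends on how long it stays at 1.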

What to do with the metrics:

  • We can then multiply the histogram with the gauge and alert on outliers.
  • Also, we can alert on the contiguous 1-time of the gauge (how long it stays at 1 without interruption) exceeding a threshold
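
To make the second alert concrete, here is a rough sketch of the check over scraped gauge samples. It is plain Python, not a query for any particular monitoring system; the function names and the `(timestamp, value)` sample format are assumptions made for illustration.

```python
def longest_contiguous_one_time(samples):
    """Longest span, in seconds, over which consecutive samples were 1.

    `samples` is a list of (timestamp_seconds, gauge_value) pairs,
    sorted by timestamp.
    """
    longest = 0.0
    run_start = None
    for timestamp, value in samples:
        if value == 1:
            if run_start is None:
                run_start = timestamp  # a new run of 1s begins here
            longest = max(longest, timestamp - run_start)
        else:
            run_start = None  # the run is broken
    return longest


def should_alert(samples, threshold_seconds):
    """True if the gauge stayed at 1 for longer than the threshold."""
    return longest_contiguous_one_time(samples) > threshold_seconds
```

Because the span is measured between the first and last sample of a run, it can underestimate the real duration by up to one scrape interval on each side.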

Related:

@problame added the c/storage/pageserver and a/observability labels on Apr 26, 2023
@vadim2404
Contributor

Does this metric depend on tenant size or any other thing? Because for SLO, it makes sense to remove the "noise" first.

@problame
Contributor Author

> Does this metric depend on tenant size or any other thing?

Suspected bottlenecks right now:

  • fetching the remote index_part.json files
    • network latency is dominant here
    • concurrency limiter says hello as well
  • building the layer maps
    • this is CPU-bound
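
As a back-of-the-envelope illustration of how network latency and the concurrency limiter interact, the sketch below computes a lower bound on the fetch phase. It assumes one index_part.json fetch per tenant and that fetches proceed in full waves; both are simplifications, and the function name is hypothetical.

```python
import math


def estimate_index_fetch_seconds(num_tenants, latency_seconds, concurrency_limit):
    """Lower bound on the time to fetch one index file per tenant.

    Assumes fetches run in waves of `concurrency_limit` and every wave
    costs exactly one round trip of `latency_seconds`.
    """
    if concurrency_limit < 1:
        raise ValueError("concurrency_limit must be at least 1")
    waves = math.ceil(num_tenants / concurrency_limit)
    return waves * latency_seconds
```

With made-up inputs of 10,000 tenants, 50 ms per round trip, and a limit of 100 concurrent fetches, this gives 100 waves, or roughly 5 seconds before any layer map is built.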

Regardless, I think we should aspire to something like 5 seconds after restart, all tenants are "Active" or "Broken".

I think this is achievable.

> Because for SLO, it makes sense to remove the "noise" first.

Obviously, we won't add alerts which we know we'll break.
We'll add the metric, create a dashboard, measure, understand, fix first.

@vadim2404
Contributor

> Regardless, I think we should aspire to something like 5 seconds after restart, all tenants are "Active" or "Broken".

It sounds relevant

@problame
Contributor Author

Relevant to what? To your

> Does this metric depend on tenant size or any other thing? Because for SLO, it makes sense to remove the "noise" first.

or generally relevant?

@vadim2404
Contributor

Generally relevant, as a starting point for the SLO.

@koivunej
Member

This is related to #4025.

@koivunej
Member

I am eager to see the distribution of these activations; then I can comment more on whether that makes sense as an SLO.

@problame
Contributor Author

Edited the description to include alerting on contiguous 1-time of the gauge.

@koivunej
Member

koivunej commented May 29, 2023

It's a really slow CI day and I am eager to test unrelated code in staging. Might as well hack on these two, since I already created the initial load time watching in e879d6c.

> Also, we can alert on the contiguous 1-time of the gauge not exceeding a threshold

Can this be implemented in PromQL?

Later remembered: Tenant activations which happen as a result of creation will be instant, because there is no other load. I at least wouldn't want them on the same histogram because then it will say "some large percentile is very fast", even if on restart init activations would take 130s.
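
A small sketch of why the two kinds of activation should not share a histogram. It keeps observations per cause and computes a nearest-rank percentile over either everything or a selected cause. The class name, the cause labels "init" and "create", and the sample durations are all hypothetical.

```python
import math
from collections import defaultdict


class ActivationTimes:
    """Time-to-active observations, kept separately per activation cause."""

    def __init__(self):
        self.by_cause = defaultdict(list)

    def observe(self, cause, seconds):
        self.by_cause[cause].append(seconds)

    def percentile(self, fraction, causes=None):
        """Nearest-rank percentile over the selected causes (default: all)."""
        selected = list(self.by_cause) if causes is None else causes
        values = sorted(v for c in selected for v in self.by_cause.get(c, []))
        if not values:
            raise ValueError("no observations for the selected causes")
        rank = max(1, math.ceil(fraction * len(values)))
        return values[rank - 1]


times = ActivationTimes()
for seconds in (90.0, 110.0, 130.0):  # made-up slow activations on restart
    times.observe("init", seconds)
for _ in range(7):  # made-up near-instant activations on creation
    times.observe("create", 0.01)

mixed_median = times.percentile(0.5)                  # dominated by creations
init_median = times.percentile(0.5, causes=["init"])  # restart path only
```

In this made-up data the mixed median is 0.01 s while the init-only median is 110 s, so a single shared histogram would hide the slow restart path.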


@koivunej koivunej closed this as completed Aug 4, 2023
@koivunej koivunej reopened this Aug 4, 2023
@koivunej
Member

koivunej commented Aug 4, 2023

Closed wrong issue.

@jcsp
Contributor

jcsp commented Aug 4, 2023

I went ahead and added the metrics for this in #4893
