Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul pageserver/src/metrics.rs #4813

Open
koivunej opened this issue Jul 26, 2023 · 2 comments
Open

Overhaul pageserver/src/metrics.rs #4813

koivunej opened this issue Jul 26, 2023 · 2 comments
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver

Comments

@koivunej
Copy link
Member

Refactor the metrics into a hierarchy, where in the root there is a PageserverMetrics. The default one is stored in a Lazy as are currently all metrics.

TimelineMetrics will need to be created out of &PageserverMetrics (no 'static because the metrics are Arcs).

With these kind of changes, we will be able to unit test metrics without spinning up a new process as long as each TenantHarness has their own PageserverMetrics.

Pre-initializing the metrics will be a one line in fn main(): let metrics = &crate::metrics::METRICS_ROOT;. Final step is to replace the previous with let metrics = Box::new(crate::metrics::PageserverMetrics::new()).leak(); without any statics.

@koivunej koivunej added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt m/good_first_issue Moment: when doing your first Neon contributions and removed m/good_first_issue Moment: when doing your first Neon contributions labels Jul 26, 2023
@koivunej
Copy link
Member Author

Dropping the good_first_issue: PageserverMetrics needs to be a registry in this idea. There are probably dragons hidden in this approach I cannot forsee. Otherwise we could just have the metrics registered near the operations.

@koivunej
Copy link
Member Author

koivunej commented Aug 3, 2023

Trialing this approach in #4892, I think it's looking quite good.

koivunej added a commit that referenced this issue Aug 4, 2023
We don't know how our s3 remote_storage is performing, or if it's
blocking the shutdown. Well, for sampling reasons, we will not really
know even after this PR.

Add metrics:
- align remote_storage metrics towards #4813 goals
- histogram
`remote_storage_s3_request_seconds{request_type=(get_object|put_object|delete_object|list_objects),
result=(ok|err|cancelled)}`
- histogram `remote_storage_s3_wait_seconds{request_type=(same kinds)}`
- counter `remote_storage_s3_cancelled_waits_total{request_type=(same
kinds)}`

Follow-up work:
- After release, remove the old metrics, migrate dashboards

Histogram buckets are rough guesses, need to be tuned. In pageserver we
have a download timeout of 120s, so I think the 100s bucket is quite
nice.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

No branches or pull requests

1 participant