Remote storage metrics #4892
4c6e53a to f00f900
1264 tests run: 1213 passed, 0 failed, 51 skipped
LGTM
Have you tested it locally? Not sure it's easy to construct a test case that exercises failures. There should be a case where we check for the index part and get a 404; that would be an easy place to add a check (sadly I don't remember the test name).
I was hoping to test it on staging, but I'll try to find that test case later.
I did manual checking on that particular test case you found
Also realized that the |
f24181c to 5e82f26
This happens by adding yet another IntCounterVec called `remote_storage_s3_cancelled_waits_total` to track how many cancellations happen while waiting for the semaphore, and a new dimension on `remote_storage_s3_request_seconds` for requests that end up cancelled.
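One way to count cancelled waits is a drop guard: in async Rust a cancelled wait shows up as the future being dropped, so a guard that is still armed when dropped can bump the counter. The following is a minimal std-only sketch of that idea, not the code from this PR; `CancelledWaits` and `WaitGuard` are hypothetical names, and the plain atomic stands in for one label value of the real IntCounterVec.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for one label value of
/// `remote_storage_s3_cancelled_waits_total`.
pub struct CancelledWaits(AtomicU64);

impl CancelledWaits {
    pub const fn new() -> Self {
        CancelledWaits(AtomicU64::new(0))
    }

    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Held while waiting for a semaphore permit (hypothetical type).
/// If it is dropped while still armed, the wait was abandoned
/// before a permit was acquired, so the counter is incremented.
pub struct WaitGuard<'a> {
    counter: &'a CancelledWaits,
    armed: bool,
}

impl<'a> WaitGuard<'a> {
    pub fn start(counter: &'a CancelledWaits) -> Self {
        WaitGuard { counter, armed: true }
    }

    /// Call once the permit has been acquired; disarms the guard
    /// so that the drop at the end of this function counts nothing.
    pub fn acquired(mut self) {
        self.armed = false;
    }
}

impl<'a> Drop for WaitGuard<'a> {
    fn drop(&mut self) {
        if self.armed {
            self.counter.0.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```

In an async function the guard would be created just before awaiting the permit and disarmed with `acquired()` right after; dropping the future in between leaves the guard armed and records one cancellation.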
5e82f26 to 3172d0b
LGTM, minor nits
WDYT if we split s3_bucket.rs to move the metrics out of it? The metrics now take up half of it (excluding tests).
We don't know how our s3 remote_storage is performing, or if it's blocking the shutdown. Well, for sampling reasons, we will not really know even after this PR.
Add metrics:
remote_storage_s3_request_seconds{request_type=(get_object|put_object|delete_object|list_objects), result=(ok|err|cancelled)}
remote_storage_s3_wait_seconds{request_type=(same kinds)}
remote_storage_s3_cancelled_waits_total{request_type=(same kinds)}
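The label space of the first metric can be sketched with two plain enums. This is a std-only illustration rather than the PR's implementation: `RequestKind`, `AttemptOutcome` and `request_seconds_series` are hypothetical names, and real code would register a histogram vec with a metrics library instead of formatting strings.

```rust
/// The four request types from the `request_type` label (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RequestKind {
    Get,
    Put,
    Delete,
    List,
}

impl RequestKind {
    pub const ALL: [RequestKind; 4] = [
        RequestKind::Get,
        RequestKind::Put,
        RequestKind::Delete,
        RequestKind::List,
    ];

    pub fn as_str(self) -> &'static str {
        match self {
            RequestKind::Get => "get_object",
            RequestKind::Put => "put_object",
            RequestKind::Delete => "delete_object",
            RequestKind::List => "list_objects",
        }
    }
}

/// The three values of the `result` label (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AttemptOutcome {
    Ok,
    Err,
    Cancelled,
}

impl AttemptOutcome {
    pub const ALL: [AttemptOutcome; 3] = [
        AttemptOutcome::Ok,
        AttemptOutcome::Err,
        AttemptOutcome::Cancelled,
    ];

    pub fn as_str(self) -> &'static str {
        match self {
            AttemptOutcome::Ok => "ok",
            AttemptOutcome::Err => "err",
            AttemptOutcome::Cancelled => "cancelled",
        }
    }
}

/// Lists every label combination of `remote_storage_s3_request_seconds`:
/// 4 request types x 3 results = 12 series.
pub fn request_seconds_series() -> Vec<String> {
    let mut out = Vec::new();
    for kind in RequestKind::ALL {
        for outcome in AttemptOutcome::ALL {
            out.push(format!(
                "remote_storage_s3_request_seconds{{request_type=\"{}\",result=\"{}\"}}",
                kind.as_str(),
                outcome.as_str()
            ));
        }
    }
    out
}
```

Using enums for both labels keeps the set of series fixed and small, which also makes it possible to index pre-created metric handles by enum value instead of looking them up by string on every request.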
Follow-up work:
Histogram buckets are rough guesses and need to be tuned. In pageserver we have a download timeout of 120s, so I think the 100s bucket is quite nice.
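To make the bucket discussion concrete, here is a small std-only sketch of how an observation maps to a bucket under Prometheus-style `le` (less-than-or-equal) upper bounds. The bucket values are assumptions for illustration: only the 100s top bucket and the 120s timeout come from the description above, and `BUCKETS` and `bucket_for` are hypothetical names.

```rust
/// Hypothetical bucket upper bounds in seconds. Only the final 100s
/// bucket is mentioned in the PR; the rest are placeholders.
pub const BUCKETS: [f64; 6] = [0.01, 0.1, 1.0, 10.0, 30.0, 100.0];

/// Returns the index of the first bucket whose upper bound is >= the
/// observation, or `None` when the value only fits the implicit +Inf bucket.
pub fn bucket_for(seconds: f64) -> Option<usize> {
    BUCKETS.iter().position(|&le| seconds <= le)
}
```

With these assumed bounds, a request that finishes in 0.5s lands in the 1.0s bucket, while one that runs past 100s up to the 120s timeout is only visible in the +Inf bucket.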