Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote storage metrics #4892

Merged
merged 15 commits into from
Aug 4, 2023
Merged

Remote storage metrics #4892

merged 15 commits into from
Aug 4, 2023

Conversation

koivunej
Copy link
Member

@koivunej koivunej commented Aug 3, 2023

We don't know how our s3 remote_storage is performing, or if it's blocking the shutdown. Well, for sampling reasons, we will not really know even after this PR.

Add metrics:

  • align remote_storage metrics towards Overhaul pageserver/src/metrics.rs #4813 goals
  • histogram remote_storage_s3_request_seconds{request_type=(get_object|put_object|delete_object|list_objects), result=(ok|err|cancelled)}
  • histogram remote_storage_s3_wait_seconds{request_type=(same kinds)}
  • counter remote_storage_s3_cancelled_waits_total{request_type=(same kinds)}

Follow-up work:

  • After release, remove the old metrics, migrate dashboards

Histogram buckets are rough guesses, need to be tuned. In pageserver we have a download timeout of 120s, so I think the 100s bucket is quite nice.

@koivunej koivunej force-pushed the remote_storage_metrics branch from 4c6e53a to f00f900 Compare August 3, 2023 15:32
@github-actions
Copy link

github-actions bot commented Aug 3, 2023

1264 tests run: 1213 passed, 0 failed, 51 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_crafted_wal_end[last_wal_record_crossing_segment]: release

@koivunej koivunej marked this pull request as ready for review August 3, 2023 15:59
@koivunej koivunej requested a review from a team as a code owner August 3, 2023 15:59
@koivunej koivunej requested review from skyzh and removed request for a team August 3, 2023 15:59
Copy link
Contributor

@LizardWizzard LizardWizzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Have you tested it locally? Not sure its easy to construct a test case that exercises failures. There should be a case when we check for index part and get 404, this can be an easy one to add a check (sadly dont remember what is the test name)

@koivunej
Copy link
Member Author

koivunej commented Aug 3, 2023

I was hoping to test it on staging, but I'll try to find that test case later.

@koivunej
Copy link
Member Author

koivunej commented Aug 4, 2023

I did manual checking on that particular test case you found test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-mock_s3-timeline-delete-after-index-delete]:

remote_storage_s3_failures_count{request_type="delete_object"} 0 => 0
remote_storage_s3_failures_count{request_type="get_object"} 0 => 0
remote_storage_s3_failures_count{request_type="list_objects"} 0 => 0
remote_storage_s3_failures_count{request_type="put_object"} 0 => 0

remote_storage_s3_requests_count{request_type="delete_object"} 51 => 0
remote_storage_s3_requests_count{request_type="get_object"} 0 => 2
remote_storage_s3_requests_count{request_type="list_objects"} 1 => 0
remote_storage_s3_requests_count{request_type="put_object"} 93 => 0

# histogram count as "stand-in" for actual histogram buckets
# i now realize that these are now duplicating the remote_storage_s3_requests_count

remote_storage_s3_request_seconds_count{request_type="delete_object",result="ok"} 51 => 0
remote_storage_s3_request_seconds_count{request_type="delete_object",result="err"} 0 => 0

remote_storage_s3_request_seconds_count{request_type="get_object",result="ok"} 0 => 1 
remote_storage_s3_request_seconds_count{request_type="get_object",result="err"} 0 => 1

remote_storage_s3_request_seconds_count{request_type="list_objects",result="ok"} 1 => 0
remote_storage_s3_request_seconds_count{request_type="list_objects",result="err"} 0 => 0

remote_storage_s3_request_seconds_count{request_type="put_object",result="ok"} 93 => 0
remote_storage_s3_request_seconds_count{request_type="put_object",result="err"} 0 => 0

remote_storage_s3_wait_seconds_count{request_type="delete_object"} 51 => 0 (matches request_seconds_count)
remote_storage_s3_wait_seconds_count{request_type="get_object"} 0 => 2
remote_storage_s3_wait_seconds_count{request_type="list_objects"} 1 => 0
remote_storage_s3_wait_seconds_count{request_type="put_object"} 93 => 0

Also realized that the remote_storage_s3_{requests,failures}_count and the built-in count on histograms ("{ok,err}") is now duplicate information. Need to check if many dashboards are relying on this.

@koivunej koivunej force-pushed the remote_storage_metrics branch from f24181c to 5e82f26 Compare August 4, 2023 14:07
this happens by adding a yet another IntCounterVec called
`remote_storage_s3_cancelled_waits_total` to track how many
cancellations happen while waiting for semaphore, and a new dimension
for `remote_storage_s3_request_seconds` for requests which end up
cancelled.
@koivunej koivunej force-pushed the remote_storage_metrics branch from 5e82f26 to 3172d0b Compare August 4, 2023 14:47
Copy link
Contributor

@LizardWizzard LizardWizzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor nits

WDYT if we split s3_bucket.rs to move metrics out of it? Metrics now takes half of it (excluding tests)

@koivunej koivunej enabled auto-merge (squash) August 4, 2023 17:24
@koivunej koivunej merged commit ea3e1b5 into main Aug 4, 2023
@koivunej koivunej deleted the remote_storage_metrics branch August 4, 2023 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants