rpc server metrics impl #3913
Conversation
crates/rpc/rpc-builder/src/lib.rs (Outdated)
@@ -1242,7 +1246,7 @@ impl RpcServerConfig {
            http_cors_domains: Some(http_cors.clone()),
            ws_cors_domains: Some(ws_cors.clone()),
        }
-       .into())
+       .into());
this was done by my IDE's autoformat on save - I'll try to take some time to revert these in the next commit
cool, this is a great draft
we want nightly formatting: cargo +nightly fmt
there should be instructions for VS Code for this somewhere
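For reference, one possible VS Code setup for this is to point rust-analyzer's rustfmt at the nightly toolchain; this is an assumption about the local setup, not the repo's documented instructions:

// .vscode/settings.json (hypothetical snippet; JSONC comments are allowed here)
{
    "rust-analyzer.rustfmt.extraArgs": ["+nightly"]
}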
#[derive(Metrics, Clone)]
#[metrics(dynamic = true)]
pub(crate) struct RpcServerMetrics {
    // TODO: define relevant metric attributes
here's what we want:
calls_started: Counter,
successful_calls: Counter,
failed_calls: Counter,
requests_started: Counter,
requests_finished: Counter,
ws_sessions_opened: Counter,
ws_sessions_closed: Counter
maybe a histogram for call durations that tracks milliseconds, but we could do this separately
the active/open numbers are then derived by comparing these values in the dashboard
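As a rough illustration only (not the code in this PR), a struct with those fields could look like the following, assuming reth's Metrics derive macro and the metrics crate's Counter/Histogram handles; the import paths and the scope name are assumptions:

use metrics::{Counter, Histogram};
use reth_metrics_derive::Metrics;

/// Metrics for the RPC server, mirroring the list above.
#[derive(Metrics, Clone)]
#[metrics(scope = "rpc_server")] // scope name is a guess, not the PR's value
pub(crate) struct RpcServerMetrics {
    /// The number of calls started
    calls_started: Counter,
    /// The number of successful calls
    successful_calls: Counter,
    /// The number of failed calls
    failed_calls: Counter,
    /// The number of requests started
    requests_started: Counter,
    /// The number of requests finished
    requests_finished: Counter,
    /// The number of ws sessions opened
    ws_sessions_opened: Counter,
    /// The number of ws sessions closed
    ws_sessions_closed: Counter,
    /// Latency for a single call, in milliseconds
    call_latency: Histogram,
}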
ah I pushed up my changes before seeing this comment.
I've got:
/// The number of ws requests currently being served
ws_req_count: Gauge,
/// The number of http requests currently being served
http_req_count: Gauge,
/// The number of ws sessions currently active
ws_session_count: Gauge,
/// The number of http connections currently active
http_session_count: Gauge,
/// Latency for a single request/response pair
request_latency: Histogram,
and inline:
    // note that on_call will be called multiple times in case of a batch request
    // increment method call count; use the macro because derive(Metrics) doesn't seem to
    // support dynamically configuring the metric name (?)
    let metric_call_count_name =
        format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_COUNT_METRIC, "_", method_name);
    // this could be a gauge since one call here should map to one "result" in on_result
    counter!(metric_call_count_name, 1);
}

fn on_result(
    &self,
    method_name: &str,
    success: bool,
    started_at: Self::Instant,
    _transport: TransportProtocol,
) {
    // capture method call latency; use the macro for the same reason stated in on_call
    let metric_name_call_latency =
        format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_LATENCY_METRIC, "_", method_name);
    histogram!(metric_name_call_latency, started_at.elapsed());
    if !success {
        // capture error count for the method call; same reason as in on_call
        let metric_name_call_error_count =
            format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_ERROR_METRIC, "_", method_name);
        counter!(metric_name_call_error_count, 1);
    }
let me reconcile that with what you asked for
ah so the active/open sessions and requests I'm calculating with a gauge - incrementing on connect/request and decrementing on disconnect/response
is that OK?
edit: tagging you just to make sure @mattsse
I can go the counter route with an open counter and closed counter like you suggested as well. this seemed to make more sense to me though
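For illustration, the gauge approach described here might look roughly like the sketch below, assuming the metrics crate's register_gauge! macro and Gauge handle (0.21-style API); the surrounding jsonrpsee Logger hooks are elided since their exact signatures depend on the jsonrpsee version, and the type and method names are hypothetical:

use metrics::{register_gauge, Gauge};

/// Gauge-based session tracking (illustrative sketch, not this PR's final code).
#[derive(Clone)]
struct SessionGauges {
    /// The number of ws sessions currently active
    ws_session_count: Gauge,
}

impl SessionGauges {
    fn new() -> Self {
        Self { ws_session_count: register_gauge!("rpc_server_ws_session_count") }
    }

    /// Would be called from the logger's connect hook for a ws transport.
    fn session_opened(&self) {
        self.ws_session_count.increment(1.0);
    }

    /// Would be called from the logger's disconnect hook for a ws transport.
    fn session_closed(&self) {
        self.ws_session_count.decrement(1.0);
    }
}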
alright I went ahead and reconciled what you requested with what I had before, mainly:
- keeping the per-method call metrics I had originally added - I think in production, this level of granularity will help for debugging purposes and isolating issues with a specific method
- modifying the open/closed session counters to separate counters instead of using one gauge for both
cool, this looks great.
I'd like to hold off on anything dynamic for now, but other than that this looks great
// increment call count per method; use the macro because derive(Metrics) doesn't seem to
// support dynamically configuring the metric name (?)
let metric_call_count_name =
    format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_COUNT_METRIC, "_", method_name);
counter!(metric_call_count_name, 1);
this is a bit expensive and we don't really need this dynamic value, so this can be removed
for sure makes sense
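One way the non-dynamic replacement could look, assuming the pre-registered calls_started: Counter field from the struct sketch earlier; the helper name is hypothetical and would be invoked from the on_call hook:

impl RpcServerMetrics {
    /// Hypothetical helper called from on_call instead of the dynamically
    /// named counter! macro above.
    fn record_call_started(&self) {
        // pre-registered handle: no per-call string formatting or registry lookup
        self.calls_started.increment(1);
    }
}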
// capture per method call latency
let metric_name_call_latency =
    format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_LATENCY_METRIC, "_", method_name);
histogram!(metric_name_call_latency, started_at.elapsed().as_millis());
I like this
cool, I left the generic call latency metric and removed this dynamic per-method-call one
it may make sense to somehow add metrics with a method call dimension back in at some point
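For reference, the non-dynamic result handling might look roughly like this; it mirrors the on_result signature quoted earlier (so it would sit inside the jsonrpsee Logger impl), and assumes a hypothetical self.metrics field holding the successful_calls/failed_calls counters and call_latency histogram from the struct sketch above:

fn on_result(
    &self,
    _method_name: &str,
    success: bool,
    started_at: Self::Instant,
    _transport: TransportProtocol,
) {
    // record latency for every call, without a per-method label
    self.metrics.call_latency.record(started_at.elapsed().as_millis() as f64);
    if success {
        self.metrics.successful_calls.increment(1);
    } else {
        self.metrics.failed_calls.increment(1);
    }
}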
// capture per method call latency
let metric_name_call_latency =
    format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_LATENCY_METRIC, "_", method_name);
I need to think about this first,
until then we should remove dynamic labels like this
// capture error count per method call
let metric_name_call_error_count =
    format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_ERROR_METRIC, "_", method_name);
same
rebased against mainline and pushed up to pull in the commits that were just merged in
this is great work, tysm
now we can integrate this into the dashboard, @Rjected probably has some ideas for this, will open an issue
skeleton draft PR for #3907, please excuse the formatting changes my IDE's format-on-save rustfmt applied. If need be, I can change to the repo's fmt standard, not sure what's different with my setup.
@mattsse if this looks on the right path I can keep at it - the questions I have right now revolve around:
- does adding the type param RpcServerMetrics to the jsonrpsee::server::Server<Identity, L: ()> in WsHttpServerKind seem OK, or is there a cleaner way to do it? (rough sketch of what I mean below)
- the metrics scope implementation right now changes depending on whether the WS/HTTP servers are combined or not, should it be that way?
- and generally I haven't started thinking through what exact metrics we want to track in RpcServerMetrics
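To make the first question concrete, here is a heavily simplified sketch of what threading the logger type parameter through could look like; the variant names, the Identity import path, and the exact generics of jsonrpsee's Server are assumptions for illustration, not the code in this PR, and RpcServerMetrics refers to the struct sketched earlier:

use jsonrpsee::server::Server;
use tower::layer::util::Identity;

/// Illustrative only: the ws/http server enum carrying the metrics logger as
/// the second type parameter of jsonrpsee's Server.
enum WsHttpServerKind {
    /// Plain server without CORS, parameterized over the RpcServerMetrics logger
    Plain(Server<Identity, RpcServerMetrics>),
    // ... other variants (e.g. with a CORS layer) would thread the same
    // logger type through their Server generics
}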