
rpc server metrics impl #3913

Merged: 5 commits, Jul 26, 2023

Conversation

@0xprames (Contributor) commented Jul 25, 2023

skeleton draft PR for #3907, please excuse the formatting changes my IDE's rustfmt-on-save applied. If need be, I can change to the repo's fmt standard; not sure what's different with my setup.

@mattsse if this looks like it's on the right path I can keep at it. The questions I have right now revolve around:

  • does adding the type param RpcServerMetrics to the jsonrpsee::server::Server<Identity, L: ()> in WsHttpServerKind seem OK, or is there a cleaner way to do it? (rough sketch of what I mean below)

  • the metrics scope implementation currently changes depending on whether the WS/HTTP servers are combined or not; should it be that way?

I also generally haven't started thinking through what exact metrics we want to track in RpcServerMetrics.
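
For the first question, a minimal sketch of what the type-parameter change amounts to (not this PR's code; it assumes jsonrpsee's ServerBuilder::set_logger API from the version in use, and the bind address is made up):

    use jsonrpsee::server::ServerBuilder;

    async fn start_with_metrics(metrics: RpcServerMetrics) {
        // The default builder is ServerBuilder<Identity, ()>; set_logger swaps the
        // logger type parameter to RpcServerMetrics, which is the type that then
        // appears in WsHttpServerKind's Server<Identity, RpcServerMetrics>.
        let _server = ServerBuilder::default()
            .set_logger(metrics)
            .build("127.0.0.1:8545")
            .await
            .expect("failed to build rpc server");
    }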

@@ -1242,7 +1246,7 @@ impl RpcServerConfig {
http_cors_domains: Some(http_cors.clone()),
ws_cors_domains: Some(ws_cors.clone()),
}
.into())
.into());

@0xprames (Contributor Author):

this was done by my IDE's autofmt on save; I'll try and take some time to revert these in the next commit

@0xprames marked this pull request as draft July 25, 2023 17:28

@mattsse (Collaborator) left a comment:

cool, this is a great draft

we want nightly formatting: cargo +nightly fmt

there should be instructions for VS Code for this somewhere

#[derive(Metrics, Clone)]
#[metrics(dynamic = true)]
pub(crate) struct RpcServerMetrics {
//TODO:: define relevant metric attributes

@mattsse (Collaborator):

here's what we want:

calls_started: Counter,
successful_calls: Counter,
failed_calls: Counter,
requests_started: Counter,
requests_finished: Counter,
ws_sessions_opened: Counter,
ws_sessions_closed: Counter

maybe a histogram for call durations that tracks milliseconds, but we could do this separately

the active/open numbers are then derived by comparing these values in the dashboard
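
For illustration, a minimal sketch of the struct filled in with these fields, following the derive(Metrics) skeleton from the diff above (import paths are approximate; the doc comments become the metric descriptions):

    use metrics::Counter;
    use reth_metrics_derive::Metrics;

    #[derive(Metrics, Clone)]
    #[metrics(dynamic = true)]
    pub(crate) struct RpcServerMetrics {
        /// The number of calls started
        calls_started: Counter,
        /// The number of successful calls
        successful_calls: Counter,
        /// The number of failed calls
        failed_calls: Counter,
        /// The number of requests started
        requests_started: Counter,
        /// The number of requests finished
        requests_finished: Counter,
        /// The number of ws sessions opened
        ws_sessions_opened: Counter,
        /// The number of ws sessions closed
        ws_sessions_closed: Counter,
    }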

@0xprames (Contributor Author):

ah I pushed up my changes before seeing this comment.

I've got:

    /// The number of ws requests currently being served
    ws_req_count: Gauge,
    /// The number of http requests currently being served
    http_req_count: Gauge,
    /// The number of ws sessions currently active
    ws_session_count: Gauge,
    /// The number of http connections currently active
    http_session_count: Gauge,
    /// Latency for a single request/response pair
    request_latency: Histogram,

and inline:

        // note that on_call will be called multiple times in case of a batch request
        // increment method call count, use macro because derive(Metrics) doesn't seem to support dynamically configuring the metric name (?)
        let metric_call_count_name =
            format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_COUNT_METRIC, "_", method_name);
        counter!(metric_call_count_name, 1); // this could be a gauge since one call here should map to one "result" in on_result
    }
    fn on_result(
        &self,
        method_name: &str,
        success: bool,
        started_at: Self::Instant,
        _transport: TransportProtocol,
    ) {
        // capture method call latency, use macro for the same reason stated in on_call
        let metric_name_call_latency =
            format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_LATENCY_METRIC, "_", method_name);
        histogram!(metric_name_call_latency, started_at.elapsed());
        if !success {
            // capture error count for the method call, use macro for the same reason stated in on_call
            let metric_name_call_error_count =
                format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_ERROR_METRIC, "_", method_name);
            counter!(metric_name_call_error_count, 1);
        }

@0xprames (Contributor Author):

let me reconcile that with what you asked for

@0xprames (Contributor Author) commented Jul 25, 2023:

ah, so I'm calculating the active/open sessions and requests with a gauge: incrementing on connect/request and decrementing on disconnect/response. Is that OK?

edit: tagging you just to make sure @mattsse

I can go the counter route with an open counter and a closed counter like you suggested as well; this seemed to make more sense to me though.
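
A minimal sketch of the gauge approach described here (the type and method names are illustrative, not the PR's actual Logger impl): the gauge goes up on connect and back down on disconnect, so its current value is the live session count.

    use metrics::Gauge;

    struct SessionMetrics {
        /// The number of ws sessions currently active
        ws_session_count: Gauge,
    }

    impl SessionMetrics {
        fn on_ws_connect(&self) {
            // one more live session
            self.ws_session_count.increment(1.0);
        }

        fn on_ws_disconnect(&self) {
            // the session ended, so the live count drops back down
            self.ws_session_count.decrement(1.0);
        }
    }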

@0xprames (Contributor Author) commented Jul 25, 2023:

alright, I went ahead and reconciled what you requested with what I had before

mainly:

  • keeping the per-method call metrics I had originally added - I think in production, this level of granularity will help with debugging and isolating issues with a specific method

  • changing the open/closed session tracking to two separate counters instead of one gauge for both (sketched below)
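
For contrast with the gauge version above, a minimal sketch of the counter-pair variant from the second bullet (names illustrative): both counters only ever increase, and the dashboard derives the active count as opened minus closed.

    use metrics::Counter;

    struct WsSessionMetrics {
        /// Total ws sessions ever opened
        ws_sessions_opened: Counter,
        /// Total ws sessions ever closed
        ws_sessions_closed: Counter,
    }

    impl WsSessionMetrics {
        fn on_ws_connect(&self) {
            self.ws_sessions_opened.increment(1);
        }

        fn on_ws_disconnect(&self) {
            self.ws_sessions_closed.increment(1);
        }
    }

    // active sessions = ws_sessions_opened - ws_sessions_closed, computed in the dashboard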

@0xprames force-pushed the add-rpc-metrics branch 2 times, most recently from 158574e to 986e295 on July 25, 2023 23:52
@0xprames marked this pull request as ready for review July 25, 2023 23:53
@0xprames requested a review from mattsse July 25, 2023 23:53
@0xprames changed the title from "rpc server metrics skeleton impl" to "rpc server metrics impl" Jul 26, 2023

@mattsse (Collaborator) left a comment:

cool, this looks great.

I'd like to hold off on anything dynamic for now, but other than that this looks great

Comment on lines 64 to 68
// increment call count per method, use macro because derive(Metrics) doesn't seem to support
// dynamically configuring the metric name (?)
let metric_call_count_name =
format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_COUNT_METRIC, "_", method_name);
counter!(metric_call_count_name, 1);

@mattsse (Collaborator):

this is a bit expensive and we don't really need this dynamic value, so this can be removed

@0xprames (Contributor Author):

for sure, makes sense
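
A minimal sketch of the non-dynamic replacement agreed on here (the helper name is illustrative): instead of formatting a per-method metric name on every call, a pre-registered counter field is bumped, so nothing is allocated on the hot path.

    // before: builds a metric name string on every call
    // let name = format!("{}_{}_{}", METRICS_SCOPE, CALL_COUNT_METRIC, method_name);
    // counter!(name, 1);

    // after: one static, pre-registered counter covering all methods
    fn record_call_started(metrics: &RpcServerMetrics) {
        metrics.calls_started.increment(1);
    }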

// capture per method call latency
let metric_name_call_latency =
format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_LATENCY_METRIC, "_", method_name);
histogram!(metric_name_call_latency, started_at.elapsed().as_millis());

@mattsse (Collaborator):

I like this

@0xprames (Contributor Author) commented Jul 26, 2023:

cool, I kept the generic call latency metric and removed this dynamic per-method-call one

it may make sense to add metrics with a per-method-call dimension back in at some point
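
A minimal sketch of what the kept generic latency metric could look like (the field name follows the earlier draft; the helper and its signature are illustrative rather than jsonrpsee's exact Logger signature): one histogram for all calls, recorded in milliseconds when the result comes back.

    use std::time::Instant;

    fn record_call_latency(metrics: &RpcServerMetrics, started_at: Instant) {
        // milliseconds since the call started, recorded against a single
        // pre-registered histogram instead of a per-method dynamic one
        metrics.request_latency.record(started_at.elapsed().as_millis() as f64);
    }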

Comment on lines 79 to 81
// capture per method call latency
let metric_name_call_latency =
format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_LATENCY_METRIC, "_", method_name);

@mattsse (Collaborator):

I need to think about this first; until then we should remove dynamic labels like this

Comment on lines 85 to 87
// capture error count per method call
let metric_name_call_error_count =
format!("{}{}{}{}{}", METRICS_SCOPE, "_", CALL_ERROR_METRIC, "_", method_name);

@mattsse (Collaborator):

same

@0xprames requested a review from mattsse July 26, 2023 12:30

@0xprames (Contributor Author):

rebased against mainline and pushed up to pull in the commits that were just merged in

@mattsse (Collaborator) left a comment:

this is great work, tysm

now we can integrate this into the dashboard, @Rjected probably has some ideas for this, will open an issue

@codecov (bot) commented Jul 26, 2023

Codecov Report

Merging #3913 (ec43302) into main (d5ea168) will decrease coverage by 0.03%.
The diff coverage is 98.24%.

Files Changed Coverage Δ
crates/rpc/rpc-builder/src/metrics.rs 98.00% <98.00%> (ø)
crates/rpc/rpc-builder/src/lib.rs 65.47% <100.00%> (+0.21%) ⬆️

... and 10 files with indirect coverage changes

Flag Coverage Δ
integration-tests 15.54% <98.24%> (+0.06%) ⬆️
unit-tests 64.42% <0.00%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

Components Coverage Δ
reth binary 27.24% <ø> (ø)
blockchain tree 83.01% <ø> (ø)
pipeline 89.68% <ø> (ø)
storage (db) 74.19% <ø> (ø)
trie 94.70% <ø> (ø)
txpool 46.00% <ø> (-0.60%) ⬇️
networking 77.65% <ø> (-0.06%) ⬇️
rpc 58.54% <98.24%> (+0.16%) ⬆️
consensus 64.46% <ø> (ø)
revm 33.68% <ø> (ø)
payload builder 6.61% <ø> (ø)
primitives 88.03% <ø> (-0.01%) ⬇️

@mattsse added this pull request to the merge queue Jul 26, 2023
Merged via the queue into paradigmxyz:main with commit caa2683 Jul 26, 2023

@0xprames (Contributor Author):

ah I see - my rust-analyzer plugin should've picked up those warnings/compiler errors that you fixed, but it was yelling about some other issue which I didn't notice:
[screenshot: rust-analyzer output, 2023-07-26 9:37 AM]

I assumed it was all good and didn't rerun a cargo build, whoops.

thanks for the touchups
