
Prometheus metrics #29

Merged
merged 22 commits into from Jun 17, 2022
Conversation

enriquefynn
Member

Add Prometheus metrics to Solana, served on the RPC URL and port at the /prometheus path.

  • solana_bank_slot: Current Slot.
  • solana_bank_epoch: Current Epoch.
  • solana_bank_successful_transaction_count: Number of successful transactions, including votes.
  • solana_bank_error_transaction_count: Number of transactions that resulted in errors.
  • solana_cluster_identity_public_key_info: Public key for the node's identity.
  • solana_cluster_identity_balance_total: Balance of the node's identity.
  • solana_cluster_node_version_info: Solana version.
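
As a rough sketch of how one of these families is written (not the exact code in this PR; it assumes the write_metric, MetricFamily, and Metric helpers that the new prometheus crate provides, as seen in the diff excerpts further down, and the function name is just for illustration):

use std::io;

pub fn write_bank_slot_metric<W: io::Write>(bank: &Bank, out: &mut W) -> io::Result<()> {
    // A gauge, because the slot is a point-in-time value read at scrape time.
    write_metric(
        out,
        &MetricFamily {
            name: "solana_bank_slot",
            help: "Current Slot",
            type_: "gauge",
            metrics: vec![Metric::new(bank.slot())],
        },
    )
}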

@enriquefynn enriquefynn requested a review from ruuda June 8, 2022 11:19

@ruuda ruuda left a comment


Looks pretty nice so far! I think this is a really elegant setup: put everything in its own crate, and add a minimally invasive hook in the current RPC service. I think there is a good chance of this being accepted upstream as it touches so little existing code. 👍

.body(hyper::Body::from(self.health_check()))
.unwrap()
.into(),
"/prometheus" => hyper::Response::builder()

The standard location for exposing Prometheus metrics is at /metrics.

Suggested change
"/prometheus" => hyper::Response::builder()
"/metrics" => hyper::Response::builder()

@@ -0,0 +1,306 @@
// SPDX-FileCopyrightText: 2022 Chorus One AG
// SPDX-License-Identifier: GPL-3.0

If we put our GPL-licensed code from Solido in here, I expect it would not be accepted upstream. But we own the copyright on this, so we could decide to license it under Apache here.

use std::{io, sync::Arc, time::SystemTime};

pub fn write_bank_metrics<W: io::Write>(
at: SystemTime,

We use this `at` in Solido because it proxies metrics: it has an internal polling loop that is not in sync with when Prometheus fetches the metrics. But now that we are building the metrics into Solana, we can get the latest values right at the moment that Prometheus polls us, so there is no need to add the timestamps.
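
Concretely (just a sketch; the surrounding write_metric call stays the same):

// Solido's proxy attaches the time at which it sampled the value:
//     metrics: vec![Metric::new(bank.slot()).at(at)],
// Reading the bank at scrape time, the timestamp can simply be dropped:
//     metrics: vec![Metric::new(bank.slot())],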

name: "solana_bank_slot",
help: "Current Slot",
type_: "gauge",
metrics: vec![Metric::new(bank.slot()).at(at)],

If I understand correctly, Solana tracks multiple banks for the multiple commitment levels (the finalized one, the confirmed one, etc.)

I think it would make sense to expose metrics for all of them with different labels, e.g. solana_bank_slot{commitment="finalized"}, solana_bank_slot{commitment="unconfirmed"}, etc.
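
A sketch of what that could look like, assuming a bank per commitment level is in hand and the with_label helper from the Solido metrics utility (its exact signature here is an assumption, and the three level names are the ones the PR later settles on):

metrics: vec![
    Metric::new(finalized_bank.slot()).with_label("commitment", "finalized".to_owned()),
    Metric::new(confirmed_bank.slot()).with_label("commitment", "confirmed".to_owned()),
    Metric::new(processed_bank.slot()).with_label("commitment", "processed".to_owned()),
],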

Member Author


Can we do this at a later point? I can't find the API in the structs, have to do some digging on which slots to get from banks.


Exposing all of {unconfirmed,confirmed,finalized} separately, yes we can do that at a later point, one of them is fine for now.

But we do have to know which one it is currently! If we don’t know which bank we get, then we don’t know what the metric means! We should also add the label already, so that there is no ambiguity in the data once we start collecting it.

out,
&MetricFamily {
name: "solana_bank_successful_transaction_count",
help: "Number of transactions in the block that executed successfully",

So this is only about the most recent block?

Member Author

yeah, this is the bank at the highest slot


Not sure how useful that is, if Prometheus samples every ~15 seconds, it would sample once every ~19 blocks. I guess it could show high-level trends if we average over some time window, but I expect it to be very noisy. A counter would be much nicer, because it doesn’t miss events that happened in between the two sampling events. But if we only sample the bank on demand, then we can’t build a counter, unless Solana already tracks one.
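
To illustrate the gauge/counter distinction (a sketch; the value names are hypothetical, and whether Solana already tracks a running total is exactly the open question here):

// Gauge sampled from the latest bank only: a 15 s scrape interval sees roughly
// one block out of every ~19 produced, so everything in between is invisible.
write_metric(
    out,
    &MetricFamily {
        name: "solana_bank_successful_transaction_count",
        help: "Number of transactions in the block that executed successfully",
        type_: "gauge",
        metrics: vec![Metric::new(successful_txs_in_latest_block)],
    },
)?;

// Counter: a monotonically non-decreasing total since startup. Prometheus can
// rate() it without missing blocks between two scrapes, but that only works if
// the node keeps such a running total itself.
write_metric(
    out,
    &MetricFamily {
        name: "solana_transactions_executed_total",
        help: "Number of successfully executed transactions since validator start",
        type_: "counter",
        metrics: vec![Metric::new(total_successful_txs_since_start)],
    },
)?;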

Member Author

humm.. that's fair, I'll delete it for now

write_metric(
out,
&MetricFamily {
name: "solana_cluster_identity_balance_total",

Suggested change
name: "solana_cluster_identity_balance_total",
name: "solana_cluster_identity_balance_sol",

The _total suffix is typically used for counters that count the total number of times something happened since startup. (E.g. SOL spent on votes.) But in this case we have a gauge, it can go up and down.

Also, we should include the unit in the name. See also https://prometheus.io/docs/practices/naming/

Great idea to include this metric by the way!

bank_forks: &Arc<RwLock<BankForks>>,
cluster_info: &Arc<ClusterInfo>,
) -> Vec<u8> {
let current_bank = bank_forks.read().unwrap().working_bank();

Is the bank_forks the thing where we can get the banks for the different commitment levels from?

Member Author

This is the latest one; I couldn't find an API in the struct to get slots at different commitment levels. We can do it at a later time.

@@ -0,0 +1,196 @@
use std::{

This is a nice utility from Solido, but it looks like overkill for what we need here. I think just a struct Lamports(pub u64) would also suffice.
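
Something like this (a sketch; the conversion helper and constant are mine, not part of the PR):

/// A balance in lamports; 1 SOL = 1_000_000_000 lamports.
pub struct Lamports(pub u64);

impl Lamports {
    /// Render the balance as a fractional SOL amount for the metric value.
    pub fn as_sol(&self) -> f64 {
        self.0 as f64 / 1_000_000_000.0
    }
}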

name: "solana_cluster_identity_balance_total",
help: "The current node's identity balance",
type_: "gauge",
metrics: vec![Metric::new_sol(identity_balance).at(at)],

Also, we should add a label to this metric with the account’s pubkey. That way there cannot be any ambiguity about which account we are measuring the balance of.

@ruuda

ruuda commented Jun 9, 2022

One more idea for a metric to expose (doesn’t have to be in this PR): the clock sysvar’s current time. If Solana’s clock drifts again, we will be able to observe how bad it is.
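
A sketch of what that metric could look like, reading the clock sysvar from the bank the same way the other metrics in this PR do (the metric name and the u64 cast accepted by Metric::new are assumptions):

write_metric(
    out,
    &MetricFamily {
        name: "solana_block_timestamp_seconds",
        help: "The block's UNIX timestamp, in seconds since epoch, UTC",
        type_: "gauge",
        metrics: vec![Metric::new(bank.clock().unix_timestamp as u64)],
    },
)?;

Clock drift would then be visible in Prometheus as time() minus this metric.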

@enriquefynn enriquefynn requested a review from ruuda June 13, 2022 21:09
@ruuda

ruuda commented Jun 15, 2022

Why did you delete dependabot.yml? If we can, we should turn it off in the repository settings instead; let’s not diverge more than necessary from upstream.

Edit: Ah, I see the spam now, and there seems to be no way to disable it on a repository level aside from that file 😕 Thank you for deleting it.


@ruuda ruuda left a comment


There is one pubkey label still missing. Also, we should find out which version of the bank we are reading from now and add its commitment level as a label too, even if we only read from one bank right now.

Aside from that this looks good!

out,
&MetricFamily {
name: "solana_block_timestamp_seconds",
help: "The block's timestamp",

Suggested change
help: "The block's timestamp",
help: "The block's UNIX timestamp, in seconds since epoch, UTC",

name: "solana_node_identity_balance_sol",
help: "The node's current identity balance",
type_: "gauge",
metrics: vec![Metric::new_sol(identity_balance)],

We should add the identity pubkey as label here, both to remove all ambiguity, but also because we might want to monitor multiple accounts, so our validators can monitor each other.
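
For example (a sketch; the label name and the with_label signature are assumptions, and cluster_info.id() is the node's identity pubkey):

write_metric(
    out,
    &MetricFamily {
        name: "solana_node_identity_balance_sol",
        help: "The node's current identity balance",
        type_: "gauge",
        metrics: vec![
            Metric::new_sol(identity_balance)
                .with_label("identity_account", cluster_info.id().to_string()),
        ],
    },
)?;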

@enriquefynn enriquefynn requested a review from ruuda June 15, 2022 11:53

@ruuda ruuda left a comment


I have one more comment, LGTM aside from this. Also, are you sure it’s the finalized bank?

name: "solana_block_slot",
help: "Current Slot",
type_: "gauge",
metrics: vec![Metric::new(clock.slot).with_label("commitment_level", "finalized")],

We should add the commitment level label to all the metrics we get from the bank.

@enriquefynn
Member Author

I have one more comment, LGTM aside from this. Also, are you sure it’s the finalized bank?

Hum... I think it's not finalized, but confirmed.

It's difficult to see from the code, but my guess is that Processed comes from OptimisticallyConfirmedBank and we can get finalized from the bank.

@enriquefynn
Member Author

Added a struct so we can get different levels of block commitment; this was the least invasive/cleanest thing that I could think of. We could modify how the RPC service works somewhat so it's a bit cleaner, but I think this solution is appropriate.

In this version we only get the finalized bank; we can add confirmed and processed at a later time, maybe using some macro.
Please take another look

@enriquefynn enriquefynn requested a review from ruuda June 16, 2022 14:43
banks_with_commitments: &BanksWithCommitments,
out: &mut W,
) -> io::Result<()> {
let clock_finalized = banks_with_commitments.finalized_bank.clock();

Now that this is here, it should be really easy to generalize to all commitment levels! Something like

fn for_each_commitment<F: FnMut(&Bank) -> Metric>(
    self: &BanksWithCommitments,
    mut get: F
) -> Vec<Metric> {
    vec![
        get(&*self.finalized_bank).with_label("commitment_level", "finalized".to_owned()),
        get(&*self.confirmed_bank).with_label("commitment_level", "confirmed".to_owned()),
        get(&*self.processed_bank).with_label("commitment_level", "processed".to_owned()),
    ]
}

and then we can do

    write_metric(
        out,
        &MetricFamily {
            name: "solana_block_slot",
            help: "Finalized Slot",
            type_: "gauge",
            metrics: banks.for_each_commitment(|bank|
                Metric::new(bank.clock().slot)
            ),
        },
    )?;

Member Author

Then every time we'd call .clock(). The compiler is probably smart enough to optimize this, though.


@ruuda ruuda left a comment


Thanks for diving into that, I now understand a lot more about how these bank forks work, and I am confident that we are reading the right one. Now that we have the forks, I believe it shouldn’t be too difficult to generalize all metrics over all forks.


We kind of have the metrics inside-out ... per metric, we would iterate over all bank forks. If there is some intermediate struct that we access and produce multiple metrics from, that is a bit annoying, because we have to do it once per metric. We could also switch the loops: have one big loop over the bank forks, and generate all metrics inside. One downside of this is that it’s not really how our current metrics library works. Either we collect the Metrics in vecs and format them at the end (it sounds cumbersome, but maybe it will actually be fine?), or we just format every metric multiple times. The downside of the latter is that the # TYPE and # HELP lines are repeated, and it will be a bit harder for humans to read the output that way. But Prometheus shouldn’t care, so it is a valid option.
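
A sketch of that second option, one outer loop over the forks with every family re-emitted per fork (names as in the BanksWithCommitments snippets above; the repeated # HELP / # TYPE lines are the cost mentioned):

for (bank, level) in [
    (&banks.finalized_bank, "finalized"),
    (&banks.confirmed_bank, "confirmed"),
    (&banks.processed_bank, "processed"),
] {
    write_metric(
        out,
        &MetricFamily {
            name: "solana_block_slot",
            help: "Current Slot",
            type_: "gauge",
            metrics: vec![
                Metric::new(bank.clock().slot).with_label("commitment_level", level.to_owned()),
            ],
        },
    )?;
    // ...all other bank-derived families would follow here, inside the same loop...
}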

.unwrap()
.into(),
"/metrics" => {
let rpc_processor = self.rpc_processor.as_ref().unwrap();

So I looked into the code for BankForks a bit, it looks like BankForks contains almost everything we need! We can do .get(slot) to get the bank for a particular slot. It somehow feels wrong to pass in JsonRpcRequestProcessor in the middleware that goes in front of the JsonRpcRequestProcessor ...

There is BlockCommitmentCache in JsonRpcRequestService. It looks like that stores the number of confirmations per block, and it can give us the slot for a certain commitment level. It is available when the RpcRequestMiddleware is constructed, so we can pass in that commitment cache, instead of a full JsonRpcRequestService. From there we can get the slots:

https://github.com/solana-labs/solana/blob/d9c259a2313f620b4b64b7cec0d55dd4234fbcd9/runtime/src/commitment.rs#L115-L125

For the block height metrics, we wouldn’t even need to get that slot’s bank and read its clock, we would get the slot directly.

Interesting aside, so it looks like Solana’s BankForks can only store a single bank per slot. So if there is a fork, there is no way for it to track multiple forks simultaneously. It would have to throw out all progress on one fork if it decides to switch to another.

rpc/src/rpc.rs Outdated
@@ -186,7 +186,7 @@ impl Metadata for JsonRpcRequestProcessor {}

impl JsonRpcRequestProcessor {
#[allow(deprecated)]
fn bank(&self, commitment: Option<CommitmentConfig>) -> Arc<Bank> {
pub fn bank(&self, commitment: Option<CommitmentConfig>) -> Arc<Bank> {

We wouldn’t need to touch this one if we go with the commitment cache.

cluster_info: &Arc<ClusterInfo>,
) -> Vec<u8> {
let current_bank = bank_forks.read().unwrap().working_bank();
// There are 3 levels of commitment for a bank:
// - finalized: most recent block *confirmed* by supermajority of the

Validators don’t “confirm” blocks, they only vote on blocks.

Finalized means that there are 32 descendants of the block that a supermajority of the cluster have voted on. (I might be off by one here.) I now also understand what all this “rooted” and “root slot” business is: they call the highest slot that is finalized the “root”. I suppose because it’s the root of the bank forks, other blocks on top could in theory be replaced by some other fork, but for the “root” the timeout is so long that we don’t consider it revertable.

Member Author
@enriquefynn enriquefynn Jun 16, 2022
This is the copy-paste from their docs

😬 Never mind me then.

But those docs do say something more:

the node will query the most recent block confirmed by supermajority of the cluster as having reached maximum lockout, meaning the cluster has recognized this block as finalized

(emphasis mine)

I think the use of the word “confirmed” in that sentence is a bit misleading ... from the point of view of the node we are querying, it knows that a supermajority has voted on 32 descendants of the block (and that’s a lower bound on what happened in the network), which means the block has maximum lockout, i.e. a supermajority pledged not to vote for conflicting forks of this block for basically forever.

Instead of propagating `rpc_processor`, use the `block_commitment_cache`
structure to get different bank confirmation levels
@enriquefynn
Member Author

Hey @ruuda, thanks a lot for the review, it's really improving the way this looks. I addressed a few more suggestions. Instead of propagating the rpc_processor, which did feel a bit odd, I passed the BlockCommitmentCache struct to the middleware; from there we can get the different banks with a good API.
If you can take another look, I'd be very happy 🙏

@enriquefynn enriquefynn requested a review from ruuda June 16, 2022 17:15

@ruuda ruuda left a comment


LGTM, with one small (but important) change needed to avoid a deadlock.

bank_forks: &Arc<RwLock<BankForks>>,
block_commitment_cache: &Arc<RwLock<BlockCommitmentCache>>,
) -> Self {
let block_commitment_cache = block_commitment_cache.read().unwrap();

This can lead to a deadlock

This creates a lock guard that lives until the end of the scope, which means we hold the read lock on block_commitment_cache for the entire duration of the function, even though we no longer need it after we extract the slots.

Additionally, we take the r_bank_forks read lock later in this function, so then we hold two locks. Holding two locks simultaneously is dangerous if you don’t coordinate it very well among all other places that hold the same locks. The following scenario could happen:

  • We take the read lock on block_commitment_cache.
  • Some other thread takes a write lock on bank_forks.
  • We want to take the read lock on bank_forks, but this blocks because the other thread holds a write lock.
  • The other thread wants to take a write lock on block_commitment_cache, but that blocks because we hold the read lock.
  • Deadlock!

I don’t want to audit the entire Solana codebase for all usages of those two locks to confirm that they are always acquired in the same order, and even if they are right now, there is no guarantee that nobody will mess it up in the future.

Fortunately, in this case we can easily avoid holding the locks at the same time:

let (finalized_slot, confirmed_slot, processed_slot) = {
    let block_commitment_cache = block_commitment_cache.read().unwrap();
    (
         block_commitment_cache.slot_with_commitment(solana_sdk::commitment_config::CommitmentLevel::Finalized),
         block_commitment_cache.slot_with_commitment(solana_sdk::commitment_config::CommitmentLevel::Confirmed),
         block_commitment_cache.slot_with_commitment(solana_sdk::commitment_config::CommitmentLevel::Processed),
    )
};

Now the lock guard will go out of scope at the }, before we acquire the lock on bank_forks.

Fixing the deadlock may lead to inconsistencies

So now we can avoid the deadlock, but there is another problem. The bank forks and commitment cache really belong together, the commitment cache refers to slots that the bank holds. Assuming the same order as above (we read the commitment cache, then the bank forks; some other thread writes the bank forks, then updates the cache), I can imagine the following would happen:

  • We read the commitment cache and find slot x is the finalized slot.
  • The other thread just received a new confirmed block, locks the bank forks, and puts the new block in. The new finalized slot is slot x + 1, and the bank for slot x gets evicted.
  • The other thread then updates the commitment cache to reflect the new slots, so that the commitment cache would return slot x + 1 for level finalized.
  • We try to get the bank for slot x, but by now that slot has been evicted from the bank forks.

Now I haven’t looked into detail in how the bank forks work, and when things get evicted, and how the commitment config and bank forks are updated, so this is speculation on my part. But I wouldn’t be surprised at all if this scenario is the cause of solana-labs#11078 linked below. (Which was closed because it was “fixed” by this workaround that we also apply here.)

With this insight, there is an alternative workaround: instead of falling back to the root slot (which is the same as the one with finalized commitment level), if we try to get a slot but it’s not there, that means the bank forks have changed with respect to the commitment cache since we read the commitment cache, and we could start over. We read both right after one another, so this case should be rare, and if there is, say, even a 10% probability that we need a restart, then the probability that we still haven’t succeeded after n attempts goes to 0 rapidly as n grows ... assuming that these events are really independent; in practice there is no guarantee that this process ever terminates.
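
A sketch of that start-over variant (the lock handling matches the snippet above; slot_with_commitment is the BlockCommitmentCache API linked earlier, and the retry loop itself is hypothetical):

use solana_sdk::commitment_config::CommitmentLevel;

let (finalized_bank, confirmed_bank, processed_bank) = loop {
    // Read the three slots while holding only the commitment cache lock.
    let (finalized_slot, confirmed_slot, processed_slot) = {
        let cache = block_commitment_cache.read().unwrap();
        (
            cache.slot_with_commitment(CommitmentLevel::Finalized),
            cache.slot_with_commitment(CommitmentLevel::Confirmed),
            cache.slot_with_commitment(CommitmentLevel::Processed),
        )
    };
    // Only then take the bank_forks lock. If a slot was evicted in the meantime,
    // the forks moved on relative to the cache: drop the locks and start over
    // instead of silently falling back to the root.
    let r_bank_forks = bank_forks.read().unwrap();
    match (
        r_bank_forks.get(finalized_slot),
        r_bank_forks.get(confirmed_slot),
        r_bank_forks.get(processed_slot),
    ) {
        (Some(finalized), Some(confirmed), Some(processed)) => {
            break (finalized.clone(), confirmed.clone(), processed.clone());
        }
        // No guaranteed termination, but a retry should almost never be needed.
        _ => continue,
    }
};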

So as a hack for rare situations, I think the current workaround is good enough.

Proper solution (Rant)

I think it’s a design mistake on Solana’s part to put these two things in different locks. It creates all kinds of very subtle problems, and as solana-labs#11078 demonstrates, they are not just hypothetical. To me it looks like BankForks and CommitmentCache should really be merged into one thing, or at the very least they should be in a RwLock<(BankForks, CommitmentCache)>, instead of having their own lock.

(Even nicer of course would be the “private producer + atomic swap” approach that we use in Solido, instead of an RwLock, such that reads can proceed even if a write is in progress, and such that writes don’t need to block on concurrent reads.)
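
For reference, a sketch of that pattern with the arc_swap crate; nothing here is Solana code, and the Snapshot type is made up for illustration:

use std::sync::Arc;
use arc_swap::ArcSwap;

/// Everything a reader needs, published as one immutable unit, so the bank
/// forks and the commitment cache can never disagree with each other.
struct Snapshot {
    bank_forks: BankForks,
    commitment_cache: BlockCommitmentCache,
}

/// The single writer builds a fresh snapshot off to the side and swaps it in
/// atomically; readers never block it, and it never blocks readers.
fn publish(cell: &ArcSwap<Snapshot>, next: Snapshot) {
    cell.store(Arc::new(next));
}

/// Readers get an owned, internally consistent Arc to work with.
fn read_consistent(cell: &ArcSwap<Snapshot>) -> Arc<Snapshot> {
    cell.load_full()
}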

But let’s not go down this rabbit hole, the Solana codebase is the way it is, and we have to deal with it.

Member Author

Thanks for the detailed explanation. Indeed there's contention on these two locks; we should be more careful when calling read() more than once. I changed the guard to limit its scope as you suggested. It would be great if Solana changed this to the way we do it in Solido 😬.

As @ruuda pointed out, there is a deadlock if we acquire a read lock for
both `block_commitment_cache` and `bank_forks`; in @ruuda's words:
- We take the read lock on block_commitment_cache.
- Some other thread takes a write lock on bank_forks.
- We want to take the read lock on bank_forks, but this blocks because
  the other thread holds a write lock.
- The other thread wants to take a write lock on block_commitment_cache,
  but that blocks because we hold the read lock.
- Deadlock!

Instead, we guard the `block_commitment_cache` read lock so we don't
hold two read locks concurrently. This might lead, as @ruuda also
pointed out, to an inconsistency that is known to Solana and already
handled in the code the same way it is handled in `rpc/src/rpc.rs`.

@ruuda ruuda left a comment


:shipit:

@enriquefynn enriquefynn merged commit 41ae0b0 into v1.9.28-c1 Jun 17, 2022
@enriquefynn enriquefynn deleted the prometheus_metrics branch June 17, 2022 14:15
ruuda pushed a commit that referenced this pull request Jul 24, 2024
* Combine builtin and BPF execution cost into programs_execution_cost since VM has started to consume CUs uniformly

* update tests

* apply suggestions from code review