Prometheus metrics #29
Conversation
Looks pretty nice so far! I think this is a really elegant setup: put everything in its own crate, and add a minimally invasive hook in the current RPC service. I think there is a good chance of this being accepted upstream as it touches so little existing code. 👍
rpc/src/rpc_service.rs
Outdated
.body(hyper::Body::from(self.health_check()))
.unwrap()
.into(),
"/prometheus" => hyper::Response::builder()
The standard location for exposing Prometheus metrics is at `/metrics`.

Suggested change:

- "/prometheus" => hyper::Response::builder()
+ "/metrics" => hyper::Response::builder()
prometheus/src/utils.rs
Outdated
@@ -0,0 +1,306 @@
// SPDX-FileCopyrightText: 2022 Chorus One AG
// SPDX-License-Identifier: GPL-3.0
If we put our GPL-licensed code from Solido in here, I expect it would not be accepted upstream. But we own the copyright on this, so we could decide to license it under Apache here.
prometheus/src/bank_metrics.rs
Outdated
use std::{io, sync::Arc, time::SystemTime};

pub fn write_bank_metrics<W: io::Write>(
    at: SystemTime,
We use this `at` in Solido because it proxies metrics: it has an internal polling loop that is not in sync with when Prometheus fetches the metrics. But now that we are building the metrics into Solana, we can get the latest values right at the moment that Prometheus polls us, so there is no need to add the timestamps.
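A rough sketch of what that could look like with the helpers from this PR (assuming the function keeps a `bank` argument as in the PR; only the timestamp plumbing is dropped):

```rust
pub fn write_bank_metrics<W: io::Write>(bank: &Bank, out: &mut W) -> io::Result<()> {
    write_metric(
        out,
        &MetricFamily {
            name: "solana_bank_slot",
            help: "Current Slot",
            type_: "gauge",
            // No explicit timestamp: the value is read at scrape time,
            // so Prometheus assigns the sample timestamp itself.
            metrics: vec![Metric::new(bank.slot())],
        },
    )
}
```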
prometheus/src/bank_metrics.rs
Outdated
name: "solana_bank_slot", | ||
help: "Current Slot", | ||
type_: "gauge", | ||
metrics: vec![Metric::new(bank.slot()).at(at)], |
If I understand correctly, Solana tracks multiple banks for the multiple commitment levels (the finalized one, the confirmed one, etc.). I think it would make sense to expose metrics for all of them with different labels, e.g. `solana_bank_slot{commitment="finalized"}`, `solana_bank_slot{commitment="unconfirmed"}`, etc.
Can we do this at a later point? I can't find the API in the structs; I have to do some digging on which slots to get from the banks.
Exposing all of `{unconfirmed, confirmed, finalized}` separately: yes, we can do that at a later point; one of them is fine for now.
But we do have to know which one it is currently! If we don’t know which bank we get, then we don’t know what the metric means! We should also add the label already, so that there is no ambiguity in the data once we start collecting it.
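For example, with a `with_label` helper like the one used later in this PR, the sample could carry the label directly (a sketch; the label value must match whichever bank we actually read):

```rust
write_metric(
    out,
    &MetricFamily {
        name: "solana_bank_slot",
        help: "Current Slot",
        type_: "gauge",
        // Record which commitment level this bank represents.
        metrics: vec![Metric::new(bank.slot()).with_label("commitment_level", "finalized")],
    },
)?;
```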
prometheus/src/bank_metrics.rs
Outdated
out,
&MetricFamily {
    name: "solana_bank_successful_transaction_count",
    help: "Number of transactions in the block that executed successfully",
So this is only about the most recent block?
Yeah, this is the bank at the highest slot.
Not sure how useful that is, if Prometheus samples every ~15 seconds, it would sample once every ~19 blocks. I guess it could show high-level trends if we average over some time window, but I expect it to be very noisy. A counter would be much nicer, because it doesn’t miss events that happened in between the two sampling events. But if we only sample the bank on demand, then we can’t build a counter, unless Solana already tracks one.
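If Solana does track a cumulative count on the bank, a counter could look roughly like this (a sketch; `bank.transaction_count()` and the `"counter"` type string are assumptions, not verified against this PR's helpers):

```rust
write_metric(
    out,
    &MetricFamily {
        name: "solana_bank_transaction_count_total",
        help: "Total number of transactions processed since genesis",
        type_: "counter",
        // A monotonically increasing counter does not miss events between scrapes.
        metrics: vec![Metric::new(bank.transaction_count())],
    },
)?;
```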
Hmm... that's fair, I'll delete it for now.
prometheus/src/cluster_metrics.rs
Outdated
write_metric(
    out,
    &MetricFamily {
        name: "solana_cluster_identity_balance_total",
name: "solana_cluster_identity_balance_total", | |
name: "solana_cluster_identity_balance_sol", |
The _total
suffix is typically used for counters that count the total number of times something happened since startup. (E.g. SOL spent on votes.) But in this case we have a gauge, it can go up and down.
Also, we should include the unit in the name. See also https://prometheus.io/docs/practices/naming/
Great idea to include this metric by the way!
prometheus/src/lib.rs
Outdated
    bank_forks: &Arc<RwLock<BankForks>>,
    cluster_info: &Arc<ClusterInfo>,
) -> Vec<u8> {
    let current_bank = bank_forks.read().unwrap().working_bank();
Is the `bank_forks` the thing where we can get the banks for the different commitment levels from?
This is the latest one; I couldn't find an API in the struct to get slots at different commitment levels. We can do it at a later time.
prometheus/src/token.rs
Outdated
@@ -0,0 +1,196 @@
use std::{
This is a nice utility from Solido, but it looks like overkill for what we need here. I think just a `struct Lamports(pub u64)` would also suffice.
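A minimal sketch of that newtype (the `to_sol` helper is hypothetical; the conversion assumes the usual 1 SOL = 10^9 lamports):

```rust
/// A balance in lamports; 1 SOL = 1_000_000_000 lamports.
#[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lamports(pub u64);

impl Lamports {
    /// Convert to SOL for metrics that are reported in SOL.
    pub fn to_sol(self) -> f64 {
        self.0 as f64 / 1_000_000_000.0
    }
}
```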
prometheus/src/cluster_metrics.rs
Outdated
name: "solana_cluster_identity_balance_total", | ||
help: "The current node's identity balance", | ||
type_: "gauge", | ||
metrics: vec![Metric::new_sol(identity_balance).at(at)], |
Also, we should add a label to this metric with the account's pubkey. That way there cannot be any ambiguity about which account we are measuring the balance of.

One more idea for a metric to expose (doesn't have to be in this PR): the clock sysvar's current time. If Solana's clock drifts again, we will be able to observe how bad it is.
When referring to our node, we use simply `node` as opposed to `cluster`.
Edit: Ah, I see the spam now, and there seems to be no way to disable it on a repository level aside from that file 😕 Thank you for deleting it.
There is one pubkey label still missing. Also, we should find out what version of the bank we are reading from now and add its commitment level as a label too, even if we only read from one bank right now.
Aside from that this looks good!
prometheus/src/bank_metrics.rs
Outdated
out,
&MetricFamily {
    name: "solana_block_timestamp_seconds",
    help: "The block's timestamp",
help: "The block's timestamp", | |
help: "The block's UNIX timestamp, in seconds since epoch, UTC", |
prometheus/src/cluster_metrics.rs
Outdated
name: "solana_node_identity_balance_sol", | ||
help: "The node's current identity balance", | ||
type_: "gauge", | ||
metrics: vec![Metric::new_sol(identity_balance)], |
We should add the identity pubkey as label here, both to remove all ambiguity, but also because we might want to monitor multiple accounts, so our validators can monitor each other.
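Something like this, for instance (a sketch; `cluster_info.id()` as the identity pubkey and the exact `with_label` signature are assumptions):

```rust
let identity_pubkey = cluster_info.id();
write_metric(
    out,
    &MetricFamily {
        name: "solana_node_identity_balance_sol",
        help: "The node's current identity balance",
        type_: "gauge",
        // Label the sample with the account, so multiple accounts can be monitored.
        metrics: vec![
            Metric::new_sol(identity_balance).with_label("identity_account", identity_pubkey.to_string()),
        ],
    },
)?;
```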
I have one more comment, LGTM aside from this. Also, are you sure it’s the finalized bank?
prometheus/src/bank_metrics.rs
Outdated
name: "solana_block_slot", | ||
help: "Current Slot", | ||
type_: "gauge", | ||
metrics: vec![Metric::new(clock.slot).with_label("commitment_level", "finalized")], |
We should add the commitment level label to all the metrics we get from the bank.
Hmm... I think it's not finalized, but it's difficult to see from the code; my guess is that it's the working bank.
Added a struct so we can get different levels of block commitments; this was the least invasive/clean thing that I could think of. We could modify somewhat how the RPC service works so it's a bit cleaner, but I think this solution is appropriate. In this version we can get only the finalized bank for the metrics.
prometheus/src/bank_metrics.rs
Outdated
    banks_with_commitments: &BanksWithCommitments,
    out: &mut W,
) -> io::Result<()> {
    let clock_finalized = banks_with_commitments.finalized_bank.clock();
Now that this is here, it should be really easy to generalize to all commitment levels! Something like
```rust
// (A method in an impl block for BanksWithCommitments.)
fn for_each_commitment<F: FnMut(&Bank) -> Metric>(
    self: &BanksWithCommitments,
    mut get: F,
) -> Vec<Metric> {
    vec![
        get(&*self.finalized_bank).with_label("commitment_level", "finalized".to_owned()),
        get(&*self.confirmed_bank).with_label("commitment_level", "confirmed".to_owned()),
        get(&*self.processed_bank).with_label("commitment_level", "processed".to_owned()),
    ]
}
```
and then we can do
```rust
write_metric(
    out,
    &MetricFamily {
        name: "solana_block_slot",
        help: "Finalized Slot",
        type_: "gauge",
        metrics: banks.for_each_commitment(|bank|
            Metric::new(bank.clock().slot)
        ),
    },
)?;
```
Then every time we'd do the `.clock()` call. Probably the compiler is smart enough to optimize this.
Thanks for diving into that, I now understand a lot more about how these bank forks work, and I am confident that we are reading the right one. Now that we have the forks, I believe it shouldn’t be too difficult to generalize all metrics over all forks.
We kind of have the metrics inside-out ... per metric, we would iterate over all bank forks. If there is some intermediate struct that we access and produce multiple metrics from, then that is a bit annoying, because we have to do it once per metric. We could also switch the loops: have one big loop over the bank forks, and generate all metrics inside. One downside of this is that it's not really how our current metrics library works. Either we collect the `Metric`s in vecs and format them at the end (it sounds cumbersome, but maybe it will actually be fine?), or we just format every metric multiple times. The downside of the latter is that the `# TYPE` and `# HELP` lines are repeated, and it will be a bit harder for humans to read the output that way. But Prometheus shouldn't care, so it is a valid option.
rpc/src/rpc_service.rs
Outdated
.unwrap()
.into(),
"/metrics" => {
    let rpc_processor = self.rpc_processor.as_ref().unwrap();
So I looked into the code for `BankForks` a bit, and it looks like `BankForks` contains almost everything we need! We can do `.get(slot)` to get the bank for a particular slot. It somehow feels wrong to pass in `JsonRpcRequestProcessor` in the middleware that goes in front of the `JsonRpcRequestProcessor` ...

There is `BlockCommitmentCache` in `JsonRpcRequestService`. It looks like that stores the number of confirmations per block, and it can give us the slot for a certain commitment level. It is available when the `RpcRequestMiddleware` is constructed, so we can pass in that commitment cache, instead of a full `JsonRpcRequestService`. From there we can get the slots:
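Something along these lines (a sketch; `slot_with_commitment` is the method used further down in this thread):

```rust
use solana_sdk::commitment_config::CommitmentLevel;

let cache = block_commitment_cache.read().unwrap();
let finalized_slot = cache.slot_with_commitment(CommitmentLevel::Finalized);
let confirmed_slot = cache.slot_with_commitment(CommitmentLevel::Confirmed);
let processed_slot = cache.slot_with_commitment(CommitmentLevel::Processed);
```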
For the block height metrics, we wouldn’t even need to get that slot’s bank and read its clock, we would get the slot directly.
Interesting aside: it looks like Solana's `BankForks` can only store a single bank per slot. So if there is a fork, there is no way for it to track multiple forks simultaneously. It would have to throw out all progress on one fork if it decides to switch to another.
rpc/src/rpc.rs
Outdated
@@ -186,7 +186,7 @@ impl Metadata for JsonRpcRequestProcessor {}

impl JsonRpcRequestProcessor {
    #[allow(deprecated)]
-   fn bank(&self, commitment: Option<CommitmentConfig>) -> Arc<Bank> {
+   pub fn bank(&self, commitment: Option<CommitmentConfig>) -> Arc<Bank> {
We wouldn’t need to touch this one if we go with the commitment cache.
    cluster_info: &Arc<ClusterInfo>,
) -> Vec<u8> {
    let current_bank = bank_forks.read().unwrap().working_bank();
    // There are 3 levels of commitment for a bank:
    // - finalized: most recent block *confirmed* by supermajority of the
Validators don’t “confirm” blocks, they only vote on blocks.
Finalized means that there are 32 descendants of the block that a supermajority of the cluster have voted on. (I might be off by one here.) I now also understand what all this “rooted” and “root slot” business is: they call the highest slot that is finalized the “root”. I suppose because it’s the root of the bank forks, other blocks on top could in theory be replaced by some other fork, but for the “root” the timeout is so long that we don’t consider it revertable.
Well... this is the copy-paste from their docs 😅: https://docs.solana.com/developing/clients/jsonrpc-api#configuring-state-commitment
> This is the copy-paste from their docs

😬 Never mind me then.

But those docs do say something more:

> the node will query the most recent block *confirmed* by supermajority of the cluster as having reached maximum lockout, meaning the cluster has recognized this block as finalized

(emphasis mine)
I think the use of the word “confirmed” in that sentence is a bit misleading ... from the point of view of the node we are querying, it knows that a supermajority has voted on 32 descendants of the block (and that’s a lower bound on what happened in the network), which means the block has maximum lockout, i.e. a supermajority pledged to not vote for conflicting forks of this block for basically forever.
Instead of propagating `rpc_processor`, use the `block_commitment_cache` structure to get different bank confirmation levels
Hey @ruuda, thanks a lot for the review, it's really improving the way this looks. I added a few more suggestions. Instead of propagating the `rpc_processor`, I now use the `block_commitment_cache` structure to get the banks at different confirmation levels.
LGTM, with one small (but important) change needed to avoid a deadlock.
    bank_forks: &Arc<RwLock<BankForks>>,
    block_commitment_cache: &Arc<RwLock<BlockCommitmentCache>>,
) -> Self {
    let block_commitment_cache = block_commitment_cache.read().unwrap();
This can lead to a deadlock
This creates a lock guard that lives until the end of the scope, which means we hold the read lock on `block_commitment_cache` for the entire duration of the function, even though we no longer need it after we extract the slots.

Additionally, we take the `r_bank_forks` read lock later in this function, so then we hold two locks. Holding two locks simultaneously is dangerous if you don’t coordinate it very well among all other places that hold the same locks. The following scenario could happen:
- We take the read lock on `block_commitment_cache`.
- Some other thread takes a write lock on `bank_forks`.
- We want to take the read lock on `bank_forks`, but this blocks because the other thread holds a write lock.
- The other thread wants to take a write lock on `block_commitment_cache`, but that blocks because we hold the read lock.
- Deadlock!
I don’t want to audit the entire Solana codebase for all usages of those two locks to confirm that they are always acquired in the same order, and even if they are right now, there is no guarantee that nobody will mess it up in the future.
Fortunately, in this case we can easily avoid holding the locks at the same time:
```rust
let (finalized_slot, confirmed_slot, processed_slot) = {
    let block_commitment_cache = block_commitment_cache.read().unwrap();
    (
        block_commitment_cache.slot_with_commitment(solana_sdk::commitment_config::CommitmentLevel::Finalized),
        block_commitment_cache.slot_with_commitment(solana_sdk::commitment_config::CommitmentLevel::Confirmed),
        block_commitment_cache.slot_with_commitment(solana_sdk::commitment_config::CommitmentLevel::Processed),
    )
};
```
Now the lock guard will go out of scope at the `}`, before we acquire the lock on `bank_forks`.
Fixing the deadlock may lead to inconsistencies
So now we can avoid the deadlock, but there is another problem. The bank forks and commitment cache really belong together, the commitment cache refers to slots that the bank holds. Assuming the same order as above (we read the commitment cache, then the bank forks; some other thread writes the bank forks, then updates the cache), I can imagine the following would happen:
- We read the commitment cache and find slot x is the finalized slot.
- The other thread just received a new confirmed block, locks the bank forks, and puts the new block in. The new finalized slot is slot x + 1, and the bank for slot x gets evicted.
- The other thread then updates the commitment cache to reflect the new slots, so that the commitment cache would return slot x + 1 for level finalized.
- We try to get the bank for slot x, but by now that slot has been evicted from the bank forks.
Now I haven’t looked into detail in how the bank forks work, and when things get evicted, and how the commitment config and bank forks are updated, so this is speculation on my part. But I wouldn’t be surprised at all if this scenario is the cause of solana-labs#11078 linked below. (Which was closed because it was “fixed” by this workaround that we also apply here.)
With this insight, there is an alternative workaround: instead of falling back to the root slot (which is the same as the one with finalized commitment level), if we try to get a slot but it’s not there, that means the bank forks have changed with respect to the commitment cache since we read it, and we could start over. We read both right after one another, so this case should be rare, and if there is, say, even a 10% probability that we need a restart, then the probability that we still haven’t succeeded after n attempts goes to 0 rapidly as n grows ... assuming that these events are really independent; in practice there is no guarantee that this process ever terminates.
So as a hack for rare situations, I think the current workaround is good enough.
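For illustration, the retry alternative might look roughly like this (a sketch only; the PR keeps the fall-back-to-root workaround, and `get` returning `None` for an evicted slot is an assumption about `BankForks`):

```rust
use solana_sdk::commitment_config::CommitmentLevel;

// Keep re-reading the cache until the bank for that slot is still present.
let finalized_bank = loop {
    let finalized_slot = block_commitment_cache
        .read()
        .unwrap()
        .slot_with_commitment(CommitmentLevel::Finalized);
    // If the slot was evicted between the two reads, the forks moved on: start over.
    if let Some(bank) = bank_forks.read().unwrap().get(finalized_slot) {
        break bank.clone();
    }
};
```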
Proper solution (Rant)
I think it’s a design mistake on Solana’s part to put these two things in different locks. It creates all kinds of very subtle problems, and as solana-labs#11078 demonstrates, they are not just hypothetical. To me it looks like `BankForks` and `CommitmentCache` should really be merged into one thing, or at the very least they should be in a `RwLock<(BankForks, CommitmentCache)>`, instead of each having their own lock.
(Even nicer of course would be the “private producer + atomic swap” approach that we use in Solido, instead of an `RwLock`, such that reads can proceed even if a write is in progress, and such that writes don’t need to block on concurrent reads.)
But let’s not go down this rabbit hole, the Solana codebase is the way it is, and we have to deal with it.
Thanks for the detailed explanation. Indeed there's contention on these two locks; we should be more careful when calling `read()` more than once. I changed the guard to limit its scope as you said. It would be great if Solana changed this to the way we do it in Solido 😬.
As @ruuda pointed out, there is a deadlock if we acquire a read lock for both `block_commitment_cache` and `bank_forks`. In @ruuda's words:

- We take the read lock on block_commitment_cache.
- Some other thread takes a write lock on bank_forks.
- We want to take the read lock on bank_forks, but this blocks because the other thread holds a write lock.
- The other thread wants to take a write lock on block_commitment_cache, but that blocks because we hold the read lock.
- Deadlock!

Instead, we guard the `block_commitment_cache` read lock so we don't hold two read locks concurrently. This might lead, as @ruuda also pointed out, to an inconsistency that's known to Solana and dealt with already in the code, the same way it's handled in `rpc/src/rpc.rs`.
* Combine builtin and BPF execution cost into programs_execution_cost since VM has started to consume CUs uniformly
* update tests
* apply suggestions from code review
Add Prometheus metrics to Solana, served on the RPC URL and port at the `/prometheus` path.