Prometheus metrics #29
Conversation
Looks pretty nice so far! I think this is a really elegant setup: put everything in its own crate, and add a minimally invasive hook in the current RPC service. I think there is a good chance of this being accepted upstream as it touches so little existing code. 👍
rpc/src/rpc_service.rs
Outdated
.body(hyper::Body::from(self.health_check()))
.unwrap()
.into(),
"/prometheus" => hyper::Response::builder()
The standard location for exposing Prometheus metrics is at `/metrics`.

Suggested change:

- "/prometheus" => hyper::Response::builder()
+ "/metrics" => hyper::Response::builder()
prometheus/src/utils.rs
Outdated
@@ -0,0 +1,306 @@
// SPDX-FileCopyrightText: 2022 Chorus One AG
// SPDX-License-Identifier: GPL-3.0
If we put our GPL-licensed code from Solido in here, I expect it would not be accepted upstream. But we own the copyright on this, so we could decide to license it under Apache here.
prometheus/src/bank_metrics.rs
Outdated
use std::{io, sync::Arc, time::SystemTime};

pub fn write_bank_metrics<W: io::Write>(
    at: SystemTime,
We use this `at` in Solido because it proxies metrics: it has an internal polling loop that is not in sync with when Prometheus fetches the metrics. But now that we are building the metrics into Solana, we can get the latest values right at the moment that Prometheus polls us, so there is no need to add the timestamps.
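A rough sketch of what that could look like with the helpers from this PR (assuming the function keeps a `bank` argument as in the PR; only the timestamp plumbing is dropped):

```rust
pub fn write_bank_metrics<W: io::Write>(bank: &Bank, out: &mut W) -> io::Result<()> {
    write_metric(
        out,
        &MetricFamily {
            name: "solana_bank_slot",
            help: "Current Slot",
            type_: "gauge",
            // No explicit timestamp: the value is read at scrape time,
            // so Prometheus assigns the sample timestamp itself.
            metrics: vec![Metric::new(bank.slot())],
        },
    )
}
```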
prometheus/src/bank_metrics.rs
Outdated
name: "solana_bank_slot", | ||
help: "Current Slot", | ||
type_: "gauge", | ||
metrics: vec![Metric::new(bank.slot()).at(at)], |
If I understand correctly, Solana tracks multiple banks for the multiple commitment levels (the finalized one, the confirmed one, etc.). I think it would make sense to expose metrics for all of them with different labels, e.g. `solana_bank_slot{commitment="finalized"}`, `solana_bank_slot{commitment="unconfirmed"}`, etc.
Can we do this at a later point? I can't find the API in the structs; I have to do some digging on which slots to get from the banks.
Exposing all of `{unconfirmed, confirmed, finalized}` separately: yes, we can do that at a later point; one of them is fine for now.
But we do have to know which one it is currently! If we don’t know which bank we get, then we don’t know what the metric means! We should also add the label already, so that there is no ambiguity in the data once we start collecting it.
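For example, with a `with_label` helper like the one used later in this PR, the sample could carry the label directly (a sketch; the label value must match whichever bank we actually read):

```rust
write_metric(
    out,
    &MetricFamily {
        name: "solana_bank_slot",
        help: "Current Slot",
        type_: "gauge",
        // Record which commitment level this bank represents.
        metrics: vec![Metric::new(bank.slot()).with_label("commitment_level", "finalized")],
    },
)?;
```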
prometheus/src/bank_metrics.rs
Outdated
out,
&MetricFamily {
    name: "solana_bank_successful_transaction_count",
    help: "Number of transactions in the block that executed successfully",
So this is only about the most recent block?
Yeah, this is the bank at the highest slot.
Not sure how useful that is, if Prometheus samples every ~15 seconds, it would sample once every ~19 blocks. I guess it could show high-level trends if we average over some time window, but I expect it to be very noisy. A counter would be much nicer, because it doesn’t miss events that happened in between the two sampling events. But if we only sample the bank on demand, then we can’t build a counter, unless Solana already tracks one.
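If Solana does track a cumulative count on the bank, a counter could look roughly like this (a sketch; `bank.transaction_count()` and the `"counter"` type string are assumptions, not verified against this PR's helpers):

```rust
write_metric(
    out,
    &MetricFamily {
        name: "solana_bank_transaction_count_total",
        help: "Total number of transactions processed since genesis",
        type_: "counter",
        // A monotonically increasing counter does not miss events between scrapes.
        metrics: vec![Metric::new(bank.transaction_count())],
    },
)?;
```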
Hmm... that's fair, I'll delete it for now.
prometheus/src/cluster_metrics.rs
Outdated
write_metric(
    out,
    &MetricFamily {
        name: "solana_cluster_identity_balance_total",
name: "solana_cluster_identity_balance_total", | |
name: "solana_cluster_identity_balance_sol", |
The _total
suffix is typically used for counters that count the total number of times something happened since startup. (E.g. SOL spent on votes.) But in this case we have a gauge, it can go up and down.
Also, we should include the unit in the name. See also https://prometheus.io/docs/practices/naming/
Great idea to include this metric by the way!
prometheus/src/lib.rs
Outdated
    bank_forks: &Arc<RwLock<BankForks>>,
    cluster_info: &Arc<ClusterInfo>,
) -> Vec<u8> {
    let current_bank = bank_forks.read().unwrap().working_bank();
Is the `bank_forks` the thing where we can get the banks for the different commitment levels from?
This is the latest one; I couldn't find an API in the struct to get slots at different commitment levels. We can do it at a later time.
prometheus/src/token.rs
Outdated
@@ -0,0 +1,196 @@
use std::{
This is a nice utility from Solido, but it looks like overkill for what we need here. I think just a `struct Lamports(pub u64)` would also suffice.
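A minimal sketch of that newtype (the `to_sol` helper is hypothetical; the conversion assumes the usual 1 SOL = 10^9 lamports):

```rust
/// A balance in lamports; 1 SOL = 1_000_000_000 lamports.
#[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lamports(pub u64);

impl Lamports {
    /// Convert to SOL for metrics that are reported in SOL.
    pub fn to_sol(self) -> f64 {
        self.0 as f64 / 1_000_000_000.0
    }
}
```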
prometheus/src/cluster_metrics.rs
Outdated
name: "solana_cluster_identity_balance_total", | ||
help: "The current node's identity balance", | ||
type_: "gauge", | ||
metrics: vec![Metric::new_sol(identity_balance).at(at)], |
Also, we should add a label to this metric with the account's pubkey. That way there cannot be any ambiguity about which account we are measuring the balance of.

One more idea for a metric to expose (doesn't have to be in this PR): the clock sysvar's current time. If Solana's clock drifts again, we will be able to observe how bad it is.
When referring to our node, we use simply `node` as opposed to `cluster`.
Edit: Ah, I see the spam now, and there seems to be no way to disable it on a repository level aside from that file 😕 Thank you for deleting it.
There is one pubkey label still missing. Also, we should find out what version of the bank we are reading from now and add its commitment level as a label too, even if we only read from one bank right now.
Aside from that this looks good!
prometheus/src/bank_metrics.rs
Outdated
out,
&MetricFamily {
    name: "solana_block_timestamp_seconds",
    help: "The block's timestamp",
help: "The block's timestamp", | |
help: "The block's UNIX timestamp, in seconds since epoch, UTC", |
prometheus/src/cluster_metrics.rs
Outdated
name: "solana_node_identity_balance_sol", | ||
help: "The node's current identity balance", | ||
type_: "gauge", | ||
metrics: vec![Metric::new_sol(identity_balance)], |
We should add the identity pubkey as label here, both to remove all ambiguity, but also because we might want to monitor multiple accounts, so our validators can monitor each other.
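Something like this, for instance (a sketch; `cluster_info.id()` as the identity pubkey and the exact `with_label` signature are assumptions):

```rust
let identity_pubkey = cluster_info.id();
write_metric(
    out,
    &MetricFamily {
        name: "solana_node_identity_balance_sol",
        help: "The node's current identity balance",
        type_: "gauge",
        // Label the sample with the account, so multiple accounts can be monitored.
        metrics: vec![
            Metric::new_sol(identity_balance).with_label("identity_account", identity_pubkey.to_string()),
        ],
    },
)?;
```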
I have one more comment, LGTM aside from this. Also, are you sure it’s the finalized bank?
prometheus/src/bank_metrics.rs
Outdated
name: "solana_block_slot", | ||
help: "Current Slot", | ||
type_: "gauge", | ||
metrics: vec![Metric::new(clock.slot).with_label("commitment_level", "finalized")], |
We should add the commitment level label to all the metrics we get from the bank.
Hmm... I think it's not finalized, but it's difficult to see from the code; my guess is that it's the working bank.
Added a struct so we can get different levels of block commitments; this was the least invasive/clean thing that I could think of. We could modify somewhat how the RPC service works so it's a bit cleaner, but I think this solution is appropriate. In this version we can get only the finalized bank for the metrics.
prometheus/src/bank_metrics.rs
Outdated
    banks_with_commitments: &BanksWithCommitments,
    out: &mut W,
) -> io::Result<()> {
    let clock_finalized = banks_with_commitments.finalized_bank.clock();
Now that this is here, it should be really easy to generalize to all commitment levels! Something like
```rust
// (A method in an impl block for BanksWithCommitments.)
fn for_each_commitment<F: FnMut(&Bank) -> Metric>(
    self: &BanksWithCommitments,
    mut get: F,
) -> Vec<Metric> {
    vec![
        get(&*self.finalized_bank).with_label("commitment_level", "finalized".to_owned()),
        get(&*self.confirmed_bank).with_label("commitment_level", "confirmed".to_owned()),
        get(&*self.processed_bank).with_label("commitment_level", "processed".to_owned()),
    ]
}
```
and then we can do
```rust
write_metric(
    out,
    &MetricFamily {
        name: "solana_block_slot",
        help: "Finalized Slot",
        type_: "gauge",
        metrics: banks.for_each_commitment(|bank|
            Metric::new(bank.clock().slot)
        ),
    },
)?;
```
Then every time we'd do the `.clock()` call. Probably the compiler is smart enough to optimize this.
Thanks for diving into that, I now understand a lot more about how these bank forks work, and I am confident that we are reading the right one. Now that we have the forks, I believe it shouldn’t be too difficult to generalize all metrics over all forks.
We kind of have the metrics inside-out ... per metric, we would iterate over all bank forks. If there is some intermediate struct that we access and produce multiple metrics from, then that is a bit annoying, because we have to do it once per metric. We could also switch the loops: have one big loop over the bank forks, and generate all metrics inside. One downside of this is that it's not really how our current metrics library works. Either we collect the `Metric`s in vecs and format them at the end (it sounds cumbersome, but maybe it will actually be fine?), or we just format every metric multiple times. The downside of the latter is that the `# TYPE` and `# HELP` lines are repeated, and it will be a bit harder for humans to read the output that way. But Prometheus shouldn't care, so it is a valid option.
rpc/src/rpc_service.rs
Outdated
.unwrap()
.into(),
"/metrics" => {
    let rpc_processor = self.rpc_processor.as_ref().unwrap();
So I looked into the code for `BankForks` a bit, and it looks like `BankForks` contains almost everything we need! We can do `.get(slot)` to get the bank for a particular slot. It somehow feels wrong to pass in `JsonRpcRequestProcessor` in the middleware that goes in front of the `JsonRpcRequestProcessor` ...

There is `BlockCommitmentCache` in `JsonRpcRequestService`. It looks like that stores the number of confirmations per block, and it can give us the slot for a certain commitment level. It is available when the `RpcRequestMiddleware` is constructed, so we can pass in that commitment cache, instead of a full `JsonRpcRequestService`. From there we can get the slots:
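Something along these lines (a sketch; `slot_with_commitment` is the method used further down in this thread):

```rust
use solana_sdk::commitment_config::CommitmentLevel;

let cache = block_commitment_cache.read().unwrap();
let finalized_slot = cache.slot_with_commitment(CommitmentLevel::Finalized);
let confirmed_slot = cache.slot_with_commitment(CommitmentLevel::Confirmed);
let processed_slot = cache.slot_with_commitment(CommitmentLevel::Processed);
```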
For the block height metrics, we wouldn’t even need to get that slot’s bank and read its clock, we would get the slot directly.
Interesting aside: it looks like Solana's `BankForks` can only store a single bank per slot. So if there is a fork, there is no way for it to track multiple forks simultaneously. It would have to throw out all progress on one fork if it decides to switch to another.
rpc/src/rpc.rs
Outdated
@@ -186,7 +186,7 @@ impl Metadata for JsonRpcRequestProcessor {}

impl JsonRpcRequestProcessor {
    #[allow(deprecated)]
-   fn bank(&self, commitment: Option<CommitmentConfig>) -> Arc<Bank> {
+   pub fn bank(&self, commitment: Option<CommitmentConfig>) -> Arc<Bank> {
We wouldn’t need to touch this one if we go with the commitment cache.
    cluster_info: &Arc<ClusterInfo>,
) -> Vec<u8> {
    let current_bank = bank_forks.read().unwrap().working_bank();
    // There are 3 levels of commitment for a bank:
    // - finalized: most recent block *confirmed* by supermajority of the
Validators don’t “confirm” blocks, they only vote on blocks.
Finalized means that there are 32 descendants of the block that a supermajority of the cluster have voted on. (I might be off by one here.) I now also understand what all this “rooted” and “root slot” business is: they call the highest slot that is finalized the “root”. I suppose because it’s the root of the bank forks, other blocks on top could in theory be replaced by some other fork, but for the “root” the timeout is so long that we don’t consider it revertable.
Well... this is the copy-paste from their docs 😅: https://docs.solana.com/developing/clients/jsonrpc-api#configuring-state-commitment
> This is the copy-paste from their docs

😬 Never mind me then.

But those docs do say something more:

> the node will query the most recent block *confirmed* by supermajority of the cluster as having reached maximum lockout, meaning the cluster has recognized this block as finalized

(emphasis mine)
I think the use of the word “confirmed” in that sentence is a bit misleading ... from the point of view of the node we are querying, it knows that a supermajority has voted on 32 descendants of the block (and that’s a lower bound on what happened in the network), which means the block has maximum lockout, i.e. a supermajority pledged to not vote for conflicting forks of this block for basically forever.
Instead of propagating `rpc_processor`, use the `block_commitment_cache` structure to get different bank confirmation levels
Hey @ruuda, thanks a lot for the review, it's really improving the way this looks. I added a few more suggestions. Instead of propagating the `rpc_processor`, I now use the `block_commitment_cache` structure to get the banks at different confirmation levels.
LGTM, with one small (but important) change needed to avoid a deadlock.
    bank_forks: &Arc<RwLock<BankForks>>,
    block_commitment_cache: &Arc<RwLock<BlockCommitmentCache>>,
) -> Self {
    let block_commitment_cache = block_commitment_cache.read().unwrap();
This can lead to a deadlock
This creates a lock guard that lives until the end of the scope, which means we hold the read lock on `block_commitment_cache` for the entire duration of the function, even though we no longer need it after we extract the slots.

Additionally, we take the `r_bank_forks` read lock later in this function, so then we hold two locks. Holding two locks simultaneously is dangerous if you don’t coordinate it very well among all other places that hold the same locks. The following scenario could happen:
- We take the read lock on `block_commitment_cache`.
- Some other thread takes a write lock on `bank_forks`.
- We want to take the read lock on `bank_forks`, but this blocks because the other thread holds a write lock.
- The other thread wants to take a write lock on `block_commitment_cache`, but that blocks because we hold the read lock.
- Deadlock!
I don’t want to audit the entire Solana codebase for all usages of those two locks to confirm that they are always acquired in the same order, and even if they are right now, there is no guarantee that nobody will mess it up in the future.
Fortunately, in this case we can easily avoid holding the locks at the same time:
```rust
let (finalized_slot, confirmed_slot, processed_slot) = {
    let block_commitment_cache = block_commitment_cache.read().unwrap();
    (
        block_commitment_cache.slot_with_commitment(solana_sdk::commitment_config::CommitmentLevel::Finalized),
        block_commitment_cache.slot_with_commitment(solana_sdk::commitment_config::CommitmentLevel::Confirmed),
        block_commitment_cache.slot_with_commitment(solana_sdk::commitment_config::CommitmentLevel::Processed),
    )
};
```
Now the lock guard will go out of scope at the `}`, before we acquire the lock on `bank_forks`.
Fixing the deadlock may lead to inconsistencies
So now we can avoid the deadlock, but there is another problem. The bank forks and commitment cache really belong together, the commitment cache refers to slots that the bank holds. Assuming the same order as above (we read the commitment cache, then the bank forks; some other thread writes the bank forks, then updates the cache), I can imagine the following would happen:
- We read the commitment cache and find slot x is the finalized slot.
- The other thread just received a new confirmed block, locks the bank forks, and puts the new block in. The new finalized slot is slot x + 1, and the bank for slot x gets evicted.
- The other thread then updates the commitment cache to reflect the new slots, so that the commitment cache would return slot x + 1 for level finalized.
- We try to get the bank for slot x, but by now that slot has been evicted from the bank forks.
Now I haven’t looked into detail in how the bank forks work, and when things get evicted, and how the commitment config and bank forks are updated, so this is speculation on my part. But I wouldn’t be surprised at all if this scenario is the cause of solana-labs#11078 linked below. (Which was closed because it was “fixed” by this workaround that we also apply here.)
With this insight, there is an alternative workaround: instead of falling back to the root slot (which is the same as the one with finalized commitment level), if we try to get a slot but it’s not there, that means the bank forks have changed with respect to the commitment cache since we read it, and we could start over. We read both right after one another, so this case should be rare, and if there is, say, even a 10% probability that we need a restart, then the probability that we still haven’t succeeded after n attempts goes to 0 rapidly as n grows ... assuming that these events are really independent; in practice there is no guarantee that this process ever terminates.
So as a hack for rare situations, I think the current workaround is good enough.
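For illustration, the retry alternative might look roughly like this (a sketch only; the PR keeps the fall-back-to-root workaround, and `get` returning `None` for an evicted slot is an assumption about `BankForks`):

```rust
use solana_sdk::commitment_config::CommitmentLevel;

// Keep re-reading the cache until the bank for that slot is still present.
let finalized_bank = loop {
    let finalized_slot = block_commitment_cache
        .read()
        .unwrap()
        .slot_with_commitment(CommitmentLevel::Finalized);
    // If the slot was evicted between the two reads, the forks moved on: start over.
    if let Some(bank) = bank_forks.read().unwrap().get(finalized_slot) {
        break bank.clone();
    }
};
```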
Proper solution (Rant)
I think it’s a design mistake on Solana’s part to put these two things in different locks. It creates all kinds of very subtle problems, and as solana-labs#11078 demonstrates, they are not just hypothetical. To me it looks like `BankForks` and `CommitmentCache` should really be merged into one thing, or at the very least they should be in a `RwLock<(BankForks, CommitmentCache)>`, instead of each having their own lock.
(Even nicer of course would be the “private producer + atomic swap” approach that we use in Solido, instead of an `RwLock`, such that reads can proceed even if a write is in progress, and such that writes don’t need to block on concurrent reads.)
But let’s not go down this rabbit hole, the Solana codebase is the way it is, and we have to deal with it.
Thanks for the detailed explanation. Indeed there's contention on these two locks; we should be more careful when calling `read()` more than once. I changed the guard to limit its scope as you said. It would be great if Solana changed this to the way we do it in Solido 😬.
As @ruuda pointed out, there is a deadlock if we acquire a read lock for both `block_commitment_cache` and `bank_forks`. In @ruuda's words:

- We take the read lock on block_commitment_cache.
- Some other thread takes a write lock on bank_forks.
- We want to take the read lock on bank_forks, but this blocks because the other thread holds a write lock.
- The other thread wants to take a write lock on block_commitment_cache, but that blocks because we hold the read lock.
- Deadlock!

Instead, we guard the `block_commitment_cache` read lock so we don't hold two read locks concurrently. This might lead, as @ruuda also pointed out, to an inconsistency that's known to Solana and dealt with already in the code, the same way it's handled in `rpc/src/rpc.rs`.
* Combine builtin and BPF execution cost into programs_execution_cost since VM has started to consume CUs uniformly
* update tests
* apply suggestions from code review
Add Prometheus metrics to Solana, served on the RPC URL and port at the `/prometheus` path.