Prometheus metrics #29
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
[package] | ||
name = "solana-prometheus" | ||
version = "1.9.28" | ||
description = "Solana Prometheus" | ||
authors = ["ChorusOne <[email protected]>"] | ||
repository = "https://github.com/ChorusOne/solana" | ||
license = "Apache-2.0" | ||
edition = "2021" | ||
|
||
[dependencies] | ||
jsonrpc-http-server = "18.0.0" | ||
solana-gossip = { path = "../gossip", version = "=1.9.28" } | ||
solana-runtime = { path = "../runtime", version = "=1.9.28" } | ||
solana-sdk = { path = "../sdk", version = "=1.9.28" } | ||
|
||
[lib] | ||
crate-type = ["lib"] | ||
name = "solana_prometheus" | ||
|
||
[package.metadata.docs.rs] | ||
targets = ["x86_64-unknown-linux-gnu"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
use solana_runtime::bank::Bank; | ||
|
||
use crate::utils::{write_metric, Metric, MetricFamily}; | ||
use std::{io, sync::Arc, time::SystemTime}; | ||
|
||
pub fn write_bank_metrics<W: io::Write>( | ||
at: SystemTime, | ||
bank: &Arc<Bank>, | ||
out: &mut W, | ||
) -> io::Result<()> { | ||
write_metric( | ||
out, | ||
&MetricFamily { | ||
name: "solana_bank_slot", | ||
help: "Current Slot", | ||
type_: "gauge", | ||
metrics: vec![Metric::new(bank.slot()).at(at)], | ||
Review comment: If I understand correctly, Solana tracks multiple banks for the multiple commitment levels (the finalized one, the confirmed one, etc.). I think it would make sense to expose metrics for all of them with different labels, e.g. [...]
Reply: Can we do this at a later point? I can't find the API in the structs; I have to do some digging on which slots to get from banks.
Review comment: Exposing all of [...] But we do have to know which one it is currently! If we don’t know which bank we get, then we don’t know what the metric means! We should also add the label already, so that there is no ambiguity in the data once we start collecting it. |
||
}, | ||
)?; | ||
write_metric( | ||
out, | ||
&MetricFamily { | ||
name: "solana_bank_epoch", | ||
help: "Current Epoch", | ||
type_: "gauge", | ||
metrics: vec![Metric::new(bank.epoch()).at(at)], | ||
}, | ||
)?; | ||
write_metric( | ||
out, | ||
&MetricFamily { | ||
name: "solana_bank_successful_transaction_count", | ||
help: "Number of transactions in the block that executed successfully", | ||
Review comment: So this is only about the most recent block?
Reply: Yeah, this is the bank at the highest slot.
Review comment: Not sure how useful that is. If Prometheus samples every ~15 seconds, it would sample once every ~19 blocks. I guess it could show high-level trends if we average over some time window, but I expect it to be very noisy. A counter would be much nicer, because it doesn’t miss events that happened in between the two sampling events. But if we only sample the bank on demand, then we can’t build a counter, unless Solana already tracks one.
Reply: Hmm, that's fair, I'll delete it for now. |
||
type_: "gauge", | ||
metrics: vec![Metric::new(bank.transaction_count()).at(at)], | ||
}, | ||
)?; | ||
write_metric( | ||
out, | ||
&MetricFamily { | ||
name: "solana_bank_error_transaction_count", | ||
help: "Number of transactions in the block that executed with error", | ||
type_: "gauge", | ||
metrics: vec![Metric::new(bank.transaction_error_count()).at(at)], | ||
}, | ||
)?; | ||
|
||
Ok(()) | ||
} |
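The thread above argues that a gauge holding only the latest block's count loses everything between two scrapes, while a counter does not. The following is a minimal sketch of that argument, not code from this PR: the function names and the per-block numbers are made up for illustration.

```rust
/// What a scraper sees if the node exposes a gauge that holds only the
/// latest block's transaction count and is scraped once every `every` blocks.
fn gauge_samples(blocks: &[u64], every: usize) -> Vec<u64> {
    assert!(every > 0, "scrape interval must be at least one block");
    blocks.iter().copied().skip(every - 1).step_by(every).collect()
}

/// What a scraper sees if the node exposes a monotonically increasing
/// counter (the running total over all blocks) at the same scrape interval.
fn counter_samples(blocks: &[u64], every: usize) -> Vec<u64> {
    assert!(every > 0, "scrape interval must be at least one block");
    let mut total = 0u64;
    let mut cumulative = Vec::with_capacity(blocks.len());
    for count in blocks {
        total += count;
        cumulative.push(total);
    }
    cumulative.into_iter().skip(every - 1).step_by(every).collect()
}
```

With per-block counts `[5, 0, 12, 7, 3, 9, 1, 4]` scraped every 4 blocks, the gauge only reports the counts of blocks 4 and 8. The difference between two counter samples still covers every block in between.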
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,57 @@ | ||||||
use solana_gossip::cluster_info::ClusterInfo; | ||||||
use solana_runtime::bank::Bank; | ||||||
|
||||||
use crate::{ | ||||||
token::Lamports, | ||||||
utils::{write_metric, Metric, MetricFamily}, | ||||||
}; | ||||||
use std::{io, sync::Arc, time::SystemTime}; | ||||||
|
||||||
pub fn write_cluster_metrics<W: io::Write>( | ||||||
ruuda marked this conversation as resolved.
|
||||||
at: SystemTime, | ||||||
Review comment: Also here, we don’t need this |
||||||
bank: &Arc<Bank>, | ||||||
cluster_info: &Arc<ClusterInfo>, | ||||||
out: &mut W, | ||||||
) -> io::Result<()> { | ||||||
let identity_pubkey = cluster_info.id(); | ||||||
let version = cluster_info | ||||||
.get_node_version(&identity_pubkey) | ||||||
.unwrap_or_default(); | ||||||
|
||||||
write_metric( | ||||||
out, | ||||||
&MetricFamily { | ||||||
name: "solana_cluster_identity_public_key_info", | ||||||
help: "The current node's identity", | ||||||
Review comment (suggested change): The identity can change at runtime! |
||||||
type_: "count", | ||||||
metrics: vec![Metric::new(1) | ||||||
.with_label("identity", identity_pubkey.to_string()) | ||||||
.at(at)], | ||||||
ruuda marked this conversation as resolved.
|
||||||
}, | ||||||
)?; | ||||||
|
||||||
let identity_balance = Lamports(bank.get_balance(&identity_pubkey)); | ||||||
write_metric( | ||||||
out, | ||||||
&MetricFamily { | ||||||
name: "solana_cluster_identity_balance_total", | ||||||
Review comment (suggested change): The [...] Also, we should include the unit in the name. See also https://prometheus.io/docs/practices/naming/ Great idea to include this metric by the way! |
||||||
help: "The current node's identity balance", | ||||||
type_: "gauge", | ||||||
metrics: vec![Metric::new_sol(identity_balance).at(at)], | ||||||
Review comment: Also, we should add a label to this metric with the account’s pubkey. That way there cannot be any ambiguity about which account we are measuring the balance of. |
||||||
}, | ||||||
)?; | ||||||
|
||||||
write_metric( | ||||||
out, | ||||||
&MetricFamily { | ||||||
name: "solana_cluster_node_version_info", | ||||||
help: "The current Solana node's version", | ||||||
type_: "count", | ||||||
metrics: vec![Metric::new(1) | ||||||
.with_label("version", version.to_string()) | ||||||
.at(at)], | ||||||
}, | ||||||
)?; | ||||||
|
||||||
Ok(()) | ||||||
} |
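The review comments on this file ask for the unit in the metric name and for a label carrying the account's pubkey. In the Prometheus naming conventions linked above, the `_total` suffix is meant for counters, and a balance is a gauge. Below is a rough sketch of how such a gauge could look in the text exposition format. The metric name `solana_node_identity_balance_sol`, the label name `identity_account`, and both helper functions are assumptions for illustration, not the PR's `utils` API.

```rust
/// Escape a label value for the Prometheus text format:
/// backslash, double quote and newline must be escaped.
fn escape_label_value(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Render one gauge family with a single sample (hypothetical helper).
fn render_gauge(name: &str, help: &str, labels: &[(&str, &str)], value: f64) -> String {
    let mut out = String::new();
    out.push_str(&format!("# HELP {} {}\n", name, help));
    out.push_str(&format!("# TYPE {} gauge\n", name));
    let rendered: Vec<String> = labels
        .iter()
        .map(|(key, val)| format!("{}=\"{}\"", key, escape_label_value(val)))
        .collect();
    if rendered.is_empty() {
        out.push_str(&format!("{} {}\n", name, value));
    } else {
        out.push_str(&format!("{}{{{}}} {}\n", name, rendered.join(","), value));
    }
    out
}
```

Calling `render_gauge("solana_node_identity_balance_sol", "Balance of the identity account", &[("identity_account", "ExamplePubkey")], 1.5)` yields a sample line that names both the unit and the account being measured.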
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
mod bank_metrics; | ||
mod cluster_metrics; | ||
mod token; | ||
mod utils; | ||
|
||
use solana_gossip::cluster_info::ClusterInfo; | ||
use solana_runtime::bank_forks::BankForks; | ||
use std::{ | ||
sync::{Arc, RwLock}, | ||
time::SystemTime, | ||
}; | ||
|
||
pub fn render_prometheus( | ||
bank_forks: &Arc<RwLock<BankForks>>, | ||
cluster_info: &Arc<ClusterInfo>, | ||
) -> Vec<u8> { | ||
let current_bank = bank_forks.read().unwrap().working_bank(); | ||
Review comment: Is the [...]
Reply: This is the latest; I couldn't find an API in the struct to get slots at different commitment levels. We can do it at a later time. |
||
let now = SystemTime::now(); | ||
let mut out: Vec<u8> = Vec::new(); | ||
bank_metrics::write_bank_metrics(now, &current_bank, &mut out).expect("IO error"); | ||
cluster_metrics::write_cluster_metrics(now, &current_bank, &cluster_info, &mut out) | ||
.expect("IO error"); | ||
out | ||
} |
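The reviewer's point is that a slot metric is ambiguous unless it says which bank it came from. One way to express this, sketched here under assumptions, is a single metric family with one sample per commitment level. The label name `commitment_level`, the struct, and the slot numbers are hypothetical, and how to obtain the banks for each level from `BankForks` is left open, as it is in the discussion.

```rust
/// One sample of a hypothetical `solana_bank_slot` family.
struct SlotSample {
    /// For example "confirmed" or "finalized".
    commitment_level: &'static str,
    slot: u64,
}

/// Render all samples under a single HELP/TYPE header, one line per label value.
fn render_slot_family(samples: &[SlotSample]) -> String {
    let mut out = String::from(
        "# HELP solana_bank_slot Current Slot\n# TYPE solana_bank_slot gauge\n",
    );
    for sample in samples {
        out.push_str(&format!(
            "solana_bank_slot{{commitment_level=\"{}\"}} {}\n",
            sample.commitment_level, sample.slot
        ));
    }
    out
}
```

Adding the label from the start, even with a single value, keeps the time series unambiguous if more commitment levels are exposed later.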
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,196 @@ | ||
use std::{ | ||
Review comment: This is a nice utility from Solido, but it looks like overkill for what we need here. I think just a [...] |
||
convert::TryFrom, | ||
fmt, | ||
iter::Sum, | ||
ops::{Add, Div, Mul, Sub}, | ||
}; | ||
|
||
#[derive(Copy, Clone, PartialEq, Debug)] | ||
pub struct Rational { | ||
pub numerator: u64, | ||
pub denominator: u64, | ||
} | ||
|
||
impl PartialOrd for Rational { | ||
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> { | ||
if self.denominator == 0 || other.denominator == 0 { | ||
None | ||
} else { | ||
let x = self.numerator as u128 * other.denominator as u128; | ||
let y = other.numerator as u128 * self.denominator as u128; | ||
Some(x.cmp(&y)) | ||
} | ||
} | ||
} | ||
|
||
impl Div for Rational { | ||
type Output = f64; | ||
|
||
// We do not return a `Rational` here because `self.numerator * | ||
// rhs.denominator` or `rhs.numerator * self.denominator` could overflow. | ||
// Instead we deal with floating point numbers. | ||
fn div(self, rhs: Self) -> Self::Output { | ||
(self.numerator as f64 * rhs.denominator as f64) | ||
/ (self.denominator as f64 * rhs.numerator as f64) | ||
} | ||
} | ||
|
||
impl Rational { | ||
pub fn to_f64(&self) -> f64 { | ||
self.numerator as f64 / self.denominator as f64 | ||
} | ||
} | ||
|
||
/// Error returned when a calculation in a token type overflows, underflows, or divides by zero. | ||
#[derive(Debug, Eq, PartialEq)] | ||
pub struct ArithmeticError; | ||
|
||
pub type Result<T> = std::result::Result<T, ArithmeticError>; | ||
|
||
/// Generate a token type that wraps the minimal unit of the token, its | ||
/// “Lamport”. The symbol is for 10<sup>9</sup> of its minimal units and is | ||
/// only used for `Debug` and `Display` printing. | ||
#[macro_export] | ||
macro_rules! impl_token { | ||
($TokenLamports:ident, $symbol:expr, decimals = $decimals:expr) => { | ||
#[derive(Copy, Clone, Default, Eq, Ord, PartialEq, PartialOrd)] | ||
pub struct $TokenLamports(pub u64); | ||
|
||
impl fmt::Display for $TokenLamports { | ||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | ||
write!( | ||
f, | ||
"{}.{} {}", | ||
self.0 / 10u64.pow($decimals), | ||
&format!("{:0>9}", self.0 % 10u64.pow($decimals))[9 - $decimals..], | ||
$symbol | ||
) | ||
} | ||
} | ||
|
||
impl fmt::Debug for $TokenLamports { | ||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | ||
fmt::Display::fmt(self, f) | ||
} | ||
} | ||
|
||
impl Mul<Rational> for $TokenLamports { | ||
type Output = Result<$TokenLamports>; | ||
fn mul(self, other: Rational) -> Result<$TokenLamports> { | ||
// This multiplication cannot overflow, because we expand the | ||
// u64s into u128, and u64::MAX * u64::MAX < u128::MAX. | ||
let result_u128 = ((self.0 as u128) * (other.numerator as u128)) | ||
.checked_div(other.denominator as u128) | ||
.ok_or(ArithmeticError)?; | ||
u64::try_from(result_u128) | ||
.map($TokenLamports) | ||
.map_err(|_| ArithmeticError) | ||
} | ||
} | ||
|
||
impl Mul<u64> for $TokenLamports { | ||
type Output = Result<$TokenLamports>; | ||
fn mul(self, other: u64) -> Result<$TokenLamports> { | ||
self.0 | ||
.checked_mul(other) | ||
.map($TokenLamports) | ||
.ok_or(ArithmeticError) | ||
} | ||
} | ||
|
||
impl Div<u64> for $TokenLamports { | ||
type Output = Result<$TokenLamports>; | ||
fn div(self, other: u64) -> Result<$TokenLamports> { | ||
self.0 | ||
.checked_div(other) | ||
.map($TokenLamports) | ||
.ok_or(ArithmeticError) | ||
} | ||
} | ||
|
||
impl Sub<$TokenLamports> for $TokenLamports { | ||
type Output = Result<$TokenLamports>; | ||
fn sub(self, other: $TokenLamports) -> Result<$TokenLamports> { | ||
self.0 | ||
.checked_sub(other.0) | ||
.map($TokenLamports) | ||
.ok_or(ArithmeticError) | ||
} | ||
} | ||
|
||
impl Add<$TokenLamports> for $TokenLamports { | ||
type Output = Result<$TokenLamports>; | ||
fn add(self, other: $TokenLamports) -> Result<$TokenLamports> { | ||
self.0 | ||
.checked_add(other.0) | ||
.map($TokenLamports) | ||
.ok_or(ArithmeticError) | ||
} | ||
} | ||
|
||
impl Sum<$TokenLamports> for Result<$TokenLamports> { | ||
fn sum<I: Iterator<Item = $TokenLamports>>(iter: I) -> Self { | ||
let mut sum = $TokenLamports(0); | ||
for item in iter { | ||
sum = (sum + item)?; | ||
} | ||
Ok(sum) | ||
} | ||
} | ||
/// Parse a numeric string as an amount of Lamports, i.e., with 9 digit precision. | ||
/// | ||
/// Note that this parses the Lamports amount divided by 10<sup>9</sup>, | ||
/// which can include a decimal point. It does not parse the number of | ||
/// Lamports! This makes this function the semi-inverse of `Display` | ||
/// (only `Display` adds the suffixes, and we do not expect that | ||
/// here). | ||
impl std::str::FromStr for $TokenLamports { | ||
type Err = &'static str; | ||
fn from_str(s: &str) -> std::result::Result<Self, Self::Err> { | ||
let mut value = 0_u64; | ||
let mut is_after_decimal = false; | ||
let mut exponent: i32 = $decimals; | ||
let mut had_digit = false; | ||
|
||
// Walk the bytes one by one, we only expect ASCII digits or '.', so bytes | ||
// suffice. We build up the value as we go, and if we get past the decimal | ||
// point, we also track how far we are past it. | ||
for ch in s.as_bytes() { | ||
match ch { | ||
b'0'..=b'9' => { | ||
value = value * 10 + ((ch - b'0') as u64); | ||
if is_after_decimal { | ||
exponent -= 1; | ||
} | ||
had_digit = true; | ||
} | ||
b'.' if !is_after_decimal => is_after_decimal = true, | ||
b'.' => return Err("Value can contain at most one '.' (decimal point)."), | ||
b'_' => { /* As a courtesy, allow numeric underscores for readability. */ } | ||
_ => return Err("Invalid value, only digits, '_', and '.' are allowed."), | ||
} | ||
|
||
if exponent < 0 { | ||
return Err("Value can contain at most 9 digits after the decimal point."); | ||
} | ||
} | ||
|
||
if !had_digit { | ||
return Err("Value must contain at least one digit."); | ||
} | ||
|
||
// If the value contained fewer than 9 digits behind the decimal point | ||
// (or no decimal point at all), scale up the value so it is measured | ||
// in lamports. | ||
while exponent > 0 { | ||
value *= 10; | ||
exponent -= 1; | ||
} | ||
|
||
Ok($TokenLamports(value)) | ||
} | ||
} | ||
}; | ||
} | ||
|
||
impl_token!(Lamports, "SOL", decimals = 9); |
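For a metrics endpoint, the part of this file that matters is turning a lamport amount into SOL. The sketch below shows the same 9-decimal formatting that the `Display` impl produces, plus a floating-point conversion, without the macro or the checked arithmetic. Both function names are hypothetical, and this is not necessarily the simplification the reviewer had in mind.

```rust
/// Number of lamports in one SOL (9 decimals, as in `impl_token!` above).
const LAMPORTS_PER_SOL: u64 = 1_000_000_000;

/// Format a lamport amount as SOL with exactly 9 digits after the decimal
/// point, using integer arithmetic only.
fn format_sol(lamports: u64) -> String {
    format!(
        "{}.{:09} SOL",
        lamports / LAMPORTS_PER_SOL,
        lamports % LAMPORTS_PER_SOL
    )
}

/// Convert lamports to SOL as a float, e.g. for a gauge value.
/// Precision is limited by f64 for very large balances.
fn lamports_to_sol_f64(lamports: u64) -> f64 {
    lamports as f64 / LAMPORTS_PER_SOL as f64
}
```

For example, `format_sol(1_500_000_000)` gives `"1.500000000 SOL"`.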
Review comment: We use this `at` in solido because it proxies metrics: it has an internal polling loop that is not in sync with when Prometheus fetches the metrics. But now that we are building the metrics into Solana, we can get the latest values right at the moment that Prometheus polls us, so there is no need to add the timestamps.
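In the Prometheus text exposition format, the timestamp on a sample line is optional and is given in milliseconds since the Unix epoch. When it is omitted, Prometheus uses the scrape time. The sketch below illustrates the difference; `render_sample` is a hypothetical helper, not the `Metric::at` implementation from this PR.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Render one sample line, with or without an explicit timestamp.
fn render_sample(name: &str, value: u64, at: Option<SystemTime>) -> String {
    match at {
        Some(time) => {
            // Milliseconds since the Unix epoch; times before the epoch
            // are clamped to 0 in this sketch.
            let millis = time
                .duration_since(UNIX_EPOCH)
                .map(|elapsed| elapsed.as_millis())
                .unwrap_or(0);
            format!("{} {} {}\n", name, value, millis)
        }
        // No timestamp: the scraper assigns its own scrape time.
        None => format!("{} {}\n", name, value),
    }
}
```

A proxy with its own polling loop would pass `Some(poll_time)`. An endpoint that reads the values at scrape time can pass `None`.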