Support client monitoring by a remote service (e.g. beaconcha.in) #5037
Use "process_cpu_user_seconds_total" instead of "process_cpu_seconds_total" because it properly reports the CPU usage of the process and, more importantly, also triggers the collect method of the metric, which ensures that the latest value is always shown.
Those network metrics can be found with:
Disk usage: we can use this to make a new metric: https://github.com/Level/classic-level#dbapproximatesizestart-end-options-callback
On the active validator count: not sure yet, will dig more into this later.
Thanks @wemeetagain for taking a look! About the
I think that would be good to have in general as a metric. Right now, as an operator, you have to look at the OS level to find out how big the beacon node DB is.
🙏
Only print out information which is configurable by non-hidden CLI options and properly documented.
Perhaps we get this to a clean mergeable point and make issues out of any remaining outstanding points (just don't want this to rot).
I haven't really worked on it in the last 2 weeks, but it is ready to be merged in its current state and just requires review now. Gathering system stats is debatable anyway; prysm, for example, does not implement this at the moment and only supports VC and BN stats. I will do a PoC implementation to see how this would look in nodejs, and then we can decide whether to include it. I think some system stats, such as disk size, might not work in containerized environments unless some mount points are added. The missing values are also now mostly solved,
This PR looks really tidy :)
Have you tried this out yet?
It works well with the data we currently gather, but a few features in the beaconcha.in mobile app rely on system stats.
🎉 This PR is included in v1.6.0 🎉
Motivation
At the moment Lodestar is the only client that does not support beaconcha.in's mobile app node monitoring. This feature is quite useful for node operators to have real-time stats about their system and also get notifications if there are any issues.
Description
Adds support for pushing client and system metrics to a remote service. The implementation is based on this specification and should be service agnostic but will be initially tested against beaconcha.in.
The implementation supports all three types of client stats (beacon node, validator and system). Most of the stats are collected from existing metrics but there are also static values and other dynamic values retrieved from defined provider functions. Both the beacon node and validator support pushing client and system metrics but by default only the beacon node enables collecting system stats as in most cases both clients run on the same host.
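The three sources of values described above (static values, existing metrics, and provider functions) can be sketched as a small discriminated union. This is only an illustration: the type names, the `collectStats` helper, the plain-object metric snapshot, and the `sync_beacon_head_slot` key are assumptions and not the code in this PR.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not Lodestar's API.
type JsonValue = string | number | boolean;

type StatDefinition =
  | {kind: "static"; jsonKey: string; value: JsonValue}
  | {kind: "metric"; jsonKey: string; metricName: string; defaultValue: JsonValue}
  | {kind: "provider"; jsonKey: string; provider: () => JsonValue};

// Stand-in for a metrics registry: metric name -> latest value.
type MetricSnapshot = Record<string, number>;

function collectStats(definitions: StatDefinition[], metrics: MetricSnapshot): Record<string, JsonValue> {
  const stats: Record<string, JsonValue> = {};
  for (const def of definitions) {
    switch (def.kind) {
      case "static":
        stats[def.jsonKey] = def.value;
        break;
      case "metric": {
        // Fall back to the default when the metric is not available.
        const value = metrics[def.metricName];
        stats[def.jsonKey] = value !== undefined ? value : def.defaultValue;
        break;
      }
      case "provider":
        stats[def.jsonKey] = def.provider();
        break;
    }
  }
  return stats;
}

// Example values only; the metric names are the ones mentioned in this PR.
const snapshot: MetricSnapshot = {
  beacon_head_slot: 4200,
  lodestar_execution_engine_http_client_config_urls_count: 1,
};

const beaconNodeStats = collectStats(
  [
    {kind: "static", jsonKey: "client_build", value: 0},
    {kind: "static", jsonKey: "slasher_active", value: false},
    // The JSON key "sync_beacon_head_slot" is an assumed name for illustration.
    {kind: "metric", jsonKey: "sync_beacon_head_slot", metricName: "beacon_head_slot", defaultValue: 0},
    {
      kind: "provider",
      jsonKey: "sync_eth1_connected",
      provider: () => (snapshot["lodestar_execution_engine_http_client_config_urls_count"] || 0) > 0,
    },
  ],
  snapshot
);
```

Keeping the definitions as data means a hard-coded value can later be swapped for a metric-backed one without touching the collection loop.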
Client monitoring is disabled by default but can be enabled by passing the --monitoring.endpoint CLI flag. As monitoring relies on metrics data, metrics originally had to be enabled by supplying the --metrics flag; the --metrics flag is no longer required as of #5328.
lodestar beacon --monitoring.endpoint "https://beaconcha.in/api/v1/client/metrics?apikey={apikey}"
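As a rough illustration of what pushing to such an endpoint could involve, the sketch below fills the `{apikey}` placeholder and builds a request description. It assumes the stats are sent as a JSON array in an HTTP POST body; that assumption, and the helper names `fillApiKey` and `buildPushRequest`, are not taken from this PR.

```typescript
// Hypothetical sketch: helper names and the request shape are assumptions.
interface PushRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Replace the "{apikey}" placeholder from the example command with a real key.
function fillApiKey(endpointTemplate: string, apiKey: string): string {
  return endpointTemplate.replace("{apikey}", encodeURIComponent(apiKey));
}

// Describe the request without sending it, so it can be inspected or tested.
function buildPushRequest(endpoint: string, stats: Array<Record<string, unknown>>): PushRequest {
  return {
    url: endpoint,
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify(stats),
  };
}

const endpoint = fillApiKey("https://beaconcha.in/api/v1/client/metrics?apikey={apikey}", "my-key");
const request = buildPushRequest(endpoint, [{client_build: 0, slasher_active: false}]);
```

Separating request construction from sending keeps the payload easy to check before any network call is made.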
Design decisions
- beacon_head_slot) to select the metric (the other option would be to use the metric object, but not all metrics are available there, e.g. nodejs metrics, libp2p metrics)
Considerations
- @lodestar/monitoring) to better decouple from the beacon node and improve reusability; could also think about a package called @lodestar/common which includes things that are reused by the beacon node and validator, such as metrics and monitoring.
Problematic/missing values
- disk_beaconchain_bytes_total is not retrievable from metrics, add a metric for this? Is there another way this value can be retrieved? (teku hard-codes this to 0, prysm and lighthouse have a prom metric for this). Metric added in "Add metrics to capture beacon node and validator db size" #5087 and beacon node stats updated in "Add metric value for disk_beaconchain_bytes_total" #5162
- network_libp2p_bytes_total_receive uses libp2p_data_transfer_bytes_total{protocol="global received"} but this metric is not available most of the time. Is there another metric to get this data from? (teku and prysm both hard-code this to 0, lighthouse gets it from the libp2p_inbound_bytes prom metric)
- network_libp2p_bytes_total_transmit same as above but libp2p_data_transfer_bytes_total{protocol="global sent"} is used (teku and prysm both hard-code this to 0, lighthouse gets it from the libp2p_outbound_bytes prom metric)
- validator_active uses the vc_indices_count metric, which is the total number of validators, but we only want active validators here (other CL clients have a metric for this, mostly using a label to differentiate between total and active). "Add VC metric to track validator statuses" #5158
- sync_eth1_connected hard-coded to true (based on lodestar_execution_engine_http_client_config_urls_count, if count is above 0 this will be set to true; any way to check if the connected eth1 node is synced?)
- client_build hard-coded to 0 (lodestar does not use incremental build numbers, teku and lighthouse also hard-code this to 0)
- sync_eth2_fallback_configured hard-coded to false (teku and prysm both hard-code this to false, lighthouse gets it from the sync_eth2_fallback_configured prom metric)
- sync_eth2_fallback_connected hard-coded to false (teku and prysm both hard-code this to false, lighthouse gets it from the sync_eth2_fallback_configured prom metric)
- sync_eth1_fallback_configured hard-coded to false (based on lodestar_execution_engine_http_client_config_urls_count, if count is above 1 a fallback url is configured and this will be set to true)
- sync_eth1_fallback_connected hard-coded to false (teku hard-codes this to false, prysm and lighthouse have a prom metric for this; in lodestar we only track the total requests done to a fallback node with the lodestar_execution_engine_http_client_request_used_fallback_url_total prom metric, but this can't really be used here)
- slasher_active hard-coded to false (currently lodestar does not implement a slasher)
Open tasks
- vs scrape from node exporter, implemented in "Collect system data for client monitoring" #5182
Closes #4666
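Two points from the description above, selecting a metric by its name (optionally narrowed by a label) and deriving the eth1 flags from the configured URL count, can be sketched as follows. The `MetricSample` shape and the helpers `findMetricValue` and `valueOrDefault` are hypothetical and are not prom-client's or Lodestar's API; the sample values are made up.

```typescript
// Hypothetical shape of one collected metric sample; not a real library type.
interface MetricSample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Return the value of the first sample matching the name and all given labels.
function findMetricValue(
  samples: MetricSample[],
  name: string,
  labels: Record<string, string> = {}
): number | undefined {
  for (const sample of samples) {
    if (sample.name !== name) continue;
    const labelsMatch = Object.keys(labels).every((key) => sample.labels[key] === labels[key]);
    if (labelsMatch) return sample.value;
  }
  return undefined;
}

// Use a default (e.g. 0) when a metric is not available.
function valueOrDefault(value: number | undefined, defaultValue: number): number {
  return value === undefined ? defaultValue : value;
}

const samples: MetricSample[] = [
  {name: "beacon_head_slot", labels: {}, value: 123},
  {name: "lodestar_execution_engine_http_client_config_urls_count", labels: {}, value: 2},
];

const headSlot = findMetricValue(samples, "beacon_head_slot");

// The libp2p metric is absent from the samples, so this falls back to 0.
const bytesReceived = valueOrDefault(
  findMetricValue(samples, "libp2p_data_transfer_bytes_total", {protocol: "global received"}),
  0
);

const urlsCount = valueOrDefault(
  findMetricValue(samples, "lodestar_execution_engine_http_client_config_urls_count"),
  0
);
const syncEth1Connected = urlsCount > 0; // at least one execution URL configured
const syncEth1FallbackConfigured = urlsCount > 1; // more than one URL implies a fallback
```

Looking metrics up by name works for any metric that ends up in the collected samples, which matches the stated reason for preferring names over metric objects.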