Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support client monitoring by a remote service (e.g. beaconcha.in) #5037

Merged
merged 33 commits into from
Feb 16, 2023

Conversation

nflaig
Copy link
Member

@nflaig nflaig commented Jan 22, 2023

Motivation

At the moment Lodestar is the only client that does not support beaconcha.in's mobile app node monitoring. This feature is quite useful for node operators to have real-time stats about their system and also get notifications if there are any issues.

Description

Adds support for pushing client and system metrics to a remote service. The implementation is based on this specification and should be service agnostic but will be initially tested against beaconcha.in.

The implementation supports all three types of client stats (beacon node, validator and system). Most of the stats are collected from existing metrics but there are also static values and other dynamic values retrieved from defined provider functions. Both the beacon node and validator support pushing client and system metrics but by default only the beacon node enables collecting system stats as in most cases both clients run on the same host.

The client monitoring is disabled by default but can be enabled by passing the --monitoring.endpoint cli flag. As monitoring relies on metrics data, it is required that metrics are enabled by supplying the --metrics flag. --metrics flag is no longer required as of #5328

lodestar beacon --monitoring.endpoint "https://beaconcha.in/api/v1/client/metrics?apikey={apikey}"

Design decisions

  • independent from metrics (metrics are just a data provider), introduce new term "monitoring"
  • declarative approach to define client stats which makes it easy to maintain and extend in the future as spec evolves
  • independent from beacon node and validator, just relies on metrics register to be injected
  • use metric name (e.g. beacon_head_slot) to select the metric (other option would be to use metric object but not all metrics are available there, e.g. nodejs metrics, libp2p metrics)
  • use good default values and only expose minimal set of options as cli args
  • assume that remote service is unreliable, properly handle timeouts, aborts and pending requests

Considerations

  • Create separate package (e.g. @lodestar/monitoring) to better decouple from beacon node and improve reuseability, could also think about a package called @lodestar/common which includes things that are reused by the beacon node and validator such as metrics and monitoring.

Problematic/missing values

  • disk_beaconchain_bytes_total is not retrievable from metrics, add metric for this? other option how this value can be retrieved? (teku hard-codes this to 0, prysm and lighthouse have a prom metric for this), metric added in Add metrics to capture beacon node and validator db size #5087 and beacon node stats updated in Add metric value for disk_beaconchain_bytes_total #5162
  • network_libp2p_bytes_total_receive uses libp2p_data_transfer_bytes_total{protocol="global received"} but this metric is not available most of the time, other metric to get this data from? (teku and prysm both hard-code this to 0, lighthouse gets it from libp2p_inbound_bytes prom metric)
  • network_libp2p_bytes_total_transmit same as above but libp2p_data_transfer_bytes_total{protocol="global sent"} is used (teku and prysm both hard-code this to 0, lighthouse gets it from libp2p_outbound_bytes prom metric)
  • validator_active uses vc_indices_count metric which is the total amount of validators but we only want active validators here (other CL clients have a metric for this, mostly use a label to differentiate between total and active) Add VC metric to track validator statuses #5158
  • sync_eth1_connected hard-coded to true (based on lodestar_execution_engine_http_client_config_urls_count, if count is above 0 this will be set to true, any way to check if connected eth1 node is synced?)
  • client_build hard-coded to 0 (lodestar does not use incremental build numbers, teku and lighthouse also hard-code this to 0)
  • sync_eth2_fallback_configured hard-coded to false (teku and prysm both hard-code this to false, lighthouse gets it from sync_eth2_fallback_configured prom metric)
  • sync_eth2_fallback_connected hard-coded to false (teku and prysm both hard-code this to false, lighthouse gets it from sync_eth2_fallback_configured prom metric)
  • sync_eth1_fallback_configured hard-coded to false (based on lodestar_execution_engine_http_client_config_urls_count, if count is above 1 a fallback url is configured and this will be set to true)
  • sync_eth1_fallback_connected hard-coded to false (teku hard-codes this to false, prysm and lighthouse have a prom metric for this, in lodestar we only track the total requests done to a fallback node with lodestar_execution_engine_http_client_request_used_fallback_url_total prom metrics but this can't really be used here)
  • slasher_active hard-coded to false (currently lodestar does not implement a slasher)

Open tasks

Closes #4666

@nflaig nflaig requested a review from a team as a code owner January 22, 2023 16:20
@nflaig nflaig marked this pull request as draft January 22, 2023 16:20
@nflaig nflaig force-pushed the monitoring branch 3 times, most recently from 17321c9 to 6141563 Compare January 23, 2023 14:46
@wemeetagain
Copy link
Member

Those network metrics can be found with:

  • libp2p_data_transfer_bytes_total{protocol="global sent"}
  • libp2p_data_transfer_bytes_total{protocol="global received"}

Disk usage, we can use this to make a new metric: https://github.com/Level/classic-level#dbapproximatesizestart-end-options-callback
(Or set to 0 for now and make this an issue)

On the active validator count, not sure yet, will dig more into this later

@nflaig
Copy link
Member Author

nflaig commented Jan 31, 2023

Thanks @wemeetagain for taking a look!

About the libp2p_data_transfer_bytes_total metrics, I already use those but I was unsure if those are the correct ones because the libp2p metrics were not collected but this is only the case for older lodestar versions and it is tracked now with the libp2p update you did in #4717

Disk usage, we can use this to make a new metric: https://github.com/Level/classic-level#dbapproximatesizestart-end-options-callback

I think that would be good to have in general as a metric, right now as an operator you have to look on OS level to find out how big the beacon node DB is

On the active validator count, not sure yet, will dig more into this later

🙏

@nflaig nflaig mentioned this pull request Feb 2, 2023
7 tasks
Only print out information which is configurable by non-hidden CLI options and properly documented.
@nflaig nflaig marked this pull request as ready for review February 13, 2023 19:45
@wemeetagain
Copy link
Member

wemeetagain commented Feb 13, 2023

Perhaps we get this to a clean mergeable point and make issues out of any remaining outstanding points (just don't want this to rot)

@nflaig
Copy link
Member Author

nflaig commented Feb 14, 2023

I haven't really worked on it in the last 2 weeks but is it ready to be merged in the current state, just requires review now.

Gathering system stats is anyways debatable, prysm for example does not implement this at the moment and just supports VC and BN stats. Will do a PoC implementation to see how this would look like in nodejs and then we can decide to include it or not, I think it some system stats such as disk size might not work in containerized environments unless some mount point are added.

The missing values are also now mostly solved, validator_active is definitely interesting for node operators to see how many of their imported validators are actually active on the beacon chain. I think makes sense to take a look at this metric independent from this PR.

Copy link
Member

@wemeetagain wemeetagain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks really tidy :)

Have you tried this out yet?

@nflaig
Copy link
Member Author

nflaig commented Feb 14, 2023

Have you tried this out yet?

It works well with the data we currently gather but few features in the beaconcha.in mobile app rely on system stats.
I want to finalize the system stats implementation and then do further integration testing and analyze the performance.

@wemeetagain
Copy link
Member

🎉 This PR is included in v1.6.0 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Support beaconcha.in client node monitoring endpoint
3 participants