Support client monitoring by a remote service (e.g. beaconcha.in) #5037

nflaig · 2023-01-22T16:20:44Z

Motivation

At the moment Lodestar is the only client that does not support beaconcha.in's mobile app node monitoring. This feature is quite useful for node operators to have real-time stats about their system and also get notifications if there are any issues.

Description

Adds support for pushing client and system metrics to a remote service. The implementation is based on this specification and should be service agnostic but will be initially tested against beaconcha.in.

The implementation supports all three types of client stats (beacon node, validator and system). Most stats are collected from existing metrics, while the rest are static values or dynamic values retrieved from defined provider functions. Both the beacon node and the validator support pushing client and system metrics, but by default only the beacon node collects system stats because in most cases both clients run on the same host.
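To illustrate how the three value sources (static, metric, provider) could be declared, here is a minimal TypeScript sketch. The type names `StatSource` and `StatDefinition`, the `collectStats` helper, the JSON keys, and the plain `Map` standing in for the metrics register are all assumptions made for illustration; this is not the code added in this PR.

```typescript
// Hypothetical sketch: each client stat declares where its value comes from.
type JsonValue = string | number | boolean;

type StatSource =
  | {kind: "static"; value: JsonValue}
  | {kind: "metric"; metricName: string; defaultValue: number}
  | {kind: "provider"; provide: () => JsonValue};

type StatDefinition = {jsonKey: string; source: StatSource};

// Stand-in for a metrics register: metric name -> latest value (assumption).
type MetricValues = Map<string, number>;

function collectStats(definitions: StatDefinition[], metrics: MetricValues): Record<string, JsonValue> {
  const stats: Record<string, JsonValue> = {};
  for (const {jsonKey, source} of definitions) {
    switch (source.kind) {
      case "static":
        stats[jsonKey] = source.value;
        break;
      case "metric":
        // Select the metric by name; fall back to the default if it is missing
        stats[jsonKey] = metrics.get(source.metricName) ?? source.defaultValue;
        break;
      case "provider":
        stats[jsonKey] = source.provide();
        break;
    }
  }
  return stats;
}

// Illustrative definitions only; the keys are placeholders.
const exampleDefinitions: StatDefinition[] = [
  {jsonKey: "client_name", source: {kind: "static", value: "lodestar"}},
  {jsonKey: "sync_beacon_head_slot", source: {kind: "metric", metricName: "beacon_head_slot", defaultValue: 0}},
  {jsonKey: "timestamp", source: {kind: "provider", provide: () => Date.now()}},
];
```

With definitions like these, adding a new stat would only require a new entry in the array, which is the maintainability benefit a declarative approach aims for.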

Client monitoring is disabled by default but can be enabled by passing the --monitoring.endpoint CLI flag. ~~As monitoring relies on metrics data, it is required that metrics are enabled by supplying the --metrics flag.~~ The --metrics flag is no longer required as of #5328.

lodestar beacon --monitoring.endpoint "https://beaconcha.in/api/v1/client/metrics?apikey={apikey}"
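Because the endpoint is user-supplied, it is worth rejecting malformed values early. The following is a minimal sketch of such a check; the function name `parseMonitoringEndpoint` and the error messages are hypothetical and not taken from this PR. The messages deliberately omit the endpoint itself, since it may contain an API key.

```typescript
// Hypothetical helper: checks that a monitoring endpoint is an http(s) URL.
function parseMonitoringEndpoint(endpoint: string): URL {
  let url: URL;
  try {
    // The URL constructor throws a TypeError for strings that are not absolute URLs
    url = new URL(endpoint);
  } catch {
    throw new Error("Monitoring endpoint must be a valid URL");
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error("Monitoring endpoint must use http or https");
  }
  return url;
}
```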

Design decisions

independent from metrics (metrics are just a data provider), introduce new term "monitoring"
declarative approach to define client stats which makes it easy to maintain and extend in the future as spec evolves
independent from beacon node and validator, just relies on metrics register to be injected
use metric name (e.g. beacon_head_slot) to select the metric (other option would be to use metric object but not all metrics are available there, e.g. nodejs metrics, libp2p metrics)
use good default values and only expose minimal set of options as cli args
assume that remote service is unreliable, properly handle timeouts, aborts and pending requests
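The last point can be sketched as follows: a sender that never overlaps requests, aborts a request after a timeout, and swallows failures so the client keeps running. The class name `MonitoringSender`, the injected `SendFn`, and the `failedRequests` counter are assumptions made for illustration, not the implementation in this PR.

```typescript
// Hypothetical transport function; it is expected to reject once the signal is aborted.
type SendFn = (body: string, signal: AbortSignal) => Promise<void>;

class MonitoringSender {
  failedRequests = 0;
  private pending: Promise<void> | null = null;
  private controller: AbortController | null = null;
  private closed = false;

  constructor(
    private readonly sendFn: SendFn,
    private readonly timeoutMs: number
  ) {}

  get isPending(): boolean {
    return this.pending !== null;
  }

  async send(body: string): Promise<void> {
    // Wait for any request that is still in flight before starting the next one
    while (this.pending !== null) await this.pending;
    if (this.closed) return;

    const controller = new AbortController();
    // Abort the request if the remote service does not answer in time
    const timer = setTimeout(() => controller.abort(), this.timeoutMs);
    this.controller = controller;
    // The async wrapper turns a synchronous throw in sendFn into a rejection
    this.pending = (async () => this.sendFn(body, controller.signal))()
      .catch(() => {
        // The remote service is assumed to be unreliable: count the failure, do not rethrow
        this.failedRequests++;
      })
      .finally(() => {
        clearTimeout(timer);
        this.pending = null;
        this.controller = null;
      });
    await this.pending;
  }

  // Abort the in-flight request (if any) and refuse new ones, e.g. on shutdown
  close(): void {
    this.closed = true;
    this.controller?.abort();
  }
}
```

In practice `sendFn` could wrap a `fetch` call that passes `signal` in its options, so that aborting the controller cancels the HTTP request.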

Considerations

Create a separate package (e.g. @lodestar/monitoring) to better decouple from the beacon node and improve reusability. We could also think about a package called @lodestar/common which includes things that are reused by the beacon node and validator, such as metrics and monitoring.

Problematic/missing values

Open tasks

Collect system stats, internally with systeminformation ~~vs scrape from node exporter~~, implemented in Collect system data for client monitoring #5182
Update Lodestar documentation, add separate page for client monitoring
Create issue to track updates required in other repositories (e.g. eth-docker, eth2-client-metrics, etc.), Client monitoring integration tracker #5095
Add unit tests (data validation, misconfiguration handling, other failure scenarios, e.g. remote service offline)
Ensure that client stats are collected correctly from metrics or system info and sent in the right format, see Mobile app screenshots in Collect system data for client monitoring #5182
Add metrics to monitoring service. Track failed requests, request duration, time spent collecting data, etc.
Add dashboard panel to visualize monitoring service metrics

Closes #4666

wemeetagain · 2023-01-30T23:44:07Z

Those network metrics can be found with:

libp2p_data_transfer_bytes_total{protocol="global sent"}
libp2p_data_transfer_bytes_total{protocol="global received"}

Disk usage, we can use this to make a new metric: https://github.com/Level/classic-level#dbapproximatesizestart-end-options-callback
(Or set to 0 for now and make this an issue)

On the active validator count, not sure yet, will dig more into this later

nflaig · 2023-01-31T09:38:42Z

Thanks @wemeetagain for taking a look!

About the libp2p_data_transfer_bytes_total metrics: I already use those, but I was unsure whether they are the correct ones because the libp2p metrics were not being collected. That is only the case for older Lodestar versions though; they are tracked now with the libp2p update you did in #4717.

> Disk usage, we can use this to make a new metric: https://github.com/Level/classic-level#dbapproximatesizestart-end-options-callback

I think that would be good to have in general as a metric. Right now, as an operator, you have to look at the OS level to find out how big the beacon node DB is.

> On the active validator count, not sure yet, will dig more into this later

🙏

wemeetagain · 2023-02-13T22:07:02Z

Perhaps we get this to a clean mergeable point and make issues out of any remaining outstanding points (just don't want this to rot)

nflaig · 2023-02-14T07:32:54Z

I haven't really worked on it in the last 2 weeks, but it is ready to be merged in the current state; it just requires review now.

Gathering system stats is debatable anyway; Prysm, for example, does not implement this at the moment and just supports VC and BN stats. I will do a PoC implementation to see what this would look like in Node.js, and then we can decide whether to include it or not. I think some system stats, such as disk size, might not work in containerized environments unless some mount points are added.

The missing values are also mostly solved now. validator_active is definitely interesting for node operators to see how many of their imported validators are actually active on the beacon chain. I think it makes sense to take a look at this metric independently of this PR.

wemeetagain

This PR looks really tidy :)

Have you tried this out yet?

nflaig · 2023-02-14T18:51:07Z

> Have you tried this out yet?

It works well with the data we currently gather, but a few features in the beaconcha.in mobile app rely on system stats.
I want to finalize the system stats implementation and then do further integration testing and analyze the performance.

wemeetagain · 2023-03-21T16:51:11Z

🎉 This PR is included in v1.6.0 🎉

nflaig requested a review from a team as a code owner January 22, 2023 16:20

nflaig marked this pull request as draft January 22, 2023 16:20

nflaig force-pushed the monitoring branch 3 times, most recently from 17321c9 to 6141563 Compare January 23, 2023 14:46

nflaig added 20 commits January 28, 2023 14:43

Initial client monitoring implementation

5490e12

Add monitoring to beacon node

7542678

Add monitoring to validator

77c6869

Ensure that monitoring endpoint is a valid URL

59c23dd

Improve validation of monitoring endpoint

ff8b9a0

Improve error handling and timeout of remote server request

f844328

Wait for pending request before sending next one

65df766

Update request error handling

32f7c9f

Update monitoring endpoint parsing

c114aae

Export monitoring package

3a58102

Fix process cpu seconds total metric

5abd81f

Use "process_cpu_user_seconds_total" instead of "process_cpu_seconds_total" as it properly reports the CPU usage of the process and, more importantly, also triggers the collect method of the metric, which ensures that the latest value is always shown.

Add option to collect system stats

585918d

Define system stats

7dda420

Improve logs when monitoring service is started

b019303

Use the term "remote service" instead of "remote server"

ae1d055

Move Client type to service

d074394

Add monitoring args to beacon node test

0d84d13

Update description of monitoring cli args

b1a38d4

Update monitoring service

bc4b92f

Add metrics for collecting and sending data

92ed7c0

nflaig force-pushed the monitoring branch from e6fdf34 to 01647dd Compare January 28, 2023 13:46

nflaig added 3 commits January 28, 2023 14:53

Add monitoring panels to VM + host dashboard

03bca57

Update send data metric buckets

e9285d3

Print out machine when starting monitoring service

a60b1d4

nflaig force-pushed the monitoring branch from 01647dd to 7a53a3e Compare January 28, 2023 14:40

nflaig mentioned this pull request Jan 31, 2023

Add metrics to capture beacon node and validator db size #5087

Merged

nflaig added 5 commits February 1, 2023 12:18

Add unit tests

e438f43

Add metric values for sync_eth1_connected and sync_eth1_fallback_configured

e785170

Use setTimeout instead of sleep for initial delay

0cd1e78

Use milliseconds instead of seconds for time values

f29592b

Add description to client stats properties

bd78a41

nflaig force-pushed the monitoring branch from 130f899 to bd78a41 Compare February 1, 2023 11:21

nflaig added 2 commits February 1, 2023 14:43

Remove sinon spies after tests are finished

5a73255

Document client monitoring usage

bd8fc4f

nflaig mentioned this pull request Feb 2, 2023

Client monitoring integration tracker #5095

Closed

7 tasks

Add enum to check status of monitoring service

e8f7fd5

nflaig force-pushed the monitoring branch from e7f2aaf to e8f7fd5 Compare February 5, 2023 17:26

Reduce info log when monitoring service is started

914a3b6

Only print out information which is configurable by non-hidden CLI options and properly documented.

nflaig marked this pull request as ready for review February 13, 2023 19:45

wemeetagain approved these changes Feb 14, 2023

wemeetagain merged commit e6eb3bd into ChainSafe:unstable Feb 16, 2023

This was referenced Feb 17, 2023

Add metric value for disk_beaconchain_bytes_total #5162

Merged

Refactor client monitoring #5183

Merged

Collect system data for client monitoring #5182

Merged

philknows added this to the v1.5.0 milestone Feb 20, 2023

nflaig mentioned this pull request Feb 21, 2023

Fix client monitoring rate limit errors #5189

Merged

nflaig modified the milestones: v1.5.0, v1.6.0 Mar 4, 2023

philknows mentioned this pull request Jun 7, 2023

Add Nico Flaig (Lodestar) protocolguild/documentation#85

Merged
