Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect system data for client monitoring #5182

Merged
merged 2 commits into from
Feb 20, 2023

Conversation

nflaig
Copy link
Member

@nflaig nflaig commented Feb 20, 2023

Motivation

Follow up PR for #5037 to implement collection of system data to support full feature set of beaconcha.in monitoring app.

Description

Replaces currently hard-coded values in system stats with proper provider functions. Most of the system data is collected using systeminformation which is a widely used and well maintained package with 0 dependencies. There are a few issues running in a containerized environment as some data might be inaccurate or missing but the important data is collected correctly even there. These issues might also be addressed in the future, see sebhildebrandt/systeminformation#777, although it is really not that important.

Performance

Dashboards are already in place to keep track of the performance. The functions to collect system data are not executed in the main thread and a lot of the time use child processes which execute native system calls depending on OS. The performance measurements from the last few days looks really promising and I did not see any issues as of now both validator efficiency and system load seem unaffected by enabling client monitoring.

Added some screenshots of prometheus metrics, just let me now if I should provide further metrics.

Prometheus metrics

Takes between 0.5 - 1 second to collect system data, depending on the environment (on windows ~3 seconds).

image

It should not cause any spikes in Lodestar CPU usage as it is executed in a different process

image

Overall container CPU usage looks got to me as well, it does not seem like collecting system data consumes too much CPU

image

Event Loop Lag seems unaffected as well

image

Process memory usage looks stable

image

Container memory usage as well (no rss memory leak)

image

Beaconcha.in mobile app

We support all dashboards in the beaconcha.in mobile app and also all notifications which is the most valuable feature of the app in my opinion. A node operator gets a lot of value from the app which you normally would only get if you run prometheus + grafana and invest additional time to configure proper alerting.

Mobile app screenshots

Screenshot_20230220_135448_Beaconchain
Screenshot_20230220_141016_Beaconchain
Screenshot_20230218_212040_Beaconchain
Screenshot_20230218_212106_Beaconchain
Screenshot_20230219_232736_Beaconchain
Screenshot_20230218_212159_Beaconchain
Screenshot_20230218_212227_Beaconchain
Screenshot_20230218_212242_Beaconchain
Screenshot_20230218_212305_Beaconchain
Screenshot_20230218_212320_Beaconchain
Screenshot_20230218_212338_Beaconchain
Screenshot_20230218_212403_Beaconchain
Screenshot_20230218_212417_Beaconchain

@nflaig nflaig force-pushed the collect-system-stats branch from 3ffc338 to 1e48428 Compare February 20, 2023 14:16
@nflaig nflaig marked this pull request as ready for review February 20, 2023 14:27
@nflaig nflaig requested a review from a team as a code owner February 20, 2023 14:27
Copy link
Member

@wemeetagain wemeetagain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
More review could be done on packages/beacon-node/src/monitoring/system.ts where the majority of the functionality lives.

@wemeetagain wemeetagain merged commit e41e19b into ChainSafe:unstable Feb 20, 2023
@philknows philknows added this to the v1.5.0 milestone Feb 20, 2023
@wemeetagain
Copy link
Member

🎉 This PR is included in v1.6.0 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants