server: collect system CPU, RAM, Network, IO metrics #21416
Comments
Notes from meeting:
#23733 is an attempt to do at least a little better on our CPU and memory metrics by breaking them out per node. With respect to getting a sense of utilization, we should at least record the number of cores on each machine, if not the speed of each core.
Filed #24205 re: number of cores.
The Prometheus |
We're currently using https://github.com/elastic/gosigar to gather hardware metrics, which is missing some of the things requested here. It looks like we can provide most of them by switching over to https://github.com/shirou/gopsutil. Specifically:
Prometheus node_exporter provides some of the same stats as gopsutil, but with some layers of Prometheus stuff around getting the numbers; I'm inclined to use
Filed #27085 to track the work of implementing this for 2.1.
Closing this since most of these are now collected. We can open issues for individual things that are missing.
FEATURE REQUEST
Monitoring of system capacity (CPU, RAM, Network, IO) is critical to maintaining a stable and performant system.
Performance
Stability
A distributed system's ability to detect whether its members are still alive depends on reasonable and predictable response times for health checks and metadata exchange. A shortage of system resources degrades these tasks, which can trip predefined heuristics and initiate failover.
blocker, must-have, should-have, nice-to-have
At a bare minimum, the following system capacity metrics should be tracked at a 1-minute interval and retained for some amount of time.
Nice to have:
Often, the configuration of the system changes (for example, adding more CPU and RAM to the server, or adding more disk and network capacity). These configuration changes should also be saved so that utilization, which is often expressed as a percentage, can be translated to absolute units. Absolute units are useful in capacity planning and cloud migration. For example, 10% utilization on a 1GHz CPU means something different than 10% utilization on a 3GHz CPU. CockroachDB allows online migration from Cloud A to Cloud B, or from an older machine to a newer one. Having this information allows these activities to be planned based on accurate historical data.
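The 1GHz versus 3GHz example can be worked through in a few lines of Go. This is only a sketch of the arithmetic: `HostConfig` and `UsedMHz` are hypothetical names, and treating clock speed times core count as "capacity" is a rough simplification that ignores differences between CPU generations.

```go
package main

import "fmt"

// HostConfig is a hypothetical record of a machine's CPU configuration,
// saved alongside the utilization samples taken on that machine.
type HostConfig struct {
	Cores   int
	CoreMHz float64
}

// UsedMHz converts a utilization percentage into the absolute CPU capacity
// consumed, in MHz summed across all cores.
func UsedMHz(utilPercent float64, cfg HostConfig) float64 {
	return utilPercent * float64(cfg.Cores) * cfg.CoreMHz / 100
}

func main() {
	older := HostConfig{Cores: 1, CoreMHz: 1000} // 1GHz machine
	newer := HostConfig{Cores: 1, CoreMHz: 3000} // 3GHz machine

	// The same 10% reading corresponds to different absolute usage.
	fmt.Println(UsedMHz(10, older)) // prints 100
	fmt.Println(UsedMHz(10, newer)) // prints 300
}
```

Without the saved configuration, the two 10% readings would be indistinguishable in historical data.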
workarounds to address this issue?
Use Linux sar or an equivalent, then manually tabulate and monitor the results with external systems.