server: collect system CPU, RAM, Network, IO metrics #21416

Closed
robert-s-lee opened this issue Jan 12, 2018 · 6 comments

FEATURE REQUEST

Monitoring of system capacity (CPU, RAM, Network, IO) is critical to maintaining a stable and performant system.

Performance

  • Running CPU at close to 100% utilization with a high run queue results in poor performance.
  • Running RAM at close to 100% utilization triggers the Linux OOM killer and/or swapping, resulting in poor performance or stability issues.
  • Running storage at 100% capacity causes writes to fail, which stops various processes.
  • Running storage at 100% read/write utilization causes poor service times.
  • Running the network at 100% utilization causes poor response times between the database and clients.

Stability

A distributed system's ability to detect the survival status of member nodes depends on reasonable and predictable responses to health checks and metadata exchanges. A shortage of system resources degrades the performance of these tasks, which can trigger pre-defined heuristics into initiating failover scenarios.

blocker, must-have, should-have, nice-to-have

At a bare minimum, the following system capacity metrics should be tracked at a 1-minute interval and retained for some amount of time (a minimal sampling sketch follows the list).

  • CPU Utilization, CPU Run Queue
  • RAM Utilization, RAM shortage (swap)
  • Disk utilization, Disk queue size
  • Disk capacity
  • Network utilization, Network error rate
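
To make the interval and retention requirement concrete, here is a minimal sketch in Go; the `sample` type, its field names, and the `collect` stub are illustrative assumptions, not existing CockroachDB code:

```go
package main

import "time"

// sample is a hypothetical snapshot of the capacity metrics listed above;
// the fields mirror that list and are not any existing cockroach type.
type sample struct {
	taken        time.Time
	cpuUtilPct   float64 // CPU utilization
	runQueueLen  int     // CPU run queue
	ramUtilPct   float64 // RAM utilization
	swapUsedMB   uint64  // RAM shortage (swap)
	diskUtilPct  float64 // disk utilization
	diskQueueLen int     // disk queue size
	diskFreePct  float64 // disk capacity
	netUtilPct   float64 // network utilization
	netErrRate   float64 // network error rate
}

// collect would read the counters from the OS (e.g. /proc, or a library
// such as gopsutil, discussed later in this thread); stubbed out here.
func collect() sample { return sample{taken: time.Now()} }

func main() {
	const retention = 24 * time.Hour // "some amount of time"
	var history []sample

	ticker := time.NewTicker(time.Minute) // the 1-minute interval above
	defer ticker.Stop()
	for now := range ticker.C {
		history = append(history, collect())
		// Discard samples that have aged out of the retention window.
		cutoff := now.Add(-retention)
		for len(history) > 0 && history[0].taken.Before(cutoff) {
			history = history[1:]
		}
	}
}
```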

Nice to have:

Often, the configuration of the system can change (such as adding more CPU and RAM to the server, or adding more disk and network capacity). The configuration changes should also be saved so that utilization, often expressed in %, can be translated into absolute units. Absolute units are useful in capacity planning and cloud migration. For example, 10% utilization on a 1GHz CPU means something different than 10% utilization on a 3GHz CPU. CockroachDB allows online migration from Cloud A to Cloud B, or from older machines to newer machines. Having this information allows planning of these activities based on accurate historical data.
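
To illustrate the percent-to-absolute translation, a small sketch; the `hardwareConfig` type and the core-GHz unit are assumptions made for the example, not a proposed schema:

```go
package main

import "fmt"

// hardwareConfig records the machine's capacity at sample time. If the
// configuration changes (more cores, faster cores), a new record would be
// saved alongside the utilization samples. The names are illustrative.
type hardwareConfig struct {
	numCores     int
	coreSpeedGHz float64
}

// absoluteCPU translates a utilization percentage into core-GHz of actual
// compute, which is comparable across machines in a way a bare percent is not.
func absoluteCPU(cfg hardwareConfig, utilPct float64) float64 {
	return utilPct / 100 * float64(cfg.numCores) * cfg.coreSpeedGHz
}

func main() {
	older := hardwareConfig{numCores: 4, coreSpeedGHz: 1.0}
	newer := hardwareConfig{numCores: 4, coreSpeedGHz: 3.0}
	// The same 10% reads very differently in absolute terms:
	fmt.Printf("10%% on the old box = %.1f core-GHz\n", absoluteCPU(older, 10)) // 0.4
	fmt.Printf("10%% on the new box = %.1f core-GHz\n", absoluteCPU(newer, 10)) // 1.2
}
```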

Workarounds to address this issue?

Use Linux sar or an equivalent tool, and manually tabulate and monitor with external systems.

dianasaur323 commented Jan 16, 2018

Notes from meeting:

  • CPU time isn't enough, because you can't get a sense of utilization
  • Also need CPU run queue to see how far over 100% we are
  • Need to have a time series so that we can do historical analysis
  • If we don't have time to do this, at least document it

@dianasaur323 dianasaur323 added this to the 2.0 milestone Jan 16, 2018
@couchand couchand modified the milestones: 2.0, Later Jan 29, 2018

vilterp commented Mar 21, 2018

#23733 is an attempt to do at least a little better on our CPU and memory metrics by breaking them out per-node. With regard to getting a sense of utilization: if not the speed of each core, we should at least get the number of cores on each machine.


vilterp commented Mar 26, 2018

Filed #24205 re: number of cores.

@petermattis petermattis changed the title from "collect system CPU, RAM, Network, IO metrics" to "server: collect system CPU, RAM, Network, IO metrics" Mar 29, 2018
@petermattis petermattis modified the milestones: Later, 2.1 Mar 29, 2018
petermattis commented:

The Prometheus node_exporter tool collects node-level metrics. Conveniently, it is written in Go and we could link the libraries it uses into cockroach and export the metrics it provides.
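
As a sketch of the in-process approach, node-level collectors can be registered with the process's own registry via the standard Prometheus Go client. The metric name below is made up, and runtime.NumCPU stands in for the richer collectors node_exporter links in:

```go
package main

import (
	"log"
	"net/http"
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Export a node-level metric from inside the process instead of running
	// a separate node_exporter binary alongside cockroach.
	prometheus.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "sys_cpu_logical_cores", // illustrative name, not a real cockroach metric
			Help: "Number of logical CPU cores visible to the process.",
		},
		func() float64 { return float64(runtime.NumCPU()) },
	))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```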

@couchand couchand added the A-monitoring and C-enhancement labels Apr 24, 2018
@vilterp vilterp assigned piyush-singh and unassigned dianasaur323 May 2, 2018

vilterp commented Jun 29, 2018

We're currently using https://github.com/elastic/gosigar to gather hardware metrics, which is missing some of the things requested here. It looks like we can provide most of them by switching over to https://github.com/shirou/gopsutil. Specifically:

| Stat | Current state (gosigar) | With gopsutil |
| --- | --- | --- |
| CPU usage | We get CPU time directly from gosigar and compute our own percent metric. | Provides the same time metrics and a similar way of calculating percent. |
| CPU cores | Not provided. | Gives info about the system's CPUs, including speed and number of cores. It doesn't take hyperthreading into account, however, so e.g. my laptop shows up as 4 cores although it can run 8 threads at a time. The user may also have configured Docker and/or Kubernetes to not give us all available cores; gopsutil provides RLimit info, which may tell us how the kernel is limiting our resource usage. |
| Memory | Provided for a given process. | Provides just about exactly the same thing for a given process. |
| Run queue | Not provided. | Unclear how to get this, since Go does its own scheduling but doesn't seem to expose a run-queue-size metric in the runtime package. |
| Disk IO (IOPS & throughput for reads & writes; queue size) | Not provided. | Provides disk stats: per-IO-device on Mac and Linux (these could be mapped back to stores, or just summed up and gathered per node); per-process (but not per-device) stats on Linux only; "IOPS in progress" (equivalent to "queue size", I think) on Linux only. |
| Network (bytes & packets sent & received) | Not provided, though we have a gRPC interceptor that records gRPC bytes sent and received between nodes; this is not recorded to the time series database. | Provides net stats for a given process on Linux, but just for the whole host on Mac. |
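
For concreteness, a minimal sketch of reading the stats above with gopsutil (error handling elided for brevity; platform availability varies as noted in the table):

```go
package main

import (
	"fmt"
	"time"

	"github.com/shirou/gopsutil/cpu"
	"github.com/shirou/gopsutil/disk"
	"github.com/shirou/gopsutil/load"
	"github.com/shirou/gopsutil/mem"
	"github.com/shirou/gopsutil/net"
)

func main() {
	// CPU: utilization over a 1-second window, plus core count and speed.
	pct, _ := cpu.Percent(time.Second, false) // false = aggregate over all cores
	infos, _ := cpu.Info()
	fmt.Printf("cpu: %.1f%%, %d CPU(s), first at %.0f MHz\n",
		pct[0], len(infos), infos[0].Mhz)

	// Load average is the closest host-level stand-in for a run queue.
	avg, _ := load.Avg()
	fmt.Printf("load: %.2f %.2f %.2f\n", avg.Load1, avg.Load5, avg.Load15)

	// Memory, host-wide.
	vm, _ := mem.VirtualMemory()
	fmt.Printf("mem: %.1f%% of %d bytes\n", vm.UsedPercent, vm.Total)

	// Disk: per-device counters; IopsInProgress is Linux-only.
	counters, _ := disk.IOCounters()
	for dev, c := range counters {
		fmt.Printf("disk %s: %d reads, %d writes, %d IOPS in progress\n",
			dev, c.ReadCount, c.WriteCount, c.IopsInProgress)
	}

	// Network: byte/packet/error counters summed over all interfaces.
	nics, _ := net.IOCounters(false) // false = aggregate all NICs
	fmt.Printf("net: %d bytes out, %d bytes in, %d recv errors\n",
		nics[0].BytesSent, nics[0].BytesRecv, nics[0].Errin)
}
```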

Design work is needed to figure out where to put these (the node map? an extension of the latency matrix on the debug page?).

Prometheus node_exporter provides some of the same stats as gopsutil, but with some layers of Prometheus machinery around getting the numbers; I'm inclined to use gopsutil.

Filed #27085 to track the work of implementing this for 2.1.


vilterp commented Aug 30, 2018

Closing this, since most of these metrics are now collected. We can open issues for individual things that are still missing.

@vilterp vilterp closed this as completed Aug 30, 2018