server: collect system CPU, RAM, Network, IO metrics #21416

Closed
robert-s-lee opened this issue Jan 12, 2018 · 6 comments

FEATURE REQUEST

Monitoring of system capacity (CPU, RAM, Network, IO) is critical to maintaining a stable and performant system.

Performance

  • Running CPU at close to 100% utilization with a high run queue results in poor performance.
  • Running RAM at close to 100% utilization triggers the Linux OOM killer and/or swapping, resulting in poor performance or stability issues.
  • Running storage at 100% capacity causes writes to fail, which stops various processes.
  • Running storage at 100% read/write utilization causes poor service times.
  • Running the network at 100% utilization causes poor response times between the database and clients.

Stability

A distributed system's ability to detect the survival status of member nodes depends on reasonable and predictable responses to health checks and metadata exchanges. A shortage of system resources degrades the performance of these tasks, which can trigger pre-defined heuristics into initiating failover scenarios.

blocker, must-have, should-have, nice-to-have

At a bare minimum, the following system capacity metrics should be tracked at a 1-minute interval and retained for some amount of time (a minimal sampling sketch follows the list).

  • CPU Utilization, CPU Run Queue
  • RAM Utilization, RAM shortage (swap)
  • Disk utilization, Disk queue size
  • Disk capacity
  • Network utilization, Network error rate
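
To make the interval and retention requirement concrete, here is a minimal sketch in Go; the `sample` type, its field names, and the `collect` stub are illustrative assumptions, not existing CockroachDB code:

```go
package main

import "time"

// sample is a hypothetical snapshot of the capacity metrics listed above;
// the fields mirror that list and are not any existing cockroach type.
type sample struct {
	taken        time.Time
	cpuUtilPct   float64 // CPU utilization
	runQueueLen  int     // CPU run queue
	ramUtilPct   float64 // RAM utilization
	swapUsedMB   uint64  // RAM shortage (swap)
	diskUtilPct  float64 // disk utilization
	diskQueueLen int     // disk queue size
	diskFreePct  float64 // disk capacity
	netUtilPct   float64 // network utilization
	netErrRate   float64 // network error rate
}

// collect would read the counters from the OS (e.g. /proc, or a library
// such as gopsutil, discussed later in this thread); stubbed out here.
func collect() sample { return sample{taken: time.Now()} }

func main() {
	const retention = 24 * time.Hour // "some amount of time"
	var history []sample

	ticker := time.NewTicker(time.Minute) // the 1-minute interval above
	defer ticker.Stop()
	for now := range ticker.C {
		history = append(history, collect())
		// Discard samples that have aged out of the retention window.
		cutoff := now.Add(-retention)
		for len(history) > 0 && history[0].taken.Before(cutoff) {
			history = history[1:]
		}
	}
}
```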

Nice to have:

Often, the configuration of the system can change (such as adding more CPU and RAM to the server, or adding more disk and network capacity). The configuration changes should also be saved so that utilization, often expressed in %, can be translated into absolute units. Absolute units are useful in capacity planning and cloud migration. For example, 10% utilization on a 1GHz CPU means something different than 10% utilization on a 3GHz CPU. CockroachDB allows online migration from Cloud A to Cloud B, or from older machines to newer machines. Having this information allows planning of these activities based on accurate historical data.
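
To illustrate the percent-to-absolute translation, a small sketch; the `hardwareConfig` type and the core-GHz unit are assumptions made for the example, not a proposed schema:

```go
package main

import "fmt"

// hardwareConfig records the machine's capacity at sample time. If the
// configuration changes (more cores, faster cores), a new record would be
// saved alongside the utilization samples. The names are illustrative.
type hardwareConfig struct {
	numCores     int
	coreSpeedGHz float64
}

// absoluteCPU translates a utilization percentage into core-GHz of actual
// compute, which is comparable across machines in a way a bare percent is not.
func absoluteCPU(cfg hardwareConfig, utilPct float64) float64 {
	return utilPct / 100 * float64(cfg.numCores) * cfg.coreSpeedGHz
}

func main() {
	older := hardwareConfig{numCores: 4, coreSpeedGHz: 1.0}
	newer := hardwareConfig{numCores: 4, coreSpeedGHz: 3.0}
	// The same 10% reads very differently in absolute terms:
	fmt.Printf("10%% on the old box = %.1f core-GHz\n", absoluteCPU(older, 10)) // 0.4
	fmt.Printf("10%% on the new box = %.1f core-GHz\n", absoluteCPU(newer, 10)) // 1.2
}
```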

Workarounds to address this issue?

Use Linux sar or an equivalent tool, and manually tabulate and monitor with external systems.

dianasaur323 commented Jan 16, 2018

Notes from meeting:

  • CPU time isn't enough, because you can't get a sense of utilization
  • Also need CPU run queue to see how far over 100% we are
  • Need to have a time series so that we can do historical analysis
  • If we don't have time to do this, at least document it

@dianasaur323 dianasaur323 added this to the 2.0 milestone Jan 16, 2018
@couchand couchand modified the milestones: 2.0, Later Jan 29, 2018

vilterp commented Mar 21, 2018

#23733 is an attempt to do at least a little better on our CPU and memory metrics by breaking them out per-node. With regard to getting a sense of utilization: if not the speed of each core, we should at least get the number of cores on each machine.


vilterp commented Mar 26, 2018

Filed #24205 re: number of cores.

@petermattis petermattis changed the title from "collect system CPU, RAM, Network, IO metrics" to "server: collect system CPU, RAM, Network, IO metrics" Mar 29, 2018
@petermattis petermattis modified the milestones: Later, 2.1 Mar 29, 2018
petermattis commented:

The Prometheus node_exporter tool collects node-level metrics. Conveniently, it is written in Go and we could link the libraries it uses into cockroach and export the metrics it provides.
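
As a sketch of the in-process approach, node-level collectors can be registered with the process's own registry via the standard Prometheus Go client. The metric name below is made up, and runtime.NumCPU stands in for the richer collectors node_exporter links in:

```go
package main

import (
	"log"
	"net/http"
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Export a node-level metric from inside the process instead of running
	// a separate node_exporter binary alongside cockroach.
	prometheus.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "sys_cpu_logical_cores", // illustrative name, not a real cockroach metric
			Help: "Number of logical CPU cores visible to the process.",
		},
		func() float64 { return float64(runtime.NumCPU()) },
	))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```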

@couchand couchand added the A-monitoring and C-enhancement labels Apr 24, 2018
@vilterp vilterp assigned piyush-singh and unassigned dianasaur323 May 2, 2018

vilterp commented Jun 29, 2018

We're currently using https://github.com/elastic/gosigar to gather hardware metrics, which is missing some of the things requested here. It looks like we can provide most of them by switching over to https://github.com/shirou/gopsutil. Specifically:

| Stat | Current state (gosigar) | With gopsutil |
| --- | --- | --- |
| CPU usage | We get CPU time directly from gosigar and compute our own percent metric. | Provides the same time metrics and a similar way of calculating percent. |
| CPU cores | Not provided. | Gives info about the system's CPUs, including speed and number of cores. It doesn't take hyperthreading into account, however, so e.g. my laptop shows up as 4 cores although it can run 8 threads at a time. The user may also have configured Docker and/or Kubernetes to not give us all available cores; gopsutil provides RLimit info, which may tell us how the kernel is limiting our resource usage. |
| Memory | Provided for a given process. | Provides just about exactly the same thing for a given process. |
| Run queue | Not provided. | Unclear how to get this, since Go does its own scheduling but doesn't seem to expose a run-queue-size metric in the runtime package. |
| Disk IO (IOPS & throughput for reads & writes; queue size) | Not provided. | Provides disk stats: per-IO-device on Mac and Linux (these could be mapped back to stores, or just summed up and gathered per node); per-process (but not per-device) stats on Linux only; "IOPS in progress" (equivalent to "queue size", I think) on Linux only. |
| Network (bytes & packets sent & received) | Not provided, though we have a gRPC interceptor that records gRPC bytes sent and received between nodes; this is not recorded to the time series database. | Provides net stats for a given process on Linux, but just for the whole host on Mac. |
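
For concreteness, a minimal sketch of reading the stats above with gopsutil (error handling elided for brevity; platform availability varies as noted in the table):

```go
package main

import (
	"fmt"
	"time"

	"github.com/shirou/gopsutil/cpu"
	"github.com/shirou/gopsutil/disk"
	"github.com/shirou/gopsutil/load"
	"github.com/shirou/gopsutil/mem"
	"github.com/shirou/gopsutil/net"
)

func main() {
	// CPU: utilization over a 1-second window, plus core count and speed.
	pct, _ := cpu.Percent(time.Second, false) // false = aggregate over all cores
	infos, _ := cpu.Info()
	fmt.Printf("cpu: %.1f%%, %d CPU(s), first at %.0f MHz\n",
		pct[0], len(infos), infos[0].Mhz)

	// Load average is the closest host-level stand-in for a run queue.
	avg, _ := load.Avg()
	fmt.Printf("load: %.2f %.2f %.2f\n", avg.Load1, avg.Load5, avg.Load15)

	// Memory, host-wide.
	vm, _ := mem.VirtualMemory()
	fmt.Printf("mem: %.1f%% of %d bytes\n", vm.UsedPercent, vm.Total)

	// Disk: per-device counters; IopsInProgress is Linux-only.
	counters, _ := disk.IOCounters()
	for dev, c := range counters {
		fmt.Printf("disk %s: %d reads, %d writes, %d IOPS in progress\n",
			dev, c.ReadCount, c.WriteCount, c.IopsInProgress)
	}

	// Network: byte/packet/error counters summed over all interfaces.
	nics, _ := net.IOCounters(false) // false = aggregate all NICs
	fmt.Printf("net: %d bytes out, %d bytes in, %d recv errors\n",
		nics[0].BytesSent, nics[0].BytesRecv, nics[0].Errin)
}
```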

Design work is needed to figure out where to put these (the node map? an extension of the latency matrix on the debug page?).

Prometheus node_exporter provides some of the same stats as gopsutil, but with some layers of Prometheus machinery around getting the numbers; I'm inclined to use gopsutil.

Filed #27085 to track the work of implementing this for 2.1.


vilterp commented Aug 30, 2018

Closing this, since most of these metrics are now collected. We can open issues for individual things that are still missing.

@vilterp vilterp closed this as completed Aug 30, 2018