
ui, ts: detect, store, & show number of cores on each node #24205

Closed
vilterp opened this issue Mar 26, 2018 · 13 comments
Labels
A-monitoring A-webui-general Issues on the DB Console that span multiple areas or don't have another clear category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Comments

@vilterp (Contributor) commented Mar 26, 2018

Currently, the UI shows CPU usage in various places (e.g. the cluster visualization) but doesn't show the number of available cores anywhere. Thus, one can't know how much CPU is available without going elsewhere (e.g. the GCP or AWS console).

We should track the number of cores each node has, and show that somehow near CPU usage indicators.

This raises a bit of a design question: how should this be notated? The clusterviz shows CPU usage as a percentage, which often exceeds 100, since it's really CPU-seconds per second × 100. Do we normalize by the number of cores to make it a true percentage?
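For example, a figure of 2 CPU-seconds/second on a 4-core machine could be normalized like this (a minimal Go sketch; the function name is illustrative, not an existing CockroachDB API):

```go
package main

import "fmt"

// normalizedCPUPercent converts a usage figure in CPU-seconds per second
// (which can exceed 1.0 on multi-core machines) into a percentage of the
// machine's total capacity, given its number of cores.
func normalizedCPUPercent(cpuSecsPerSec float64, numCores int) float64 {
	return cpuSecsPerSec / float64(numCores) * 100
}

func main() {
	// 2 CPU-seconds/second on a 4-core machine: 200% in the current
	// notation, but 50% of total capacity.
	fmt.Println(normalizedCPUPercent(2.0, 4)) // prints 50
}
```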

cc @mrtracy @couchand

@bdarnell (Contributor):

We should also consider the possibility that we're not the only thing on the machine. If a cockroach process is running alongside some other application process on a four-core machine and each is using half the available CPU, we need to show both that cockroach is using 2 CPU-seconds/second and that the machine as a whole is at 100% CPU utilization. We should collect two new timeseries: total CPUs and idle CPU time.
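A sketch of how those two timeseries could be combined into whole-machine utilization (illustrative Go; the helper name and sampling scheme are assumptions, not existing code):

```go
package main

import "fmt"

// machineCPUUtilization derives whole-machine utilization from two samples
// of cumulative idle CPU time (in CPU-seconds), the sampling interval, and
// the number of CPUs.
func machineCPUUtilization(idleStart, idleEnd, elapsedSecs float64, numCPUs int) float64 {
	capacity := float64(numCPUs) * elapsedSecs // total CPU-seconds available
	idle := idleEnd - idleStart                // CPU-seconds spent idle
	return (capacity - idle) / capacity
}

func main() {
	// 4 CPUs over 10s = 40 CPU-seconds of capacity; 20 of them idle means
	// 50% machine-wide utilization, regardless of how the busy half splits
	// between cockroach and a neighboring process.
	fmt.Println(machineCPUUtilization(100, 120, 10, 4)) // prints 0.5
}
```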

@vilterp (Contributor, Author) commented Mar 26, 2018

Note from in-person convo with @bdarnell: Docker & Kubernetes make it more complicated to know how much CPU is available, since they have concepts of allocating a certain amount of CPU to each process.

Not sure how this fits together or how to get a number on how much CPU is available; do you know @a-robinson?

@couchand (Contributor):

It's a good idea to figure out if we can get this accurately, and if we can we should incorporate it into the UI to add useful context.

But I'm 👎 on reporting CPU utilization as a percentage where the denominator is the total CPU available, because it's so common to report CPU as 100% == single core fully utilized, with multi-core figures regularly going over 100%.

@vilterp (Contributor, Author) commented Mar 26, 2018

Agreed. Maybe we want to show the fraction (<usage> cpu s/s) / (<total> cpu s/s)?

@a-robinson (Contributor):

> Note from in-person convo with @bdarnell: Docker & Kubernetes make it more complicated to know how much CPU is available, since they have concepts of allocating a certain amount of CPU to each process.
>
> Not sure how this fits together or how to get a number on how much CPU is available; do you know @a-robinson?

Yeah, that doesn't mean there isn't a way, but I'm not aware of one that tells you, in the common case, what fraction of the machine's cores the container considers usable. There's also the fact that the kernel can be configured to treat a container's CPU limit as a hard cap (always enforced) or just a soft cap (enforced only when the machine is busy).

If a container is restricted to use only certain CPUs then that will be reflected inside the container, but that's not the mechanism that Kubernetes uses to restrict container CPU usage.

@petermattis (Collaborator):

If possible, we should indicate in the UI if cockroach is restricted to a fraction of a CPU.

@petermattis added this to the 2.1 milestone Mar 29, 2018
@petermattis (Collaborator):

See #21416 (comment). We should investigate using some of the node_exporter libraries. Our Prometheus/Grafana configs seem to provide a reasonable CPU metric.

@couchand added the C-enhancement, A-webui-general, and A-monitoring labels Apr 24, 2018
@vilterp (Contributor, Author) commented Jun 1, 2018

I don't see anything in the node_exporter libraries that gets the number of cores; they just report seconds used by the process, like what we already have from gosigar.

I also haven't been able to find a way to know from inside of a docker container how much CPU is available. In the absence of such an API, maybe we should just call Go's runtime.NumCPU(), report that in the nodes table, and document in a tooltip (and in the docs) that you might not actually be able to use all those CPUs?
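A minimal sketch of that fallback, with the caveat from the proposed tooltip captured in a comment (illustrative only; the helper name is not an existing CockroachDB API):

```go
package main

import (
	"fmt"
	"runtime"
)

// reportedCores returns the core count to surface in the nodes table.
// runtime.NumCPU reports the number of logical CPUs usable by the current
// process; inside a Docker container it reflects CPU affinity masks but
// not CFS quota limits, so the process may not actually get this much CPU.
func reportedCores() int {
	return runtime.NumCPU()
}

func main() {
	fmt.Println(reportedCores() >= 1) // prints true
}
```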

@petermattis (Collaborator):

> I don't see anything in the node_exporter libraries that gets the number of cores; they just report seconds used by the process, like what we already have from gosigar.

Looks like node_exporter gets its cpu metrics from https://github.com/prometheus/procfs/blob/master/stat.go#L62.
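As a sketch of what counting cores from that file could look like, assuming `/proc/stat` has one `cpuN` line per core plus an aggregate `cpu` line (the parser below is illustrative, not node_exporter's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// countCoresFromStat counts the per-core "cpuN" lines in the contents of
// /proc/stat, the same file the prometheus/procfs library parses.
func countCoresFromStat(stat string) int {
	n := 0
	for _, line := range strings.Split(stat, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		// Skip the aggregate "cpu" line; count "cpu0", "cpu1", ...
		if strings.HasPrefix(fields[0], "cpu") && fields[0] != "cpu" {
			n++
		}
	}
	return n
}

func main() {
	sample := "cpu  10 0 20 1000\ncpu0 5 0 10 500\ncpu1 5 0 10 500\n"
	fmt.Println(countCoresFromStat(sample)) // prints 2
}
```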

@vilterp (Contributor, Author) commented Jun 1, 2018

I'm not sure if that reflects limitations imposed by docker/k8s, though.

Also, looks like gosigar has a similar CPU list API: https://github.com/cloudfoundry/gosigar/blob/master/sigar_interface.go#L69-L71

And a ProcCPU API (new since the version we're using): https://github.com/cloudfoundry/gosigar/blob/master/sigar_interface.go#L136-L141

The meaning of these APIs isn't well documented. Will play around with them when the admin UI team gets to our "improve hardware stats" milestone at the end of June.

@petermattis (Collaborator):

In addition to better CPU metrics, I'd really like disk and network stats as well (e.g. disk utilization, network bandwidth).

@a-robinson (Contributor):

> I'm not sure if that reflects limitations imposed by docker/k8s, though.

In most cases, it does not, because those limitations are typically implemented via CFS quota. It's not easy to do anything about that, though. And in our default Kubernetes configs, we don't actually set limits on CPU (or recommend doing so) because of the effect it can have on tail latencies.
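A sketch of the quota arithmetic involved, assuming cgroup-v1 CFS semantics where `cpu.cfs_quota_us = -1` means unlimited (the helper is hypothetical; reading the cgroup files themselves is omitted):

```go
package main

import "fmt"

// cfsCPULimit derives the effective CPU limit from the cgroup-v1 values
// cpu.cfs_quota_us and cpu.cfs_period_us. A non-positive quota means the
// container is unlimited, in which case callers should fall back to the
// machine's core count.
func cfsCPULimit(quotaUS, periodUS int64) (cpus float64, limited bool) {
	if quotaUS <= 0 || periodUS <= 0 {
		return 0, false // unlimited (or unreadable)
	}
	return float64(quotaUS) / float64(periodUS), true
}

func main() {
	// A quota of 200000us per 100000us period caps the container at 2 CPUs.
	cpus, limited := cfsCPULimit(200000, 100000)
	fmt.Println(cpus, limited) // prints 2 true
}
```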

@vilterp (Contributor, Author) commented Aug 30, 2018

Closing this, since we now report the number of CPUs on the machine in the node list; opened #29366 to track the more specific Docker throttling issue.

@vilterp closed this as completed Aug 30, 2018