Add operating system cgroup stats to node-stats telemetry #1663
Thank you for the PR.

I requested that we make this more resilient (I also saw some errors in my setup).
esrally/telemetry.py (outdated diff):

```python
# Convert strings returned by the Node Stats API for os.cgroup.memory limits
# https://github.com/elastic/elasticsearch/issues/93429
for k in ("limit_in_bytes", "usage_in_bytes"):
    node_stats["os"]["cgroup"]["memory"].update({k: int(node_stats["os"]["cgroup"]["memory"].get(k))})
```
This doesn't look safe to me. First of all, I see that `node-stats-include-cgroup` is set to `true` by default, so it will always try to collect those stats when the telemetry device is enabled. But then you are assuming that the dict `node_stats` already has the keys `os` / `cgroup` / `memory` here.

In the logs against a fresh 8.6.1 cluster I saw the following error:
```
2023-02-06 11:02:25,752 ActorAddr-(T|:37073)/PID:439715 esrally.telemetry ERROR Could not determine node stats
Traceback (most recent call last):
  File "/home/dl/source/elastic/rally/esrally/telemetry.py", line 172, in run
    self.recorder.record()
  File "/home/dl/source/elastic/rally/esrally/telemetry.py", line 850, in record
    collected_node_stats.update(self.os_cgroup_stats(node_name, node_stats))
  File "/home/dl/source/elastic/rally/esrally/telemetry.py", line 917, in os_cgroup_stats
    node_stats["os"]["cgroup"]["memory"].update({k: int(node_stats["os"]["cgroup"]["memory"].get(k))})
ValueError: invalid literal for int() with base 10: 'max'
```
When I did some test runs against a real metric store, I didn't find any os cgroup related stats (target ES 8.6.1 running in Docker).

When we aren't sure that a dict has the keys present, we should use the `setdefault()` method.
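One way to make the conversion resilient to both missing keys and non-numeric values like `'max'` is to traverse the nested dict defensively and skip values that can't be converted. This is a sketch of the idea, not the code that was eventually merged; the function name is an assumption:

```python
def os_cgroup_memory_stats(node_stats):
    """Convert cgroup memory values that the Node Stats API returns as strings.

    Missing keys and non-numeric values such as "max" (cgroup v2's marker
    for "no limit") are skipped instead of being passed to int().
    """
    memory = node_stats.get("os", {}).get("cgroup", {}).get("memory", {})
    converted = {}
    for k in ("limit_in_bytes", "usage_in_bytes"):
        value = memory.get(k)
        try:
            converted[k] = int(value)
        except (TypeError, ValueError):
            # value is None (key missing) or a sentinel like "max"
            pass
    return converted
```

This avoids the `KeyError` on clusters that report no cgroup stats and the `ValueError` on cgroup v2 hosts with no memory limit configured.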
Would you mind sharing the output of `GET /_nodes/stats?filter_path=**.os.cgroup` from your 8.6.1 cluster? I did not encounter any values named `max`, though it looks like ES will try fetching it as `memory.max` from the cgroup fs along with `memory.current`.
I'll need to rework this and fix the tests.
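Any rework would also want unit tests covering the `max` case. A hypothetical pytest-style sketch (the helper name and behavior of preserving the sentinel are assumptions, not the PR's actual tests):

```python
def convert_cgroup_memory(memory):
    # Hypothetical helper mirroring the conversion under discussion:
    # numeric strings become ints, non-numeric values such as cgroup v2's
    # "max" sentinel are passed through unchanged.
    return {k: int(v) if str(v).isdigit() else v for k, v in memory.items()}

def test_numeric_strings_are_converted():
    assert convert_cgroup_memory({"usage_in_bytes": "35197526016"}) == {"usage_in_bytes": 35197526016}

def test_max_sentinel_is_preserved():
    assert convert_cgroup_memory({"limit_in_bytes": "max"}) == {"limit_in_bytes": "max"}
```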
> Would you mind sharing the output of `GET /_nodes/stats?filter_path=**.os.cgroup` from your 8.6.1 cluster? I did not encounter any values named `max`, though it looks like ES will try fetching it as `memory.max` from the cgroup fs along with `memory.current`.

Sure thing, here you go:
```json
{
  "nodes" : {
    "yPfEvz0MSHS1A9Oq4YBWzQ" : {
      "os" : {
        "cgroup" : {
          "cpuacct" : {
            "control_group" : "/",
            "usage_nanos" : 62952671
          },
          "cpu" : {
            "control_group" : "/",
            "cfs_period_micros" : 100000,
            "cfs_quota_micros" : -1,
            "stat" : {
              "number_of_elapsed_periods" : 0,
              "number_of_times_throttled" : 0,
              "time_throttled_nanos" : 0
            }
          },
          "memory" : {
            "control_group" : "/",
            "limit_in_bytes" : "max",
            "usage_in_bytes" : "35197526016"
          }
        }
      }
    }
  }
}
```
This PR has been converted to a draft and placed on hold until serverless integration checkpoint one is finished. For now, it is enough to visualize container-level metrics using readily available CSP dashboards for GKE clusters and subordinate resources.
Force-pushed `2dbc27e` to `9a42ad7`.
In the PR description you state that cgroup stats are enabled by default, but the docs and code show that they aren't. I think we should include them by default, but only log a single warning if they aren't available (rather than a warning on every collection attempt).

I did a quick comparison of `os_cgroup_cpuacct_usage_nanos` against what we collect in Elastic Cloud and can confirm the values are accurate.
We discussed this offline with @inqueue, and since pretty much everything nowadays runs under a cgroup, I agree; let's set the default to `true`.
LGTM (and let's change the default to `true`).
@b-deam It looks like, in order to log just a single line, it would need to collect a node stats sample up front and check if the key exists there.
I think that's an even better idea; upon sleeping on it, my suggestion probably wasn't the best one.
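The one-time check discussed above could be sketched roughly as follows. This is a hypothetical helper, not code from the PR; the function name and log message are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

def cgroup_stats_available(sample_node_stats):
    """Inspect a single node-stats sample up front so the telemetry device can
    warn exactly once when cgroup stats are missing, instead of logging a
    warning on every collection attempt."""
    available = "cgroup" in sample_node_stats.get("os", {})
    if not available:
        logger.warning("cgroup stats were requested but are not present in node stats; skipping.")
    return available
```

The result could then be cached for the rest of the benchmark so later collections skip the cgroup section silently.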
This PR adds operating system cgroup stats to the `node-stats` telemetry device.

Note: the Nodes Stats API returns the values under `os.cgroup.memory` as strings, and they are converted to integers in this change. elastic/elasticsearch#93429 requests that the API return integer types for these fields.