Add operating system cgroup stats to node-stats telemetry #1663
Thank you for the PR.

I requested that we make this more resilient (I also saw some errors in my setup).
esrally/telemetry.py (outdated diff):

```python
# Convert strings returned by the Node Stats API for os.cgroup.memory limits
# https://github.com/elastic/elasticsearch/issues/93429
for k in ("limit_in_bytes", "usage_in_bytes"):
    node_stats["os"]["cgroup"]["memory"].update({k: int(node_stats["os"]["cgroup"]["memory"].get(k))})
```
This doesn't look safe to me. First of all, I see that `node-stats-include-cgroup` is set to `true` by default, so it will always try to collect those stats when the telemetry device is enabled. But then you are assuming that the dict `node_stats` already has the keys `os` / `cgroup` / `memory` here.

In the logs against a fresh 8.6.1 cluster I saw the following error:
```
2023-02-06 11:02:25,752 ActorAddr-(T|:37073)/PID:439715 esrally.telemetry ERROR Could not determine node stats
Traceback (most recent call last):
  File "/home/dl/source/elastic/rally/esrally/telemetry.py", line 172, in run
    self.recorder.record()
  File "/home/dl/source/elastic/rally/esrally/telemetry.py", line 850, in record
    collected_node_stats.update(self.os_cgroup_stats(node_name, node_stats))
  File "/home/dl/source/elastic/rally/esrally/telemetry.py", line 917, in os_cgroup_stats
    node_stats["os"]["cgroup"]["memory"].update({k: int(node_stats["os"]["cgroup"]["memory"].get(k))})
ValueError: invalid literal for int() with base 10: 'max'
```
When I did some test runs against a real metric store, I didn't find any os cgroup related stats (target ES 8.6.1 running in Docker).

When we aren't sure that a dict has the keys present, we should use the `setdefault()` method.
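One way to make the conversion resilient to both missing keys and non-numeric values like `'max'` is to traverse the nested dict defensively and skip values that can't be converted. This is a sketch of the idea, not the code that was eventually merged; the function name is an assumption:

```python
def os_cgroup_memory_stats(node_stats):
    """Convert cgroup memory values that the Node Stats API returns as strings.

    Missing keys and non-numeric values such as "max" (cgroup v2's marker
    for "no limit") are skipped instead of being passed to int().
    """
    memory = node_stats.get("os", {}).get("cgroup", {}).get("memory", {})
    converted = {}
    for k in ("limit_in_bytes", "usage_in_bytes"):
        value = memory.get(k)
        try:
            converted[k] = int(value)
        except (TypeError, ValueError):
            # value is None (key missing) or a sentinel like "max"
            pass
    return converted
```

This avoids the `KeyError` on clusters that report no cgroup stats and the `ValueError` on cgroup v2 hosts with no memory limit configured.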
Would you mind sharing the output of `GET /_nodes/stats?filter_path=**.os.cgroup` from your 8.6.1 cluster? I did not encounter any values named `max`, though it looks like ES will try fetching it as `memory.max` from the cgroup fs along with `memory.current`.
I'll need to rework this and fix the tests.
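Any rework would also want unit tests covering the `max` case. A hypothetical pytest-style sketch (the helper name and behavior of preserving the sentinel are assumptions, not the PR's actual tests):

```python
def convert_cgroup_memory(memory):
    # Hypothetical helper mirroring the conversion under discussion:
    # numeric strings become ints, non-numeric values such as cgroup v2's
    # "max" sentinel are passed through unchanged.
    return {k: int(v) if str(v).isdigit() else v for k, v in memory.items()}

def test_numeric_strings_are_converted():
    assert convert_cgroup_memory({"usage_in_bytes": "35197526016"}) == {"usage_in_bytes": 35197526016}

def test_max_sentinel_is_preserved():
    assert convert_cgroup_memory({"limit_in_bytes": "max"}) == {"limit_in_bytes": "max"}
```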
> Would you mind sharing the output of `GET /_nodes/stats?filter_path=**.os.cgroup` from your 8.6.1 cluster? I did not encounter any values named `max`, though it looks like ES will try fetching it as `memory.max` from the cgroup fs along with `memory.current`.

Sure thing, here you go:
```json
{
  "nodes" : {
    "yPfEvz0MSHS1A9Oq4YBWzQ" : {
      "os" : {
        "cgroup" : {
          "cpuacct" : {
            "control_group" : "/",
            "usage_nanos" : 62952671
          },
          "cpu" : {
            "control_group" : "/",
            "cfs_period_micros" : 100000,
            "cfs_quota_micros" : -1,
            "stat" : {
              "number_of_elapsed_periods" : 0,
              "number_of_times_throttled" : 0,
              "time_throttled_nanos" : 0
            }
          },
          "memory" : {
            "control_group" : "/",
            "limit_in_bytes" : "max",
            "usage_in_bytes" : "35197526016"
          }
        }
      }
    }
  }
}
```
This PR has been converted to a draft and placed on hold until serverless integration checkpoint one is finished. For now, it is enough to visualize container-level metrics using readily available CSP dashboards for GKE clusters and subordinate resources.
Force-pushed `2dbc27e` to `9a42ad7`.
In the PR description you state that cgroup stats are enabled by default, but the docs and code show that they aren't. I think we should include them by default, but only log a single warning if they aren't available (rather than a warning on every collection attempt).

I did a quick comparison of `os_cgroup_cpuacct_usage_nanos` against what we collect in Elastic Cloud and can confirm the values are accurate.
We discussed this offline with @inqueue, and since pretty much everything nowadays runs under a cgroup, I agree; let's set the default to `true`.
LGTM (and let's change the default to `true`).
@b-deam It looks like, in order to log just a single line, it would need to collect a node stats sample up front and check if the key exists there.
I think that's an even better idea; upon sleeping on it, my suggestion probably wasn't the best one.
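The one-time check discussed above could be sketched roughly as follows. This is a hypothetical helper, not code from the PR; the function name and log message are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

def cgroup_stats_available(sample_node_stats):
    """Inspect a single node-stats sample up front so the telemetry device can
    warn exactly once when cgroup stats are missing, instead of logging a
    warning on every collection attempt."""
    available = "cgroup" in sample_node_stats.get("os", {})
    if not available:
        logger.warning("cgroup stats were requested but are not present in node stats; skipping.")
    return available
```

The result could then be cached for the rest of the benchmark so later collections skip the cgroup section silently.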
This PR adds operating system cgroup stats to the `node-stats` telemetry device.

Note: the Nodes Stats API returns the values under `os.cgroup.memory` as strings, and they are converted to integers in this change. elastic/elasticsearch#93429 requests that the API return integer types for these fields.