Add operating system cgroup stats to node-stats telemetry #1663

Merged: 13 commits, Mar 9, 2023
docs/telemetry.rst: 2 additions & 0 deletions

@@ -115,6 +115,7 @@ The node-stats telemetry device regularly calls the `cluster node-stats API <htt
* JVM buffer pool stats (key ``jvm.buffer_pools`` in the node-stats API)
* JVM gc stats (key ``jvm.gc`` in the node-stats API)
* OS mem stats (key ``os.mem`` in the node-stats API)
* OS cgroup stats (key ``os.cgroup`` in the node-stats API)
* JVM mem stats (key ``jvm.mem`` in the node-stats API)
* Circuit breaker stats (key ``breakers`` in the node-stats API)
* Network-related stats (key ``transport`` in the node-stats API)
@@ -132,6 +133,7 @@ Supported telemetry parameters:
* ``node-stats-include-breakers`` (default: ``true``): A boolean indicating whether circuit breaker stats should be included.
* ``node-stats-include-gc`` (default: ``true``): A boolean indicating whether JVM gc stats should be included.
* ``node-stats-include-mem`` (default: ``true``): A boolean indicating whether both JVM heap and OS mem stats should be included.
* ``node-stats-include-cgroup`` (default: ``true``): A boolean indicating whether operating system cgroup stats should be included.
* ``node-stats-include-network`` (default: ``true``): A boolean indicating whether network-related stats should be included.
* ``node-stats-include-process`` (default: ``true``): A boolean indicating whether process cpu stats should be included.
* ``node-stats-include-indexing-pressure`` (default: ``true``): A boolean indicating whether indexing pressure stats should be included.
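For example, to enable node-stats telemetry while excluding cgroup stats, an invocation could look like this (illustrative only; the track name and other flags are placeholders):

    esrally race --track=geonames --telemetry="node-stats" --telemetry-params="node-stats-include-cgroup:false"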
esrally/telemetry.py: 10 additions & 0 deletions

@@ -816,6 +816,7 @@ def __init__(self, telemetry_params, cluster_name, client, metrics_store):
self.include_network = telemetry_params.get("node-stats-include-network", True)
self.include_process = telemetry_params.get("node-stats-include-process", True)
self.include_mem_stats = telemetry_params.get("node-stats-include-mem", True)
self.include_cgroup_stats = telemetry_params.get("node-stats-include-cgroup", True)
self.include_gc_stats = telemetry_params.get("node-stats-include-gc", True)
self.include_indexing_pressure = telemetry_params.get("node-stats-include-indexing-pressure", True)
self.client = client
@@ -845,6 +846,8 @@ def record(self):
if self.include_mem_stats:
collected_node_stats.update(self.jvm_mem_stats(node_name, node_stats))
collected_node_stats.update(self.os_mem_stats(node_name, node_stats))
if self.include_cgroup_stats:
collected_node_stats.update(self.os_cgroup_stats(node_name, node_stats))
if self.include_gc_stats:
collected_node_stats.update(self.jvm_gc_stats(node_name, node_stats))
if self.include_network:
@@ -907,6 +910,13 @@ def jvm_mem_stats(self, node_name, node_stats):
def os_mem_stats(self, node_name, node_stats):
return self.flatten_stats_fields(prefix="os_mem", stats=node_stats["os"]["mem"])

def os_cgroup_stats(self, node_name, node_stats):
# Convert strings returned by the Node Stats API for os.cgroup.memory limits
# https://github.com/elastic/elasticsearch/issues/93429
for k in ("limit_in_bytes", "usage_in_bytes"):
node_stats["os"]["cgroup"]["memory"].update({k: int(node_stats["os"]["cgroup"]["memory"].get(k))})
Contributor:
This doesn't look safe to me. First of all, I see that node-stats-include-cgroup defaults to true, so these stats will always be collected whenever the telemetry device is enabled.

But then you are assuming that the node_stats dict already has the keys os / cgroup / memory here.

In the logs against a fresh 8.6.1 cluster I saw the following error:

2023-02-06 11:02:25,752 ActorAddr-(T|:37073)/PID:439715 esrally.telemetry ERROR Could not determine node stats
Traceback (most recent call last):

  File "/home/dl/source/elastic/rally/esrally/telemetry.py", line 172, in run
    self.recorder.record()

  File "/home/dl/source/elastic/rally/esrally/telemetry.py", line 850, in record
    collected_node_stats.update(self.os_cgroup_stats(node_name, node_stats))

  File "/home/dl/source/elastic/rally/esrally/telemetry.py", line 917, in os_cgroup_stats
    node_stats["os"]["cgroup"]["memory"].update({k: int(node_stats["os"]["cgroup"]["memory"].get(k))})

ValueError: invalid literal for int() with base 10: 'max'

When I did some test runs against a real metric store, I didn't find any os.cgroup-related stats (target ES 8.6.1 running in Docker).

When we aren't sure that a dict has the keys present, we should use the setdefault() method.

Member Author:
Would you mind sharing the output of GET /_nodes/stats?filter_path=**.os.cgroup from your 8.6.1 cluster? I did not encounter any values named max, though it looks like ES will try to fetch it as memory.max from the cgroup fs along with memory.current.

I'll need to rework this and fix the tests.

Contributor:
> Would you mind sharing the output of GET /_nodes/stats?filter_path=**.os.cgroup from your 8.6.1 cluster?

Sure thing. Here you go:

{
  "nodes" : {
    "yPfEvz0MSHS1A9Oq4YBWzQ" : {
      "os" : {
        "cgroup" : {
          "cpuacct" : {
            "control_group" : "/",
            "usage_nanos" : 62952671
          },
          "cpu" : {
            "control_group" : "/",
            "cfs_period_micros" : 100000,
            "cfs_quota_micros" : -1,
            "stat" : {
              "number_of_elapsed_periods" : 0,
              "number_of_times_throttled" : 0,
              "time_throttled_nanos" : 0
            }
          },
          "memory" : {
            "control_group" : "/",
            "limit_in_bytes" : "max",
            "usage_in_bytes" : "35197526016"
          }
        }
      }
    }
  }
}

return self.flatten_stats_fields(prefix="os_cgroup", stats=node_stats["os"]["cgroup"])

def jvm_gc_stats(self, node_name, node_stats):
return self.flatten_stats_fields(prefix="jvm_gc", stats=node_stats["jvm"]["gc"])

tests/telemetry_test.py: 24 additions & 0 deletions

@@ -2089,6 +2089,14 @@ class TestNodeStatsRecorder:
"os_mem_used_in_bytes": 57342185472,
"os_mem_free_percent": 8,
"os_mem_used_percent": 92,
"os_cgroup_cpuacct_usage_nanos": 1394207523870751,
"os_cgroup_cpu_cfs_period_micros": 100000,
"os_cgroup_cpu_cfs_quota_micros": 793162,
"os_cgroup_cpu_stat_number_of_elapsed_periods": 41092415,
"os_cgroup_cpu_stat_number_of_times_throttled": 41890,
"os_cgroup_cpu_stat_time_throttled_nanos": 29380593023188,
"os_cgroup_memory_limit_in_bytes": 62277025792,
"os_cgroup_memory_usage_in_bytes": 57342185472,
"process_cpu_percent": 10,
"process_cpu_total_in_millis": 56520,
"breakers_parent_limit_size_in_bytes": 726571417,
@@ -2481,6 +2489,14 @@ def test_stores_all_nodes_stats(self, metrics_store_put_doc):
"os_mem_used_in_bytes": 57342185472,
"os_mem_free_percent": 8,
"os_mem_used_percent": 92,
"os_cgroup_cpuacct_usage_nanos": 1394207523870751,
"os_cgroup_cpu_cfs_period_micros": 100000,
"os_cgroup_cpu_cfs_quota_micros": 793162,
"os_cgroup_cpu_stat_number_of_elapsed_periods": 41092415,
"os_cgroup_cpu_stat_number_of_times_throttled": 41890,
"os_cgroup_cpu_stat_time_throttled_nanos": 29380593023188,
"os_cgroup_memory_limit_in_bytes": 62277025792,
"os_cgroup_memory_usage_in_bytes": 57342185472,
"transport_rx_count": 77,
"transport_rx_size_in_bytes": 98723498,
"transport_server_open": 12,
@@ -2794,6 +2810,14 @@ def test_stores_selected_indices_metrics_from_nodes_stats(self, metrics_store_put_doc):
"os_mem_used_in_bytes": 57342185472,
"os_mem_free_percent": 8,
"os_mem_used_percent": 92,
"os_cgroup_cpuacct_usage_nanos": 1394207523870751,
"os_cgroup_cpu_cfs_period_micros": 100000,
"os_cgroup_cpu_cfs_quota_micros": 793162,
"os_cgroup_cpu_stat_number_of_elapsed_periods": 41092415,
"os_cgroup_cpu_stat_number_of_times_throttled": 41890,
"os_cgroup_cpu_stat_time_throttled_nanos": 29380593023188,
"os_cgroup_memory_limit_in_bytes": 62277025792,
"os_cgroup_memory_usage_in_bytes": 57342185472,
"transport_rx_count": 77,
"transport_rx_size_in_bytes": 98723498,
"transport_server_open": 12,