[receiver/k8sclusterreceiver] Some k8s node and container cpu metrics are not being reported properly #8115

Closed
jvoravong opened this issue Feb 24, 2022 · 1 comment
Labels: bug (Something isn't working), comp:kubernetes (Kubernetes-related components)

jvoravong commented Feb 24, 2022

Describe the bug
The otel collector k8sclusterreceiver was meant to have 1:1 functionality with the SignalFx agent Kubernetes cluster receiver so that SignalFx users could migrate to the otel collector without issue. Some users have reported metric discrepancies between the otel collector k8sclusterreceiver and the SignalFx agent Kubernetes cluster receiver, and these discrepancies are blocking them from migrating. We also want the otel collector k8sclusterreceiver to follow the OpenTelemetry CPU metric semantics (the SignalFx agent already follows this spec).

The problematic metrics:

  • k8s.container.cpu_request, a bug causes a 1000x scaling issue
  • k8s.container.cpu_limit, a bug causes a 1000x scaling issue
  • k8s.node.allocatable_cpu, a bug silently prevents this metric from being reported at all

Steps to reproduce
For k8s.container.cpu_request and k8s.container.cpu_limit, set up the SignalFx agent and the Otel Collector to monitor the same Kubernetes containers.

You'll get metrics that look like this:
SignalFx Agent kubernetes cluster receiver

  • k8s.container.cpu_request: 0.3 (cpu units)
  • k8s.container.cpu_limit: 0.3 (cpu units)

Otel Collector k8s cluster receiver

  • k8s.container.cpu_request: 300 (millicpu units)
  • k8s.container.cpu_limit: 300 (millicpu units)

This scaling issue was caused by a SignalFx agent PR that was merged in 2020; no equivalent change was ever merged into opentelemetry-collector-contrib.
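For reference, the difference comes down to how the container's CPU request/limit quantity (a Kubernetes resource.Quantity such as "300m") is converted before being emitted. A minimal sketch of the two conversions, assuming the k8s.io/apimachinery resource package; this is illustrative code, not the receiver's actual implementation:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// A container requesting 300 millicores of CPU.
	q := resource.MustParse("300m")

	// What the otel collector k8sclusterreceiver currently emits: millicpu as an int.
	millicores := q.MilliValue() // 300

	// What the SignalFx agent emits (and what the OpenTelemetry CPU semantics expect):
	// CPU units as a double, i.e. millicores divided by 1000.
	cores := float64(q.MilliValue()) / 1000.0 // 0.3

	fmt.Println(millicores, cores) // 300 0.3
}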

For k8s.node.allocatable_cpu, set up the Otel Collector k8sclusterreceiver to monitor a Kubernetes node. In the debug logs you'll notice that k8s.node.allocatable_cpu is consistently dropped and never actually reported to the backend.

2022-02-24T00:39:14.969Z	error	collection/nodes.go:72	metric cpu has value {{1930 -3} {<nil>} 1930m DecimalSI} which is not convertable to int64	{"kind": "receiver", "name": "k8s_cluster"}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/k8sclusterreceiver/internal/collection.getMetricsForNode
	/usr/.../opentelemetry-collector-contrib/receiver/[email protected]/internal/collection/nodes.go:72
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/k8sclusterreceiver/internal/collection.(*DataCollector).SyncMetrics
	/usr/.../opentelemetry-collector-contrib/receiver/[email protected]/internal/collection/collector.go:153
....

Dropping this metric is fixable with a few small changes: k8s.node.allocatable_cpu should be treated as a double series metric instead of an int series metric, since the int64 conversion errors out on fractional CPU values such as 1930m.
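To illustrate, the error above happens because 1930m is not a whole number of cores, so the int64 conversion fails and the data point is thrown away; reading the quantity as a double always succeeds. A minimal sketch, again assuming the apimachinery resource package and not the receiver's actual code:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// The node's allocatable CPU from the error above: 1930m.
	q := resource.MustParse("1930m")

	// The current code path requires an int64 value, which is impossible for
	// fractional cores, so the metric is dropped with the error shown above.
	if _, ok := q.AsInt64(); !ok {
		fmt.Println("1930m is not representable as int64 cores; metric gets dropped")
	}

	// Treating k8s.node.allocatable_cpu as a double series instead always works.
	cores := float64(q.MilliValue()) / 1000.0 // 1.93
	fmt.Println("k8s.node.allocatable_cpu =", cores)
}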

What version did you use?
Version: opentelemetry-collector-contrib v44, Splunk Otel Collector v44

What config did you use?
Splunk Otel Collector config:

receivers:
  k8s_cluster:
    auth_type: serviceAccount
    allocatable_types_to_report: ["cpu", "memory", "ephemeral-storage", "storage"]

Environment
AWS EKS 1.20

Additional context
Fixing the metric scaling issues mentioned in this ticket can be a breaking change for users who are already using the otel collector k8sclusterreceiver, so I would like to propose that we deploy the fix behind a feature gate.

The deployment plan for this fix would look like the following (a rough sketch of registering such a gate follows the list):

  • An alpha stage where the fix is disabled by default and must be enabled through the gate.
  • A beta stage where the fix has been well announced and is enabled by default, but can still be disabled through the gate.
  • A generally available stage where the fix is permanently enabled and the gate is no longer operative.
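For context, registering such a gate might look roughly like the sketch below. The gate ID, variable name, and description are assumptions on my part, and the featuregate API has changed between collector releases, so this follows the current go.opentelemetry.io/collector/featuregate API rather than the exact v0.44 one:

package k8sclusterreceiver

import "go.opentelemetry.io/collector/featuregate"

// Hypothetical gate ID; the real ID would be settled when the fix is implemented.
var reportCPUMetricsAsDoubleGate = featuregate.GlobalRegistry().MustRegister(
	"receiver.k8sclusterreceiver.reportCpuMetricsAsDouble",
	featuregate.StageAlpha, // alpha: disabled by default, opted into via --feature-gates
	featuregate.WithRegisterDescription("Report k8s CPU metrics as doubles in CPU units instead of int millicores."),
)

The receiver would then branch on reportCPUMetricsAsDoubleGate.IsEnabled() when building CPU data points, emitting doubles in CPU units when the gate is on and the current int millicpu values when it is off.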
jvoravong added the bug (Something isn't working) label on Feb 24, 2022

jvoravong (Contributor, Author) commented:
I would like to work on this ticket if it is accepted.
