[receiver/k8sclusterreceiver] Some k8s node and container cpu metrics are not being reported properly #8115

Closed
jvoravong opened this issue Feb 24, 2022 · 1 comment
Labels: bug (Something isn't working), comp:kubernetes (Kubernetes-related components)

jvoravong commented Feb 24, 2022

Describe the bug
The otel collector k8sclusterreceiver was meant to have 1:1 functionality with the SignalFx agent Kubernetes cluster receiver so that SignalFx users could migrate to the otel collector without issue. Some users have reported metric discrepancies between the otel collector k8sclusterreceiver and the SignalFx agent Kubernetes cluster receiver, and these discrepancies are blocking them from migrating. We also want the otel collector k8sclusterreceiver to follow the OpenTelemetry CPU metric semantics (the SignalFx agent already follows this spec).

The problematic metrics:

  • k8s.container.cpu_request, a bug causes a 1000x scaling issue
  • k8s.container.cpu_limit, a bug causes a 1000x scaling issue
  • k8s.node.allocatable_cpu, a bug silently prevents this metric from being reported at all

Steps to reproduce
For k8s.container.cpu_request and k8s.container.cpu_limit, set up the SignalFx agent and the Otel Collector to monitor the same Kubernetes containers.

You'll get metrics that look like this:
SignalFx Agent kubernetes cluster receiver

  • k8s.container.cpu_request: 0.3 (cpu units)
  • k8s.container.cpu_limit: 0.3 (cpu units)

Otel Collector k8s cluster receiver

  • k8s.container.cpu_request: 300 (millicpu units)
  • k8s.container.cpu_limit: 300 (millicpu units)

This scaling issue was caused by a SignalFx agent PR that was merged in 2020; no equivalent change was ever merged into opentelemetry-collector-contrib.
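For reference, the difference comes down to how the container's CPU request/limit quantity (a Kubernetes resource.Quantity such as "300m") is converted before being emitted. A minimal sketch of the two conversions, assuming the k8s.io/apimachinery resource package; this is illustrative code, not the receiver's actual implementation:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// A container requesting 300 millicores of CPU.
	q := resource.MustParse("300m")

	// What the otel collector k8sclusterreceiver currently emits: millicpu as an int.
	millicores := q.MilliValue() // 300

	// What the SignalFx agent emits (and what the OpenTelemetry CPU semantics expect):
	// CPU units as a double, i.e. millicores divided by 1000.
	cores := float64(q.MilliValue()) / 1000.0 // 0.3

	fmt.Println(millicores, cores) // 300 0.3
}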

For k8s.node.allocatable_cpu, set up the Otel Collector k8sclusterreceiver to monitor a Kubernetes node. In the debug logs you'll notice that k8s.node.allocatable_cpu is consistently dropped and never actually reported to the backend.

2022-02-24T00:39:14.969Z	error	collection/nodes.go:72	metric cpu has value {{1930 -3} {<nil>} 1930m DecimalSI} which is not convertable to int64	{"kind": "receiver", "name": "k8s_cluster"}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/k8sclusterreceiver/internal/collection.getMetricsForNode
	/usr/.../opentelemetry-collector-contrib/receiver/[email protected]/internal/collection/nodes.go:72
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/k8sclusterreceiver/internal/collection.(*DataCollector).SyncMetrics
	/usr/.../opentelemetry-collector-contrib/receiver/[email protected]/internal/collection/collector.go:153
....

Dropping this metric is fixable with a few small changes: k8s.node.allocatable_cpu should be treated as a double series metric instead of an int series metric, since the int64 conversion errors out on fractional CPU values such as 1930m.
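To illustrate, the error above happens because 1930m is not a whole number of cores, so the int64 conversion fails and the data point is thrown away; reading the quantity as a double always succeeds. A minimal sketch, again assuming the apimachinery resource package and not the receiver's actual code:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// The node's allocatable CPU from the error above: 1930m.
	q := resource.MustParse("1930m")

	// The current code path requires an int64 value, which is impossible for
	// fractional cores, so the metric is dropped with the error shown above.
	if _, ok := q.AsInt64(); !ok {
		fmt.Println("1930m is not representable as int64 cores; metric gets dropped")
	}

	// Treating k8s.node.allocatable_cpu as a double series instead always works.
	cores := float64(q.MilliValue()) / 1000.0 // 1.93
	fmt.Println("k8s.node.allocatable_cpu =", cores)
}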

What version did you use?
Version: opentelemetry-collector-contrib v44, Splunk Otel Collector v44

What config did you use?
Splunk Otel Collector config:

receivers:
  k8s_cluster:
    auth_type: serviceAccount
    allocatable_types_to_report: ["cpu", "memory", "ephemeral-storage", "storage"]

Environment
AWS EKS 1.20

Additional context
Fixing the metric scaling issues mentioned in this ticket can be a breaking change for users who are already using the otel collector k8sclusterreceiver, so I would like to propose that we deploy the fix behind a feature gate.

The deployment plan for this fix would look like the following (a rough sketch of registering such a gate follows the list):

  • An alpha stage where the fix is disabled by default and must be enabled through the gate.
  • A beta stage where the fix has been well announced and is enabled by default, but can still be disabled through the gate.
  • A generally available stage where the fix is permanently enabled and the gate is no longer operative.
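For context, registering such a gate might look roughly like the sketch below. The gate ID, variable name, and description are assumptions on my part, and the featuregate API has changed between collector releases, so this follows the current go.opentelemetry.io/collector/featuregate API rather than the exact v0.44 one:

package k8sclusterreceiver

import "go.opentelemetry.io/collector/featuregate"

// Hypothetical gate ID; the real ID would be settled when the fix is implemented.
var reportCPUMetricsAsDoubleGate = featuregate.GlobalRegistry().MustRegister(
	"receiver.k8sclusterreceiver.reportCpuMetricsAsDouble",
	featuregate.StageAlpha, // alpha: disabled by default, opted into via --feature-gates
	featuregate.WithRegisterDescription("Report k8s CPU metrics as doubles in CPU units instead of int millicores."),
)

The receiver would then branch on reportCPUMetricsAsDoubleGate.IsEnabled() when building CPU data points, emitting doubles in CPU units when the gate is on and the current int millicpu values when it is off.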
jvoravong added the bug (Something isn't working) label on Feb 24, 2022

jvoravong (Contributor, Author) commented:
I would like to work on this ticket if it is accepted.
