Describe the bug
The otel collector k8sclusterreceiver was meant to have 1:1 functionality with the SignalFx Agent Kubernetes cluster receiver so that SignalFx users could migrate to the otel collector without issue. Some users have reported metric reporting discrepancies between the two receivers, which is blocking them from migrating. We also want the otel collector k8sclusterreceiver to follow the OpenTelemetry CPU spec (the SignalFx Agent already follows this spec).
The problematic metrics:
k8s.container.cpu_request: a bug causes a 1000x scaling issue
k8s.container.cpu_limit: a bug causes a 1000x scaling issue
k8s.node.allocatable_cpu: a bug silently prevents this metric from being reported at all
Steps to reproduce
For k8s.container.cpu_request and k8s.container.cpu_limit, set up the SignalFx Agent and the Otel Collector to monitor the same Kubernetes containers.
You'll get metrics that look like this:
SignalFx Agent kubernetes cluster receiver
k8s.container.cpu_request: 0.3 (cpu units)
k8s.container.cpu_limit: 0.3 (cpu units)
Otel Collector k8s cluster receiver
k8s.container.cpu_request: 300 (millicpu units)
k8s.container.cpu_limit: 300 (millicpu units)
This scaling issue was caused by a SignalFx Agent PR that was merged in 2020; no equivalent change was ever merged into opentelemetry-collector-contrib.
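The fix is a unit conversion at the point where the container's Kubernetes CPU quantity is turned into a data point. A minimal sketch of the expected conversion, assuming the k8s.io/apimachinery resource package (the helper name is illustrative, not the receiver's actual code):

package k8sclusterreceiver

import (
	"k8s.io/apimachinery/pkg/api/resource"
)

// cpuQuantityToCores converts a Kubernetes CPU quantity such as "300m"
// into CPU units as a float64, matching the SignalFx Agent and the
// OpenTelemetry CPU spec. Illustrative helper only.
func cpuQuantityToCores(q resource.Quantity) float64 {
	// MilliValue returns millicpu (300 for "300m"); dividing by 1000
	// yields CPU units (0.3) instead of reporting the raw 300.
	return float64(q.MilliValue()) / 1000.0
}

For example, cpuQuantityToCores(resource.MustParse("300m")) returns 0.3, matching the SignalFx Agent value above.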
For k8s.node.allocatable_cpu, set up the Otel Collector k8sclusterreceiver to monitor a Kubernetes node. In the debug logs you'll notice k8s.node.allocatable_cpu is always dropped and never actually reported to the backend.
2022-02-24T00:39:14.969Z error collection/nodes.go:72 metric cpu has value {{1930 -3} {<nil>} 1930m DecimalSI} which is not convertable to int64 {"kind": "receiver", "name": "k8s_cluster"}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/k8sclusterreceiver/internal/collection.getMetricsForNode
/usr/.../opentelemetry-collector-contrib/receiver/[email protected]/internal/collection/nodes.go:72
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/k8sclusterreceiver/internal/collection.(*DataCollector).SyncMetrics
/usr/.../opentelemetry-collector-contrib/receiver/[email protected]/internal/collection/collector.go:153
....
The dropping of this metric is fixable with a small change: k8s.node.allocatable_cpu should be treated as a double series metric instead of an int series metric, since fractional CPU quantities such as 1930m cannot be converted to int64 and currently cause the error above.
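A rough sketch of the proposed handling, assuming the k8s.io/api and k8s.io/apimachinery packages (the function name is illustrative and the data-point plumbing is omitted):

package k8sclusterreceiver

import (
	corev1 "k8s.io/api/core/v1"
)

// nodeAllocatableCPUCores returns a node's allocatable CPU as a float64,
// suitable for emitting as a double data point. Illustrative only.
func nodeAllocatableCPUCores(node *corev1.Node) float64 {
	q := node.Status.Allocatable[corev1.ResourceCPU]
	// "1930m" has a MilliValue of 1930; as a double this is 1.93 cores,
	// avoiding the "not convertable to int64" error shown above.
	return float64(q.MilliValue()) / 1000.0
}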
What version did you use?
Version: opentelemetry-collector-contrib v44, Splunk Otel Collector v44
What config did you use?
Splunk Otel Collector Config:
receivers:
  k8s_cluster:
    auth_type: serviceAccount
    allocatable_types_to_report: ["cpu", "memory", "ephemeral-storage", "storage"]
Environment
AWS EKS 1.20
Additional context
Fixing the metric scaling issues mentioned in this ticket would be a breaking change for users who are already relying on the current behavior of the otel collector k8sclusterreceiver, so I would like to propose we deploy the fix behind a feature gate.
The deployment plan for this fix would look like the following (a rough sketch of the gate wiring follows the list):
An alpha stage where the fix is disabled by default and must be enabled through a Gate.
A beta stage where the fix has been well announced and is enabled by default but can be disabled through a Gate.
A generally available stage where the fix is permanently enabled and the Gate is no longer operative.
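For reference, this is roughly how such a gate could be wired up with the collector's featuregate package; the gate ID, description, and exact registration API below are illustrative and depend on the collector version:

package k8sclusterreceiver

import "go.opentelemetry.io/collector/featuregate"

// Hypothetical gate guarding the CPU-metric fix; the ID and description
// are illustrative, not the receiver's actual gate.
var reportCPUMetricsAsDoubleGate = featuregate.GlobalRegistry().MustRegister(
	"receiver.k8sclusterreceiver.reportCpuMetricsAsDouble",
	featuregate.StageAlpha, // disabled by default while in alpha
	featuregate.WithRegisterDescription("Report k8s CPU metrics in CPU units (doubles) instead of millicpu (ints)."),
)

// reportCPUAsDouble lets the receiver branch on the gate so existing
// users keep the old behavior until the gate graduates.
func reportCPUAsDouble() bool {
	return reportCPUMetricsAsDoubleGate.IsEnabled()
}

During the alpha stage, users could opt in via the collector's --feature-gates command-line flag.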