newSizedChannel does not properly close on exporter shutdown #11401

Closed
Tarmander opened this issue Oct 8, 2024 · 2 comments · Fixed by #11745
Labels
bug Something isn't working

Comments

@Tarmander

Describe the bug
This is related to an issue with the exporter/loadbalancingexporter. The k8s resolver would repeatedly Shutdown() and create two new boundedMemoryQueues every time the endpoints were "updated" (roughly every 3 minutes). This behavior went unnoticed until the memory_limiter processor started to drop spans.

After investigating with the pprof extension, we realized we had an unbounded memory leak: each time an exporter and its queue were shut down, the underlying channel was not GC'd. We kept allocating a new channel on every update until the pod ran OOM.

[pprof screenshot]
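
A minimal sketch of the failure mode (illustrative only; the names below are made up and this is not collector code): Go only frees a channel's buffer once nothing references the queue that owns it, so any long-lived closure created at startup that still captures the queue keeps the buffer alive even after Shutdown() returns.

package leaksketch

// sizedQueue is a made-up stand-in for the collector's bounded memory queue.
type sizedQueue struct {
	items chan int
}

// registry stands in for a long-lived list of callbacks that outlives
// individual exporters.
var registry []func() int

func newSizedQueue(capacity int) *sizedQueue {
	q := &sizedQueue{items: make(chan int, capacity)}
	// This closure captures q, so q.items stays reachable (and its buffer
	// stays allocated) until the callback is removed from the registry.
	registry = append(registry, func() int { return len(q.items) })
	return q
}

func (q *sizedQueue) Shutdown() {
	// Closing the channel stops producers and consumers, but it does not
	// free the buffer while q is still referenced from the registry.
	close(q.items)
}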

Steps to reproduce
Configure the k8s resolver to point to a service with many endpoints (the more endpoints, the faster memory grows). Run with the pprof extension to watch the memory held by newSizedChannel increase over time.
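
For example, with the pprof extension listening on localhost:1777 as in the config below, a heap profile can be pulled and summarized with the standard Go tooling and scanned for newSizedChannel allocations:

go tool pprof -top http://localhost:1777/debug/pprof/heap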

What did you expect to see?
All exporters, queues, and channels to be properly shut down and GC'd.

What did you see instead?
Channels in existing exporter queues were never released, and they eventually exhausted the pod's memory.

What version did you use?
v0.105.0

What config did you use?

receivers:
  otlp:
    protocols:
      grpc: { }
      http: { }
processors:
  batch:
    timeout: 1s
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 20
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
        sending_queue:
          queue_size: 100000
          num_consumers: 25
    resolver:
      k8s:
        service: opentelemetry-global-gateway-collector-headless.opentelemetry-global-collector
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: localhost:1777
service:
  extensions: [ health_check, zpages, pprof ]
  telemetry:
    logs:
      level: info
      encoding: json
    metrics:
      address: 0.0.0.0:8888
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, batch ]
      exporters: [ loadbalancing ]

Environment
OS: Ubuntu 22.04
Compiler: go1.22.6
Kubernetes version: apiVersion: opentelemetry.io/v1beta1

@Tarmander added the bug label Oct 8, 2024
@madaraszg-tulip
Contributor

I have noticed a very similar issue using the loadbalancing exporter in Grafana Alloy, which uses this component. Here are some Pyroscope screenshots:

[Pyroscope screenshots]

@madaraszg-tulip
Contributor

madaraszg-tulip commented Nov 25, 2024

https://github.com/open-telemetry/opentelemetry-collector/blob/v0.114.0/exporter/exporterhelper/internal/metadata/generated_telemetry.go#L58-L92

	_, err = builder.meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveInt64(builder.ExporterQueueCapacity, cb(), opts...)
		return nil
	}, builder.ExporterQueueCapacity)

The return values of the RegisterCallback calls are ignored. They hold the registrations needed to unregister these callbacks when the exporter is shut down. I assume they should be returned to the caller, exporter/exporterhelper/internal/queue_sender's Start(), to be used later in Shutdown().
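
A minimal sketch of that pattern (assumed usage of the go.opentelemetry.io/otel/metric API, not the actual collector code; the type and metric names are placeholders): keep the metric.Registration returned by RegisterCallback and call Unregister() in Shutdown(), so the callback no longer pins the queue and its sized channel.

package fixsketch

import (
	"context"

	"go.opentelemetry.io/otel/metric"
)

// sizedQueue is a placeholder for the exporterhelper queue; only Size() is
// needed for this example.
type sizedQueue interface {
	Size() int64
}

// queueSender is a hypothetical stand-in for exporterhelper's queue sender.
type queueSender struct {
	queue sizedQueue
	reg   metric.Registration
}

func (qs *queueSender) Start(_ context.Context, meter metric.Meter) error {
	sizeGauge, err := meter.Int64ObservableGauge("exporter_queue_size")
	if err != nil {
		return err
	}
	// Keep the Registration instead of discarding it.
	qs.reg, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveInt64(sizeGauge, qs.queue.Size())
		return nil
	}, sizeGauge)
	return err
}

func (qs *queueSender) Shutdown(_ context.Context) error {
	// Unregistering drops the callback's reference to the queue, so the
	// queue's channel becomes unreachable and can be garbage collected.
	if qs.reg != nil {
		return qs.reg.Unregister()
	}
	return nil
}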

github-merge-queue bot pushed a commit that referenced this issue Dec 6, 2024
#### Description

Fix memory leak at exporter shutdown. At startup the exporter creates metric callbacks that hold references to the exporter queue; these need to be unregistered at shutdown.

#### Link to tracking issue
Fixes #11401

---------

Co-authored-by: Alex Boten <[email protected]>
HongChenTW pushed a commit to HongChenTW/opentelemetry-collector that referenced this issue Dec 19, 2024