
[chore][pkg/stanza] Fix the bug that the log emitter might hang when the receiver retries indefinitely #37159

Merged

Conversation

namco1992
Contributor

Description

I was exploring options for applying backpressure to the pipeline when the exporter fails. Inspired by #29410 (comment), I realized that I could enable `retry_on_failure` on the receiver side and have it retry indefinitely by setting `max_elapsed_time` to 0:

```yaml
receivers:
  filelog:
    include: [ input.log ]
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
```

With this config, the consumer will be blocked at the `ConsumeLogs` func in `consumerretry` (`internal/coreinternal/consumerretry/logs.go`) when the exporter fails to consume the logs:

```go
func (lc *logsConsumer) ConsumeLogs(ctx context.Context, logs plog.Logs) error {
```

The `flusher()` func in the `LogEmitter` runs a loop and calls the `consumerFunc` with `context.Background()`. When `ConsumeLogs` is blocked by the retry, there is no way to cancel the retry, so the `LogEmitter` hangs when I try to shut down the collector.
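For context, here is a rough sketch of the kind of retry loop involved (not the actual `consumerretry` implementation; the function name and the fixed delay are placeholders). With `max_elapsed_time` set to 0, cancellation of the passed-in context is the only exit while the exporter keeps failing, which is exactly what `context.Background()` can never provide:

```go
package emitterexample

import (
	"context"
	"time"
)

// consumeWithRetry is an illustrative stand-in for a retry wrapper with
// max_elapsed_time set to 0: it keeps retrying until consume succeeds or the
// passed-in context is cancelled. (The real implementation uses configurable
// exponential backoff; a fixed delay is used here for brevity.)
func consumeWithRetry(ctx context.Context, consume func(context.Context) error) error {
	const retryDelay = time.Second
	for {
		if err := consume(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Cancellation is the only exit while consume keeps failing, so a
			// caller that passes context.Background() can stay in this loop forever.
			return ctx.Err()
		case <-time.After(retryDelay):
			// back off briefly, then retry
		}
	}
}
```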

In this PR, I created a ctx in the `Start` func, which is cancelled later in the `Shutdown` func. The ctx is passed to the flusher and used for the flush on every `flushInterval`. However, I have to swap it with another ctx with a timeout during shutdown to flush the remaining batch one last time. That's the best approach I can think of for now, and I'm open to other suggestions.
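Roughly, the shape of the change looks like the sketch below. It uses simplified placeholder names (`emitter`, `consumerFunc`, `closeChan`) and an illustrative 5-second shutdown timeout rather than the actual pkg/stanza `LogEmitter` code:

```go
package emitterexample

import (
	"context"
	"sync"
	"time"
)

// emitter is a simplified stand-in for the LogEmitter described above.
type emitter struct {
	ctx           context.Context
	cancel        context.CancelFunc
	consumerFunc  func(ctx context.Context, batch []string) error
	flushInterval time.Duration
	closeChan     chan struct{}
	wg            sync.WaitGroup
	stopOnce      sync.Once
}

// Start creates a cancellable context that is handed to the flusher goroutine,
// so in-flight flushes can be interrupted at shutdown.
func (e *emitter) Start() error {
	if e.flushInterval <= 0 {
		e.flushInterval = 100 * time.Millisecond // illustrative default
	}
	e.ctx, e.cancel = context.WithCancel(context.Background())
	e.wg.Add(1)
	go e.flusher(e.ctx)
	return nil
}

// flusher flushes on every tick with the emitter-scoped context instead of
// context.Background(), so a flush blocked in an indefinite retry is released
// as soon as Stop cancels the context.
func (e *emitter) flusher(ctx context.Context) {
	defer e.wg.Done()
	ticker := time.NewTicker(e.flushInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_ = e.consumerFunc(ctx, []string{"periodic batch"})
		case <-e.closeChan:
			return
		}
	}
}

// Stop cancels the long-lived context to unblock any stuck retry, waits for
// the flusher to exit, then flushes the remaining batch one last time using a
// separate short-lived context with a timeout.
func (e *emitter) Stop() error {
	e.stopOnce.Do(func() {
		close(e.closeChan)
		if e.cancel != nil { // nil if the emitter was never started
			e.cancel()
		}
		e.wg.Wait()

		flushCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		_ = e.consumerFunc(flushCtx, []string{"final batch"})
	})
	return nil
}
```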

@namco1992 namco1992 changed the title [pkg/stanza] Fix the bug that the log emitter might hang when the receiver retry indefinitely [chore][pkg/stanza] Fix the bug that the log emitter might hang when the receiver retry indefinitely Jan 13, 2025
```go
	return nil
}

// Stop will close the log channel and stop running goroutines
func (e *LogEmitter) Stop() error {
	e.stopOnce.Do(func() {
		close(e.closeChan)
		// the cancel func could be nil if the emitter is never started.
```
Contributor Author

@namco1992 namco1992 Jan 13, 2025


I realized that quite a few unit tests call Stop() without actually starting the emitter, and that would fail because the context is not created if Start() is not called. I put a defensive check here.

I can also move the context creation to NewLogEmitter, which feels a bit less "natural". I also think some of the lifecycle tests should call Start(), but I didn't touch them in this PR to keep it focused. I can leave a TODO or create a separate PR to fix it. I am open to suggestions.
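To illustrate the situation, here is a small hypothetical test in the style of those lifecycle tests, reusing the `emitter` sketch from earlier (a no-op `consumerFunc` is supplied so the final flush has something to call): `Stop()` runs without `Start()`, so `cancel` is still nil and the defensive check is what keeps it from panicking.

```go
package emitterexample

import (
	"context"
	"testing"
)

// TestStopWithoutStart mirrors the lifecycle tests mentioned above: Stop is
// called on an emitter that was never started, so cancel is still nil and the
// nil check inside Stop must prevent a panic.
func TestStopWithoutStart(t *testing.T) {
	e := &emitter{
		closeChan:    make(chan struct{}),
		consumerFunc: func(context.Context, []string) error { return nil },
	}
	if err := e.Stop(); err != nil {
		t.Fatalf("Stop() returned an unexpected error: %v", err)
	}
}
```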

Member

@djaglowski djaglowski left a comment


Thanks for the fix @namco1992

@namco1992
Contributor Author

Thanks for the review @djaglowski. I don't think it's changelog-worthy, so I didn't add one, but I suppose I added the [chore] prefix too late. Let me know if you think a changelog entry is required, thanks.

@djaglowski djaglowski added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Jan 27, 2025
@djaglowski djaglowski merged commit 9d1d9bb into open-telemetry:main Jan 27, 2025
163 checks passed
@github-actions github-actions bot added this to the next release milestone Jan 27, 2025
chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this pull request Jan 28, 2025
…the receiver retry indefinitely (open-telemetry#37159)

Signed-off-by: Mengnan Gong <[email protected]>
Co-authored-by: Daniel Jaglowski <[email protected]>
@namco1992 namco1992 deleted the mengnan/fix-log-emitter-context branch January 28, 2025 08:44
Labels
pkg/stanza, Skip Changelog