[receiver/kafkareceiver] fix: Kafka receiver blocking shutdown #35767
Merged: djaglowski merged 3 commits into open-telemetry:main from observIQ:fix-kafka-recv-blocking-shutdown on Oct 22, 2024
Conversation
Thanks for taking care of this, it resolves an issue we are seeing in production with a high throughput logs pipeline.
djaglowski approved these changes on Oct 14, 2024
LGTM. @pavolloffay, @MovieStoreGuy, PTAL
One comment from me trying to sneak in some improvements, but the changes seem sensible.
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request on Dec 17, 2024
…telemetry#35767)

#### Description

Fixes an issue where the Kafka receiver would block on shutdown.

There was an earlier fix for this issue [here](open-telemetry#32720). That fix does solve the issue, but it was only applied to the traces receiver, not the logs or metrics receivers.

The issue is this goroutine in the `Start()` functions for logs and metrics:

```go
go func() {
	if err := c.consumeLoop(ctx, metricsConsumerGroup); err != nil {
		componentstatus.ReportStatus(host, componentstatus.NewFatalErrorEvent(err))
	}
}()
```

The `consumeLoop()` function returns a `context.Canceled` error when `Shutdown()` is called, which is expected. However, `componentstatus.ReportStatus()` blocks while attempting to report this error. The reason/bug for this can be found [here](open-telemetry/opentelemetry-collector#9824).

The previously mentioned PR fixed this for the traces receiver by checking whether the error returned by `consumeLoop()` is `context.Canceled`:

```go
go func() {
	if err := c.consumeLoop(ctx, consumerGroup); !errors.Is(err, context.Canceled) {
		componentstatus.ReportStatus(host, componentstatus.NewFatalErrorEvent(err))
	}
}()
```

Additionally, this is `consumeLoop()` for the traces receiver, with the logs and metrics versions being identical:

```go
func (c *kafkaTracesConsumer) consumeLoop(ctx context.Context, handler sarama.ConsumerGroupHandler) error {
	for {
		// `Consume` should be called inside an infinite loop, when a
		// server-side rebalance happens, the consumer session will need to be
		// recreated to get the new claims
		if err := c.consumerGroup.Consume(ctx, c.topics, handler); err != nil {
			c.settings.Logger.Error("Error from consumer", zap.Error(err))
		}
		// check if context was cancelled, signaling that the consumer should stop
		if ctx.Err() != nil {
			c.settings.Logger.Info("Consumer stopped", zap.Error(ctx.Err()))
			return ctx.Err()
		}
	}
}
```

This does fix the issue; however, the only error that can be returned by `consumeLoop()` is a canceled context. When we create the context and cancel function, we use `context.Background()`:

```go
ctx, cancel := context.WithCancel(context.Background())
```

This context is only used by `consumeLoop()` and the cancel function is only called in `Shutdown()`.

Because `consumeLoop()` can only return a `context.Canceled` error, this PR removes the now-unused error handling for the logs, metrics, and traces receivers. Instead, `consumeLoop()` still logs the `context.Canceled` error but does not return any error, and the goroutine simply calls `consumeLoop()`.

An additional motivation for removing the call to `componentstatus.ReportStatus()` is that the underlying function it calls, `componentstatus.Report()`, is documented as not needing to be called during `Shutdown()` or `Start()`, since the service already does so for the given component ([comment here](https://github.com/open-telemetry/opentelemetry-collector/blob/main/component/componentstatus/status.go#L21-L25)). Even if there wasn't a bug causing this call to block, the component still shouldn't call it, since it would only be called during `Shutdown()`.

#### Link to tracking issue

Fixes open-telemetry#30789

#### Testing

Tested in a build of the collector with these changes, scraping logs from a Kafka instance. When the collector was stopped and `Shutdown()` was called, the receiver did not block and the collector stopped gracefully as expected.
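The argument that `consumeLoop()` can only ever return `context.Canceled` rests on how a context derived from `context.Background()` behaves: it has no deadline and no parent that can cancel it, so `ctx.Err()` stays `nil` until the cancel function is called. The following is a minimal standalone sketch of that behaviour using only the standard library. The helper names `errBeforeAndAfterCancel` and `isCanceled` are made up for illustration and are not part of the receiver.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errBeforeAndAfterCancel (hypothetical helper) returns ctx.Err() as observed
// before and after calling the cancel function of a context created the same
// way the description shows: context.WithCancel(context.Background()).
func errBeforeAndAfterCancel() (before, after error) {
	ctx, cancel := context.WithCancel(context.Background())
	before = ctx.Err() // no deadline, no cancellable parent: nil
	cancel()
	after = ctx.Err() // set by cancel()
	return before, after
}

// isCanceled (hypothetical helper) reports whether err is context.Canceled.
func isCanceled(err error) bool {
	return errors.Is(err, context.Canceled)
}

func main() {
	before, after := errBeforeAndAfterCancel()
	fmt.Println("nil before cancel:", before == nil)
	fmt.Println("context.Canceled after cancel:", isCanceled(after))
}
```

Since the only caller of the cancel function is `Shutdown()`, any non-nil value returned from `ctx.Err()` in the loop is necessarily `context.Canceled`, which is why an `err != nil` branch around the loop's return value carries no extra information.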
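The shutdown behaviour described under Testing was checked manually against a Kafka instance. As a rough illustration of the shape of the change, the sketch below models a receiver whose goroutine simply calls a `consumeLoop()` that logs and returns nothing, and then checks that shutdown completes promptly. Everything here is an assumption made for the sketch: `groupConsumer`, `blockingConsumer`, `receiver`, and `shutdownReturnsPromptly` are hypothetical names, the real receiver uses `sarama.ConsumerGroup` and a zap logger, and the `sync.WaitGroup` is added only so the sketch can observe the goroutine exiting.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// groupConsumer is a hypothetical stand-in for the consumer group used by the
// real receiver; it is not the sarama interface.
type groupConsumer interface {
	Consume(ctx context.Context) error
}

// blockingConsumer blocks until the context is cancelled, mimicking a consumer
// session that only ends on shutdown.
type blockingConsumer struct{}

func (blockingConsumer) Consume(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// receiver is a hypothetical, stripped-down model of a Kafka receiver.
type receiver struct {
	consumer groupConsumer
	cancel   context.CancelFunc
	wg       sync.WaitGroup // assumption: used here only to observe loop exit
}

// Start creates the context from context.Background() and runs the loop in a
// goroutine that does nothing except call consumeLoop.
func (r *receiver) Start() {
	ctx, cancel := context.WithCancel(context.Background())
	r.cancel = cancel
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		r.consumeLoop(ctx)
	}()
}

// consumeLoop logs errors and the cancellation, but returns no error.
func (r *receiver) consumeLoop(ctx context.Context) {
	for {
		if err := r.consumer.Consume(ctx); err != nil {
			fmt.Println("error from consumer:", err)
		}
		// A cancelled context signals that the consumer should stop.
		if ctx.Err() != nil {
			fmt.Println("consumer stopped:", ctx.Err())
			return
		}
	}
}

// Shutdown cancels the context and waits for the loop to exit.
func (r *receiver) Shutdown() {
	r.cancel()
	r.wg.Wait()
}

// shutdownReturnsPromptly reports whether Shutdown returns within one second.
func shutdownReturnsPromptly(r *receiver) bool {
	done := make(chan struct{})
	go func() {
		r.Shutdown()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	r := &receiver{consumer: blockingConsumer{}}
	r.Start()
	fmt.Println("shutdown returned promptly:", shutdownReturnsPromptly(r))
}
```

Because the goroutine no longer reports a status event when the loop exits, nothing on the shutdown path in this model can block once the context is cancelled.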