fix: prevent DescribeLogDirs hang in admin client #2269
Merged
This fixes a goroutine/waitgroup bug in the admin client. If an unknown broker ID is passed to `DescribeLogDirs`, it will hang forever. The cause is a waitgroup that can be incremented without an associated decrement, leading to an infinite wait. I think it's simple enough that it can be fixed with just visual inspection, but I included a test case just to be sure.
It's tricky to get a codebase into this scenario, but it has occurred in our systems a few times. In summary, two goroutines use a shared `sarama.Client`:

1. One goroutine consumes through a `PartitionConsumer`.
2. A broker goes offline. `PartitionConsumer` error recovery eventually calls `RefreshMetadata`, updating the internal `brokers` list in the `Client` to remove the broker that's offline.
3. The other goroutine calls `DescribeLogDirs` with the original broker list, which still includes the dead broker.
4. `DescribeLogDirs` calls `findBroker`, which doesn't find the dead broker in `client.Brokers()`.
5. It `continue`s the loop without firing off a goroutine with a `wg.Done()` to decrement the waitgroup.
6. `wg.Wait()` hangs.

This was noted in the original PR introducing this admin API call (#1646 (comment)), but the bug occurs before any network IO is involved.
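The interleaving above can be replayed in a small standalone sketch. Everything here is an assumption made for illustration: `registry` is a hypothetical stand-in for the client's internal broker list, `describe` is a toy version of the corrected loop, and the steps run sequentially in `main` rather than in two real goroutines, so it only demonstrates how a stale broker list meets a refreshed one.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// registry is a hypothetical stand-in for the client's internal broker list.
type registry struct {
	mu      sync.Mutex
	brokers map[int32]string
}

// ids returns a sorted snapshot of the currently known broker IDs.
func (r *registry) ids() []int32 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]int32, 0, len(r.brokers))
	for id := range r.brokers {
		out = append(out, id)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// remove drops a broker, as a metadata refresh would for an offline broker.
func (r *registry) remove(id int32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.brokers, id)
}

// find looks up a broker by ID and reports whether it is still known.
func (r *registry) find(id int32) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	addr, ok := r.brokers[id]
	return addr, ok
}

// describe starts one goroutine per broker it can still find and skips the
// rest. The waitgroup is incremented only after a successful lookup.
func describe(r *registry, ids []int32) (found, skipped []int32) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for _, id := range ids {
		addr, ok := r.find(id)
		if !ok {
			skipped = append(skipped, id)
			continue
		}
		wg.Add(1)
		go func(id int32, addr string) {
			defer wg.Done()
			_ = addr // a real implementation would query the broker here
			mu.Lock()
			found = append(found, id)
			mu.Unlock()
		}(id, addr)
	}
	wg.Wait()
	sort.Slice(found, func(i, j int) bool { return found[i] < found[j] })
	return found, skipped
}

func main() {
	r := &registry{brokers: map[int32]string{
		1: "broker-1:9092", 2: "broker-2:9092", 3: "broker-3:9092",
	}}

	snapshot := r.ids() // the admin goroutine takes its broker list
	r.remove(3)         // a metadata refresh drops the offline broker

	found, skipped := describe(r, snapshot)
	fmt.Println("described:", found, "skipped:", skipped)
}
```

The stale snapshot still names broker 3, but because nothing was added to the waitgroup for it, the call returns with brokers 1 and 2 described and broker 3 reported as skipped.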