Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InmemSink: create a new RWMutex , because they are not safe to copy #124

Merged
merged 1 commit into from
Apr 30, 2021

Conversation

dnephin
Copy link
Contributor

@dnephin dnephin commented Apr 28, 2021

We noticed some Consul tests deadlock and timeout when run with the race detector (example). I tracked down the problem to this line in InmemSink.Data.

In 1d3de9f there was a change made to fix some data races in this method. It looks like the RWMutex field was missed, possibly because the issue isn't a data race itself, so isn't reported by the race detector.

RWMutex are not safe for copying, so we need to create a new instance on the copy of IntervalMetrics.

Before this change I was able to pretty reliable cause the test to deadlock in under 200 runs. After this change the test no longer deadlocks after 1000+ runs. I haven't looked at the RWMutex internals, but I assume the docs mention they can not be copied for this reason.

Copy link
Member

@rboyer rboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

dnephin added a commit to hashicorp/consul that referenced this pull request Apr 28, 2021
We noticed that TestUpstreamListener would deadlock sometimes when run
with the race detector. While debugging this issue I found and fixed the
following problems.

1. the net.Listener was not being closed properly when Listener.Stop was
   called. This caused the Listener.Serve goroutine to run forever.
   Fixed by storing a reference to net.Listener and closing it properly
   when Listener.Stop is called.
2. call connWG.Add in the correct place. WaitGroup.Add must be called
   before starting a goroutine, not from inside the goroutine.
3. Set metrics config EnableRuntimeMetrics to `false` so that we don't
   start a background goroutine in each test for no reason. There is no
   way to shutdown this goroutine, and it was an added distraction while
   debugging these timeouts.
5. two tests were calling require.NoError from a goroutine.
   require.NoError calls t.FailNow, which MUST be called from the main
   test goroutine. Instead use t.Errorf, which can be called from other
   goroutines and will still fail the test.
6. `assertCurrentGaugeValue` wass breaking out of a for loop, which
   would cause the `RWMutex.RUnlock` to be missed. Fixed by calling
   unlock before `break`.

The core issue of a deadlock  was fixed by hashicorp/go-metrics#124.
@banks banks merged commit a054c40 into hashicorp:master Apr 30, 2021
@dnephin dnephin deleted the dnephin/fix-inmem-mutex-copy branch September 12, 2021 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants