Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: de-flake TestRefreshPendingCommands #19404

Merged
merged 1 commit into from
Oct 20, 2017

Conversation

tbg
Copy link
Member

@tbg tbg commented Oct 20, 2017

The test ran a for loop without preemption points. The loop checked a
condition that would only become true after another goroutine had been
scheduled and carried out its job.

If, with only few cores (four in my case) GC kicked in before that other
goroutine got scheduled, that loop would just run hot forever until the
test timed out, and the resulting stack dump looked quite unhelpful.

Add a small sleep so the runtime can preempt the goroutine.

The issue was harder to run into when stressing only the test, since there
was less garbage available at that point. Adding some print statements,
I accidentally made it much more likely.

Previously flaked (got stuck) within <500iters, now ran past 1.5k without
problems.

Fixes #19397.
Fixes #19388.
Touches #19367.
Fixes #18554.

The test ran a for loop without preemption points. The loop checked a
condition that would only become true after another goroutine had been
scheduled and carried out its job.

If, with only few cores (four in my case) GC kicked in before that other
goroutine got scheduled, that loop would just run hot forever until the
test timed out, and the resulting stack dump looked quite unhelpful.

Add a small sleep so the runtime can preempt the goroutine.

The issue was harder to run into when stressing only the test, since there
was less garbage available at that point. Adding some print statements,
I accidentally made it much more likely.

Previously flaked (got stuck) within <500iters, now ran past 1.5k without
problems.

Fixes cockroachdb#19397.
Fixes cockroachdb#19388.
Touches cockroachdb#19367.
Fixes cockroachdb#18554.
@tbg tbg requested a review from a team October 20, 2017 12:51
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@a-robinson a-robinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much plause for figuring this one out, @tschottdorf!

@tbg
Copy link
Member Author

tbg commented Oct 20, 2017

For future readers: golang/go#10958

@tbg tbg merged commit 3b7e1fa into cockroachdb:master Oct 20, 2017
@tbg tbg deleted the no-livelock-for branch October 20, 2017 13:41
@tbg
Copy link
Member Author

tbg commented Oct 20, 2017

ps @a-robinson I also introduced this loop, so ☘️ (shame-rock).

@petermattis
Copy link
Collaborator

:lgtm:

Many thanks for tracking down this flakiness so I don't have to. 🍻


Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):

				// are in GC, or you even only have one to begin with!), it can end up
				// spinning forever. In the case that prompted adding this, there are
				// four cores and "gc assist" got us stuck.

Was it reallygc assist that got stuck? I looked at the stacks I didn't see that. If a loop is not preemptible, GC will block trying to stop-the-world which will eventually lock up all other goroutines. I imaging the loop is not preemptible because IsDraining() was inlined.

PS Rather than time.Sleep(), I think you could use runtime.Gosched(), though the time.Sleep() is perfectly fine. My quibble is about the comment.


Comments from Reviewable

@a-robinson
Copy link
Contributor

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Was it reallygc assist that got stuck? I looked at the stacks I didn't see that. If a loop is not preemptible, GC will block trying to stop-the-world which will eventually lock up all other goroutines. I imaging the loop is not preemptible because IsDraining() was inlined.

PS Rather than time.Sleep(), I think you could use runtime.Gosched(), though the time.Sleep() is perfectly fine. My quibble is about the comment.

The inlining is almost certainly why this goroutine wasn't preemptible, but I would have assumed that it was the other tests being stressed that were taking up the other threads.


Comments from Reviewable

@tbg
Copy link
Member Author

tbg commented Oct 20, 2017

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

The inlining is almost certainly why this goroutine wasn't preemptible, but I would have assumed that it was the other tests being stressed that were taking up the other threads.

No, I was stressing just this test. Here you go: https://gist.github.com/tschottdorf/cf092bb02ec624e68f5d7e6ac953b2bf

goroutine 889 [GC assist wait]:
github.com/cockroachdb/cockroach/pkg/util/log.MakeMessage(0x6568c20, 0xc4200140d8, 0x58d4f3b, 0x11, 0x0, 0x0, 0x0, 0xc4205dc6c0, 0x74b7bc0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:127 +0x31
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x6568c20, 0xc4200140d8, 0x2, 0x2, 0x58d4f3b, 0x11, 0x0, 0x0, 0x0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:141 +0xc9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x6568c20, 0xc4200140d8, 0x1, 0x2, 0x58d4f3b, 0x11, 0x0, 0x0, 0x0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:56 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Warningf(0x6568c20, 0xc4200140d8, 0x58d4f3b, 0x11, 0x0, 0x0, 0x0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:99 +0x7e
github.com/cockroachdb/cockroach/pkg/storage_test.TestRefreshPendingCommands.func1.1(0xc420abb6c0, 0xc4202f8020, 0x0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/client_raft_test.go:1078 +0x6d
created by github.com/cockroachdb/cockroach/pkg/storage_test.TestRefreshPendingCommands.func1
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/client_raft_test.go:1077 +0x765

main goroutine:

goroutine 508 [running]:
sync/atomic.(*Value).Load(...)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1785
github.com/cockroachdb/cockroach/pkg/storage.(*Store).IsDraining(...)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1785
github.com/cockroachdb/cockroach/pkg/storage_test.TestRefreshPendingCommands.func1(0xc420d340f0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/client_raft_test.go:1094 +0x7c0 fp=0xc420db9fa8 sp=0xc420db9af0 pc=0x514e4c0
testing.tRunner(0xc420d340f0, 0xc4202a47e0)
	/usr/local/Cellar/go/1.9/libexec/src/testing/testing.go:746 +0xd0 fp=0xc420db9fd0 sp=0xc420db9fa8 pc=0x40f81b0
runtime.goexit()
	/usr/local/Cellar/go/1.9/libexec/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc420db9fd8 sp=0xc420db9fd0 pc=0x405fd51
created by testing.(*T).Run
	/usr/local/Cellar/go/1.9/libexec/src/testing/testing.go:789 +0x2de

Full gist:


Comments from Reviewable

@tbg
Copy link
Member Author

tbg commented Oct 20, 2017

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

No, I was stressing just this test. Here you go: https://gist.github.com/tschottdorf/cf092bb02ec624e68f5d7e6ac953b2bf

goroutine 889 [GC assist wait]:
github.com/cockroachdb/cockroach/pkg/util/log.MakeMessage(0x6568c20, 0xc4200140d8, 0x58d4f3b, 0x11, 0x0, 0x0, 0x0, 0xc4205dc6c0, 0x74b7bc0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:127 +0x31
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x6568c20, 0xc4200140d8, 0x2, 0x2, 0x58d4f3b, 0x11, 0x0, 0x0, 0x0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:141 +0xc9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x6568c20, 0xc4200140d8, 0x1, 0x2, 0x58d4f3b, 0x11, 0x0, 0x0, 0x0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:56 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Warningf(0x6568c20, 0xc4200140d8, 0x58d4f3b, 0x11, 0x0, 0x0, 0x0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:99 +0x7e
github.com/cockroachdb/cockroach/pkg/storage_test.TestRefreshPendingCommands.func1.1(0xc420abb6c0, 0xc4202f8020, 0x0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/client_raft_test.go:1078 +0x6d
created by github.com/cockroachdb/cockroach/pkg/storage_test.TestRefreshPendingCommands.func1
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/client_raft_test.go:1077 +0x765

main goroutine:

goroutine 508 [running]:
sync/atomic.(*Value).Load(...)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1785
github.com/cockroachdb/cockroach/pkg/storage.(*Store).IsDraining(...)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1785
github.com/cockroachdb/cockroach/pkg/storage_test.TestRefreshPendingCommands.func1(0xc420d340f0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/client_raft_test.go:1094 +0x7c0 fp=0xc420db9fa8 sp=0xc420db9af0 pc=0x514e4c0
testing.tRunner(0xc420d340f0, 0xc4202a47e0)
	/usr/local/Cellar/go/1.9/libexec/src/testing/testing.go:746 +0xd0 fp=0xc420db9fd0 sp=0xc420db9fa8 pc=0x40f81b0
runtime.goexit()
	/usr/local/Cellar/go/1.9/libexec/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc420db9fd8 sp=0xc420db9fd0 pc=0x405fd51
created by testing.(*T).Run
	/usr/local/Cellar/go/1.9/libexec/src/testing/testing.go:789 +0x2de

Full gist:

https://gist.github.com/tschottdorf/cf092bb02ec624e68f5d7e6ac953b2bf


Comments from Reviewable

@petermattis
Copy link
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

https://gist.github.com/tschottdorf/cf092bb02ec624e68f5d7e6ac953b2bf

GC assist wait is the state goroutines hit when they are called on to perform GC work but no work is available. And no work is available because the GC is waiting for this looping goroutine to stop.

I'll send a PR to clarify the comment. Thanks again for tracking this down.


Comments from Reviewable

@nvanbenschoten
Copy link
Member

Awesome catch @tschottdorf! Is it too early for a peer ack?


Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

GC assist wait is the state goroutines hit when they are called on to perform GC work but no work is available. And no work is available because the GC is waiting for this looping goroutine to stop.

I'll send a PR to clarify the comment. Thanks again for tracking this down.

@petermattis if you're going to clarify, could you swap over to runtime.Gosched() while you're at it? That was my first thought when reading this change too. Even though it will have pretty much the same effect as time.Sleep(time.Nanosecond), it's a bit more explicit.


Comments from Reviewable

@petermattis
Copy link
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

@petermattis if you're going to clarify, could you swap over to runtime.Gosched() while you're at it? That was my first thought when reading this change too. Even though it will have pretty much the same effect as time.Sleep(time.Nanosecond), it's a bit more explicit.

Yes, though I'm having trouble reproducing without this change which is why I haven't sent that PR.


Comments from Reviewable

@nvanbenschoten
Copy link
Member

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Yes, though I'm having trouble reproducing without this change which is why I haven't sent that PR.

How many cores are you running this on?


Comments from Reviewable

@petermattis
Copy link
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How many cores are you running this on?

  1. Turns out I was seeing the failure, just not recognizing it.

Comments from Reviewable

petermattis added a commit to petermattis/cockroach that referenced this pull request Oct 20, 2017
Non-preemptible loops prevent GC from starting which will cause the
whole process to lock up.

See cockroachdb#19404
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants