storage: de-flake TestRefreshPendingCommands #19404
The test ran a for loop without preemption points. The loop checked a condition that would only become true after another goroutine had been scheduled and carried out its job.

If, with only a few cores (four in my case), GC kicked in before that other goroutine got scheduled, the loop would just run hot forever until the test timed out, and the resulting stack dump was quite unhelpful.

Add a small sleep so the runtime can preempt the goroutine.

The issue was harder to run into when stressing only the test, since there was less garbage available at that point. By adding some print statements, I accidentally made it much more likely.

Previously flaked (got stuck) in under 500 iterations; now ran past 1.5k without problems.

Fixes cockroachdb#19397.
Fixes cockroachdb#19388.
Touches cockroachdb#19367.
Fixes cockroachdb#18554.
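For future readers, the pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual change to `pkg/storage/client_raft_test.go`: the names `waitUntil` and `demo`, the polling interval, and the timeout are all hypothetical. The idea is only that a polling loop which sleeps between checks blocks its goroutine, so the scheduler and GC get a chance to run, whereas a bare `for !cond() {}` with an inlined condition may never yield on the Go versions discussed in golang/go#10958.

```go
package main

import (
	"sync/atomic"
	"time"
)

// waitUntil polls cond until it returns true or the timeout expires.
// The sleep between checks parks the goroutine, giving the runtime a
// point at which it can schedule other goroutines or stop the world
// for GC. (Hypothetical helper, not part of the CockroachDB code base.)
func waitUntil(cond func() bool, interval, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for !cond() {
		if time.Now().After(deadline) {
			return false // give up instead of spinning forever
		}
		time.Sleep(interval)
	}
	return true
}

// demo starts a goroutine that flips a flag and then waits for the
// flag using the sleeping poll loop above.
func demo() bool {
	var done int32
	go func() {
		atomic.StoreInt32(&done, 1)
	}()
	return waitUntil(
		func() bool { return atomic.LoadInt32(&done) == 1 },
		time.Millisecond, // assumed polling interval
		5*time.Second,    // assumed upper bound for the wait
	)
}
```

Calling `demo()` should return `true` once the background goroutine has run. The explicit deadline is an addition for the sketch; it turns a potential hang into a failure that is easier to read than a timeout stack dump.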
So much plause for figuring this one out, @tschottdorf!
For future readers: golang/go#10958
ps @a-robinson I also introduced this loop, so ☘️ (shame-rock).
Many thanks for tracking down this flakiness so I don't have to. 🍻
Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.
pkg/storage/client_raft_test.go, line 1097 at r1 (raw file):
Was it really PS Rather than
The inlining is almost certainly why this goroutine wasn't preemptible, but I would have assumed that it was the other tests being stressed that were taking up the other threads.
No, I was stressing just this test. Here you go: https://gist.github.com/tschottdorf/cf092bb02ec624e68f5d7e6ac953b2bf
main goroutine:
https://gist.github.com/tschottdorf/cf092bb02ec624e68f5d7e6ac953b2bf
I'll send a PR to clarify the comment. Thanks again for tracking this down.
Awesome catch @tschottdorf! Is it too early for a peer ack?
@petermattis if you're going to clarify, could you swap over to
Yes, though I'm having trouble reproducing without this change, which is why I haven't sent that PR.
How many cores are you running this on?
Non-preemptible loops prevent GC from starting, which will cause the whole process to lock up. See cockroachdb#19404