
Query scheduler: Fix a panic race in request queue handling #8451

Merged
merged 11 commits into main from davidgrant/query-sched-panic on Jun 22, 2024

Conversation

seizethedave
Contributor

@seizethedave seizethedave commented Jun 20, 2024

What this PR does

Fixes a rarely-seen panic:

2024-06-19 12:44:14.156	panic: interface conversion: queue.Request is nil, not *queue.SchedulerRequest
2024-06-19 12:44:14.156	
2024-06-19 12:44:14.156	goroutine 56556469 [running]:
2024-06-19 12:44:14.163	github.com/grafana/mimir/pkg/scheduler.(*Scheduler).QuerierLoop(0xc001b5b208, {0x3d4d300, 0xc0010d9c80})
2024-06-19 12:44:14.163		/drone/src/vendor/github.com/grafana/mimir/pkg/scheduler/scheduler.go:430 +0x575
2024-06-19 12:44:14.163	github.com/grafana/mimir/pkg/scheduler/schedulerpb._SchedulerForQuerier_QuerierLoop_Handler({0x3407ca0?, 0xc001b5b208}, {0x3d39b38, 0xc003180cc0})
2024-06-19 12:44:14.163		/drone/src/vendor/github.com/grafana/mimir/pkg/scheduler/schedulerpb/scheduler.pb.go:885 +0xd8
2024-06-19 12:44:14.163	github.com/grafana/mimir/pkg/mimir.ThanosTracerStreamInterceptor({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0}, 0x5?, 0x37c8e60)
2024-06-19 12:44:14.163		/drone/src/vendor/github.com/grafana/mimir/pkg/mimir/tracing.go:25 +0xcf
2024-06-19 12:44:14.163	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0})
2024-06-19 12:44:14.163		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/grafana/mimir/pkg/util/noauth.SetupAuthMiddleware.func2({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0}, 0xc00474cd38, 0xc001487d00)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/grafana/mimir/pkg/util/noauth/no_auth.go:38 +0x73
2024-06-19 12:44:14.175	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0})
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/grafana/mimir/pkg/querier/api.ReadConsistencyServerStreamInterceptor({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0}, 0xc00474cd38?, 0xc001487cc0)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/grafana/mimir/pkg/querier/api/consistency.go:96 +0x1a2
2024-06-19 12:44:14.175	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0})
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/grafana/dskit/middleware.StreamServerInstrumentInterceptor.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0}, 0xc00474cd38, 0xc001487c80)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/grafana/dskit/middleware/grpc_instrumentation.go:57 +0xbd
2024-06-19 12:44:14.175	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0})
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/opentracing-contrib/go-grpc.OpenTracingStreamServerInterceptor.func1({0x3407ca0, 0xc001b5b208}, {0x3d3ac88, 0xc003951950}, 0xc00474cd38, 0xc001487ac0)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/opentracing-contrib/go-grpc/server.go:114 +0x34a
2024-06-19 12:44:14.175	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3ac88, 0xc003951950})
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/grafana/dskit/middleware.GRPCServerLog.StreamServerInterceptor({{0x3d03d60?, 0xc0011ef800?}, 0xa0?, 0x92?}, {0x3407ca0, 0xc001b5b208}, {0x3d3ac88, 0xc003951950}, 0xc00474cd38, 0xc001487a80)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/grafana/dskit/middleware/grpc_logging.go:87 +0xab
2024-06-19 12:44:14.175	google.golang.org/grpc.NewServer.chainStreamServerInterceptors.chainStreamInterceptors.func2({0x3407ca0, 0xc001b5b208}, {0x3d3ac88, 0xc003951950}, 0xc00474cd38, 0xc0010d9bf0?)
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1506 +0x85
2024-06-19 12:44:14.175	google.golang.org/grpc.(*Server).processStreamingRPC(0xc001bc9600, {0x3d2deb8, 0xc005a082a0}, {0x3d53960, 0xc00176e600}, 0xc0031a2900, 0xc001b67380, 0x5786120, 0x0)
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1670 +0x11e7
2024-06-19 12:44:14.175	google.golang.org/grpc.(*Server).handleStream(0xc001bc9600, {0x3d53960, 0xc00176e600}, 0xc0031a2900)
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1784 +0xe3a
2024-06-19 12:44:14.175	google.golang.org/grpc.(*Server).serveStreams.func2.1()
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1019 +0x8b
2024-06-19 12:44:14.175	created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 56556440
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1030 +0x125

This panic happened as the server was shutting down. The cause: when the querier connection's context is canceled, send can take the ctx.Done() select branch, writing no value to the channel but still closing it via its deferred close, so the consumer reads a zero-valued (nil) request from the closed channel and the subsequent type assertion panics. This PR simply makes send not close the channel.
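For illustration only, here is a minimal standalone Go program (not the Mimir code - Request and schedulerRequest are stand-ins for queue.Request and *queue.SchedulerRequest) showing the two ingredients of the failure: a receive from a closed channel yields the interface's zero value (nil), and a direct type assertion on a nil interface panics with an interface-conversion error like the one above.

package main

import "fmt"

// Stand-ins for queue.Request and *queue.SchedulerRequest.
type Request interface{}

type schedulerRequest struct{ id int }

func main() {
	recvChan := make(chan Request)

	// Simulate send() taking the ctx.Done() branch: nothing is ever written,
	// but the deferred close still runs.
	close(recvChan)

	req := <-recvChan // receiving from a closed channel yields the zero value, i.e. nil
	fmt.Println(req)  // <nil>

	// A direct (non comma-ok) type assertion on a nil interface panics,
	// analogous to the interface-conversion panic in the report above.
	_ = req.(*schedulerRequest)
}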

I added a test that reproduces this roughly a few times per 100K runs. Before:

davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
--- FAIL: TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled (0.00s)
    queue_test.go:663:
                Error Trace:    /Users/davidgrant/dev/mimir/pkg/scheduler/queue/queue_test.go:663
                Error:          An error is expected but got nil.
                Test:           TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled
--- FAIL: TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled (0.00s)
    queue_test.go:663:
                Error Trace:    /Users/davidgrant/dev/mimir/pkg/scheduler/queue/queue_test.go:663
                Error:          An error is expected but got nil.
                Test:           TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled
FAIL
FAIL    github.com/grafana/mimir/pkg/scheduler/queue    2.416s
FAIL

After:

davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.234s
davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.212s
davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.231s
davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.211s
davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.200s

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@seizethedave seizethedave changed the title Fix a panic race when the writer context is canceled. Query scheduler: Fix a panic race when the writer context is canceled. Jun 20, 2024
@seizethedave seizethedave marked this pull request as ready for review June 20, 2024 21:56
@seizethedave seizethedave requested a review from a team as a code owner June 20, 2024 21:56
@seizethedave seizethedave changed the title Query scheduler: Fix a panic race when the writer context is canceled. Query scheduler: Fix a panic race in request queue handling Jun 20, 2024
if !ok {
// Channel was closed without having a value written to it. This can happen if
// the write context was canceled.
return nil, last, ErrStopped
Member

@francoposa francoposa Jun 20, 2024

At this point we only assume that a stop caused this, so returning ErrStopped may not be accurate.
ErrStopped has control-flow implications for the entire process, so we don't want to return it if we are not sure - note the other situations where we send ErrStopped.
If we really are shutting down, not sending ErrStopped is still fine, as context-canceled requests being dequeued is part of the normal flow handled while shutting down.

I think it might be better to fix this in send.

Right now send looks like this:

func (wqc *waitingQuerierConn) send(req requestForQuerier) bool {
	defer close(wqc.recvChan)

	select {
	case wqc.recvChan <- req:
		return true
	case <-wqc.querierConnCtx.Done():
		return false
	}
}

but the select statement waiting on this end, right below here, is also holding a reference to the context and waiting on its Done(). When that fires, it returns nil, last, ctx.Err() - I think it's preferable to hit that case and let the ctx.Err() get returned.

func (wqc *waitingQuerierConn) send(req requestForQuerier) bool {
	select {
	case wqc.recvChan <- req:
		close(wqc.recvChan)
		return true
	case <-wqc.querierConnCtx.Done():
		// do not close the channel here;
		// the waiting goroutine in WaitForRequestForQuerier will receive the context cancel and handle it;
		// simply notify the caller in trySendNextRequestForQuerier that we did not successfully send a request
		return false
	}
}

I think the reason this has shown up so rarely is precisely this: the receiver here is also waiting on a select case of context.Done() with the same context. Select branches are chosen at random when more than one is ready, so you ended up in the rare case where both are true - the channel has been closed without anything written and the context is canceled - but we handled the channel-close-read-nil case before we handled the context cancel.
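As a standalone illustration of that randomization (plain Go, not Mimir code): when a closed channel and a canceled context are both ready, select picks a branch pseudo-randomly, so the closed-channel (nil-read) branch wins roughly half the time.

package main

import (
	"context"
	"fmt"
)

func main() {
	closedWins := 0
	for i := 0; i < 1000; i++ {
		ch := make(chan int)
		close(ch) // closed without anything ever being written

		ctx, cancel := context.WithCancel(context.Background())
		cancel() // already canceled

		select {
		case _, ok := <-ch:
			if !ok {
				closedWins++ // read the zero value from the closed channel
			}
		case <-ctx.Done():
			// handled the cancellation first
		}
	}
	fmt.Printf("closed-channel branch won %d/1000 times\n", closedWins)
}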

Contributor Author

@seizethedave seizethedave Jun 20, 2024

Cool. I was going to ask you about that, as it wasn't clear what error to return. Does closing recvChan still have value if we change send so send+close are always paired?

Member

Yeah, I guess we don't really need to close it. Still maybe worth including a comment about why we don't close it.

There's definitely a maybe-cleaner alternate implementation here where, instead of both sides checking the same context cancel, only the send side checks the context and then sends the ctx.Err() to an error channel.
That way the receiving side can wait on either the recvChan or the errChan instead of checking either recvChan or context cancel. The reason we don't use an errChan right now is that we always want to send the lastTenantIndex back no matter what - that might go away in the future, but either way it's probably not worth fiddling with for now.
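A rough sketch of what that alternative could look like - illustrative only; the errChan field, the receive helper, and these trimmed-down type definitions are assumptions, not the actual Mimir code (assumes the standard context package is imported):

type requestForQuerier struct{}

type waitingQuerierConn struct {
	querierConnCtx context.Context
	recvChan       chan requestForQuerier
	errChan        chan error // buffered with capacity 1 so the error send never blocks
}

func (wqc *waitingQuerierConn) send(req requestForQuerier) bool {
	select {
	case wqc.recvChan <- req:
		return true
	case <-wqc.querierConnCtx.Done():
		// Only the sender watches the context; forward the error so the
		// receiver can select on recvChan/errChan without touching the context.
		wqc.errChan <- wqc.querierConnCtx.Err()
		return false
	}
}

// Receiver side: wait on either channel instead of the context.
// (The real code also needs to hand back lastTenantIndex on the error path,
// which is why this approach isn't used today - see above.)
func (wqc *waitingQuerierConn) receive() (requestForQuerier, error) {
	select {
	case req := <-wqc.recvChan:
		return req, nil
	case err := <-wqc.errChan:
		return requestForQuerier{}, err
	}
}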

Contributor Author

Cool. Let me know how that looks.

There's definitely a maybe cleaner alternate implementation

The way that WaitForRequestForQuerier looks directly at ctx.Done() and gives up when that is closed definitely has clarity going for it. Any time things are plumbed through multiple channels I have to start drawing pictures. But I'm not very intimate with this particular code, so you might be right.

@seizethedave seizethedave requested a review from francoposa June 21, 2024 00:18
@dimitarvdimitrov
Contributor

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase onto main and then move the CHANGELOG entry added/updated in this PR to the top of the CHANGELOG document. Thanks!

@@ -622,6 +622,48 @@ func TestRequestQueue_GetNextRequestForQuerier_ShouldReturnImmediatelyIfQuerierI
require.EqualError(t, err, "querier has informed the scheduler it is shutting down")
}

func TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled(t *testing.T) {
// This tests a rare (~5 in 100K test runs) race where a canceled querier
Member

I am not sure I understand this - the test always passes now that the fix is in, but it almost always passes without the fix too?

Contributor Author

Yeah, the old "add a failing test and fix it" dance. It's just a test that didn't fail 100% of the time.

Contributor Author

Perhaps no comment would be better, and future archeologists can follow the PR history.

Member

I'm wondering whether we should have a test at all that passes 99% of the time even when the fix isn't there,
but I do not feel strongly either way.

Contributor Author

It just becomes coverage for a codepath that wasn't really tested before. I've just expanded one of your existing tests to also exercise that code.

@@ -437,7 +437,8 @@ func (q *RequestQueue) WaitForRequestForQuerier(ctx context.Context, last Tenant
querierConnCtx: ctx,
querierID: QuerierID(querierID),
lastTenantIndex: last,
recvChan: make(chan requestForQuerier),
// recvChan is written to by the dispatcher loop and read below. It is never closed.
Member

Let's just move a comment about this down to send, where it is actually relevant.

Contributor Author

I'm actually dubious about adding this comment at all. I don't think it's surprising or notable to leave a channel open. Seeing a channel closed is notable, as I know I'll need to make sure writers and readers are handling the closed channel appropriately. How about I just delete the comment?

Member

deleting sounds good to me

Member

@francoposa francoposa left a comment

approved pending some minor docstring/comment cleanup

@@ -574,6 +574,16 @@ func TestRequestQueue_GetNextRequestForQuerier_ShouldReturnAfterContextCancelled
})

queue.SubmitRegisterQuerierConnection(querierID)

// Calling WaitForRequestForQuerier with a context that is already cancelled should fail immediately.
Member

Nice, I like this.

@seizethedave seizethedave merged commit 3410422 into main Jun 22, 2024
29 checks passed
@seizethedave seizethedave deleted the davidgrant/query-sched-panic branch June 22, 2024 20:12