
Query scheduler: Fix a panic race in request queue handling #8451

Merged
merged 11 commits into main from davidgrant/query-sched-panic on Jun 22, 2024

Conversation

seizethedave
Contributor

@seizethedave seizethedave commented Jun 20, 2024

What this PR does

Fixes a rarely-seen panic:

2024-06-19 12:44:14.156	panic: interface conversion: queue.Request is nil, not *queue.SchedulerRequest
2024-06-19 12:44:14.156	
2024-06-19 12:44:14.156	goroutine 56556469 [running]:
2024-06-19 12:44:14.163	github.com/grafana/mimir/pkg/scheduler.(*Scheduler).QuerierLoop(0xc001b5b208, {0x3d4d300, 0xc0010d9c80})
2024-06-19 12:44:14.163		/drone/src/vendor/github.com/grafana/mimir/pkg/scheduler/scheduler.go:430 +0x575
2024-06-19 12:44:14.163	github.com/grafana/mimir/pkg/scheduler/schedulerpb._SchedulerForQuerier_QuerierLoop_Handler({0x3407ca0?, 0xc001b5b208}, {0x3d39b38, 0xc003180cc0})
2024-06-19 12:44:14.163		/drone/src/vendor/github.com/grafana/mimir/pkg/scheduler/schedulerpb/scheduler.pb.go:885 +0xd8
2024-06-19 12:44:14.163	github.com/grafana/mimir/pkg/mimir.ThanosTracerStreamInterceptor({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0}, 0x5?, 0x37c8e60)
2024-06-19 12:44:14.163		/drone/src/vendor/github.com/grafana/mimir/pkg/mimir/tracing.go:25 +0xcf
2024-06-19 12:44:14.163	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0})
2024-06-19 12:44:14.163		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/grafana/mimir/pkg/util/noauth.SetupAuthMiddleware.func2({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0}, 0xc00474cd38, 0xc001487d00)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/grafana/mimir/pkg/util/noauth/no_auth.go:38 +0x73
2024-06-19 12:44:14.175	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0})
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/grafana/mimir/pkg/querier/api.ReadConsistencyServerStreamInterceptor({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0}, 0xc00474cd38?, 0xc001487cc0)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/grafana/mimir/pkg/querier/api/consistency.go:96 +0x1a2
2024-06-19 12:44:14.175	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0})
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/grafana/dskit/middleware.StreamServerInstrumentInterceptor.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0}, 0xc00474cd38, 0xc001487c80)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/grafana/dskit/middleware/grpc_instrumentation.go:57 +0xbd
2024-06-19 12:44:14.175	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3b358, 0xc003180ca0})
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/opentracing-contrib/go-grpc.OpenTracingStreamServerInterceptor.func1({0x3407ca0, 0xc001b5b208}, {0x3d3ac88, 0xc003951950}, 0xc00474cd38, 0xc001487ac0)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/opentracing-contrib/go-grpc/server.go:114 +0x34a
2024-06-19 12:44:14.175	google.golang.org/grpc.getChainStreamHandler.func1({0x3407ca0, 0xc001b5b208}, {0x3d3ac88, 0xc003951950})
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1515 +0xb2
2024-06-19 12:44:14.175	github.com/grafana/dskit/middleware.GRPCServerLog.StreamServerInterceptor({{0x3d03d60?, 0xc0011ef800?}, 0xa0?, 0x92?}, {0x3407ca0, 0xc001b5b208}, {0x3d3ac88, 0xc003951950}, 0xc00474cd38, 0xc001487a80)
2024-06-19 12:44:14.175		/drone/src/vendor/github.com/grafana/dskit/middleware/grpc_logging.go:87 +0xab
2024-06-19 12:44:14.175	google.golang.org/grpc.NewServer.chainStreamServerInterceptors.chainStreamInterceptors.func2({0x3407ca0, 0xc001b5b208}, {0x3d3ac88, 0xc003951950}, 0xc00474cd38, 0xc0010d9bf0?)
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1506 +0x85
2024-06-19 12:44:14.175	google.golang.org/grpc.(*Server).processStreamingRPC(0xc001bc9600, {0x3d2deb8, 0xc005a082a0}, {0x3d53960, 0xc00176e600}, 0xc0031a2900, 0xc001b67380, 0x5786120, 0x0)
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1670 +0x11e7
2024-06-19 12:44:14.175	google.golang.org/grpc.(*Server).handleStream(0xc001bc9600, {0x3d53960, 0xc00176e600}, 0xc0031a2900)
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1784 +0xe3a
2024-06-19 12:44:14.175	google.golang.org/grpc.(*Server).serveStreams.func2.1()
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1019 +0x8b
2024-06-19 12:44:14.175	created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 56556440
2024-06-19 12:44:14.175		/drone/src/vendor/google.golang.org/grpc/server.go:1030 +0x125

This panic happened as the server was shutting down. The cause: when the querier connection's context is canceled, send can take the ctx.Done() select branch, writing no value to the channel but still closing it via its deferred close, so the consumer reads a zero-valued (nil) request from the closed channel and the subsequent type assertion panics. This PR simply makes send not close the channel.
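For illustration only, here is a minimal standalone Go program (not the Mimir code - Request and schedulerRequest are stand-ins for queue.Request and *queue.SchedulerRequest) showing the two ingredients of the failure: a receive from a closed channel yields the interface's zero value (nil), and a direct type assertion on a nil interface panics with an interface-conversion error like the one above.

package main

import "fmt"

// Stand-ins for queue.Request and *queue.SchedulerRequest.
type Request interface{}

type schedulerRequest struct{ id int }

func main() {
	recvChan := make(chan Request)

	// Simulate send() taking the ctx.Done() branch: nothing is ever written,
	// but the deferred close still runs.
	close(recvChan)

	req := <-recvChan // receiving from a closed channel yields the zero value, i.e. nil
	fmt.Println(req)  // <nil>

	// A direct (non comma-ok) type assertion on a nil interface panics,
	// analogous to the interface-conversion panic in the report above.
	_ = req.(*schedulerRequest)
}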

I added a test that reproduces this roughly a few times per 100K runs. Before:

davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
--- FAIL: TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled (0.00s)
    queue_test.go:663:
                Error Trace:    /Users/davidgrant/dev/mimir/pkg/scheduler/queue/queue_test.go:663
                Error:          An error is expected but got nil.
                Test:           TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled
--- FAIL: TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled (0.00s)
    queue_test.go:663:
                Error Trace:    /Users/davidgrant/dev/mimir/pkg/scheduler/queue/queue_test.go:663
                Error:          An error is expected but got nil.
                Test:           TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled
FAIL
FAIL    github.com/grafana/mimir/pkg/scheduler/queue    2.416s
FAIL

After:

davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.234s
davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.212s
davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.231s
davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.211s
davidgrant@rube mimir % go test -timeout 30s -count 100000 -run ^TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled$ github.com/grafana/mimir/pkg/scheduler/queue
ok      github.com/grafana/mimir/pkg/scheduler/queue    2.200s

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@seizethedave seizethedave changed the title Fix a panic race when the writer context is canceled. Query scheduler: Fix a panic race when the writer context is canceled. Jun 20, 2024
@seizethedave seizethedave marked this pull request as ready for review June 20, 2024 21:56
@seizethedave seizethedave requested a review from a team as a code owner June 20, 2024 21:56
@seizethedave seizethedave changed the title Query scheduler: Fix a panic race when the writer context is canceled. Query scheduler: Fix a panic race in request queue handling Jun 20, 2024
if !ok {
// Channel was closed without having a value written to it. This can happen if
// the write context was canceled.
return nil, last, ErrStopped
Member

@francoposa francoposa Jun 20, 2024

At this point we only assume that a stop caused this, so returning ErrStopped may not be accurate.
ErrStopped has control-flow implications for the entire process, so we don't want to return it if we are not sure - note the other situations where we send ErrStopped.
If we really are shutting down, not sending ErrStopped is still fine, as context-canceled requests being dequeued is part of the normal flow handled while shutting down.

I think it might be better to fix this in send.

Right now send looks like this:

func (wqc *waitingQuerierConn) send(req requestForQuerier) bool {
	defer close(wqc.recvChan)

	select {
	case wqc.recvChan <- req:
		return true
	case <-wqc.querierConnCtx.Done():
		return false
	}
}

but the select statement waiting on this end, right below here, is also holding a reference to the context and waiting on its Done(). When that fires, it returns nil, last, ctx.Err() - I think it's preferable to hit that case and let the ctx.Err() get returned.

func (wqc *waitingQuerierConn) send(req requestForQuerier) bool {
	select {
	case wqc.recvChan <- req:
		close(wqc.recvChan)
		return true
	case <-wqc.querierConnCtx.Done():
		// do not close the channel here;
		// the waiting goroutine in WaitForRequestForQuerier will receive the context cancel and handle it;
		// simply notify the caller in trySendNextRequestForQuerier that we did not successfully send a request
		return false
	}
}

I think the reason this has shown up so rarely is precisely this: the receiver here is also waiting on a select case of context.Done() with the same context. Select branches are chosen at random when more than one is ready, so you ended up in the rare case where both are true - the channel has been closed without anything written and the context is canceled - but we handled the channel-close-read-nil case before we handled the context cancel.
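As a standalone illustration of that randomization (plain Go, not Mimir code): when a closed channel and a canceled context are both ready, select picks a branch pseudo-randomly, so the closed-channel (nil-read) branch wins roughly half the time.

package main

import (
	"context"
	"fmt"
)

func main() {
	closedWins := 0
	for i := 0; i < 1000; i++ {
		ch := make(chan int)
		close(ch) // closed without anything ever being written

		ctx, cancel := context.WithCancel(context.Background())
		cancel() // already canceled

		select {
		case _, ok := <-ch:
			if !ok {
				closedWins++ // read the zero value from the closed channel
			}
		case <-ctx.Done():
			// handled the cancellation first
		}
	}
	fmt.Printf("closed-channel branch won %d/1000 times\n", closedWins)
}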

Contributor Author

@seizethedave seizethedave Jun 20, 2024

Cool. I was going to ask you about that, as it wasn't clear what error to return. Does closing recvChan still have value if we change send so send+close are always paired?

Member

Yeah, I guess we don't really need to close it. Still maybe worth including a comment about why we don't close it.

There's definitely a maybe-cleaner alternate implementation here where, instead of both sides checking the same context cancel, only the send side checks the context and then sends the ctx.Err() to an error channel.
That way the receiving side can wait on either the recvChan or the errChan instead of checking either recvChan or context cancel. The reason we don't use an errChan right now is that we always want to send the lastTenantIndex back no matter what - that might go away in the future, but either way it's probably not worth fiddling with for now.
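A rough sketch of what that alternative could look like - illustrative only; the errChan field, the receive helper, and these trimmed-down type definitions are assumptions, not the actual Mimir code (assumes the standard context package is imported):

type requestForQuerier struct{}

type waitingQuerierConn struct {
	querierConnCtx context.Context
	recvChan       chan requestForQuerier
	errChan        chan error // buffered with capacity 1 so the error send never blocks
}

func (wqc *waitingQuerierConn) send(req requestForQuerier) bool {
	select {
	case wqc.recvChan <- req:
		return true
	case <-wqc.querierConnCtx.Done():
		// Only the sender watches the context; forward the error so the
		// receiver can select on recvChan/errChan without touching the context.
		wqc.errChan <- wqc.querierConnCtx.Err()
		return false
	}
}

// Receiver side: wait on either channel instead of the context.
// (The real code also needs to hand back lastTenantIndex on the error path,
// which is why this approach isn't used today - see above.)
func (wqc *waitingQuerierConn) receive() (requestForQuerier, error) {
	select {
	case req := <-wqc.recvChan:
		return req, nil
	case err := <-wqc.errChan:
		return requestForQuerier{}, err
	}
}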

Contributor Author

Cool. Let me know how that looks.

There's definitely a maybe cleaner alternate implementation

The way that WaitForRequestForQuerier looks directly at ctx.Done() and gives up when that is closed definitely has clarity going for it. Any time things are plumbed through multiple channels I have to start drawing pictures. But I'm not very intimate with this particular code, so you might be right.

@seizethedave seizethedave requested a review from francoposa June 21, 2024 00:18
@dimitarvdimitrov
Contributor

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase onto main and then move the CHANGELOG entry added/updated in this PR to the top of the CHANGELOG document. Thanks!

@@ -622,6 +622,48 @@ func TestRequestQueue_GetNextRequestForQuerier_ShouldReturnImmediatelyIfQuerierI
require.EqualError(t, err, "querier has informed the scheduler it is shutting down")
}

func TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled(t *testing.T) {
// This tests a rare (~5 in 100K test runs) race where a canceled querier
Member

I am not sure I understand this - the test always passes now that the fix is in, but it almost always passes without the fix too?

Contributor Author

Yeah, the old "add a failing test and fix it" dance. It's just a test that didn't fail 100% of the time.

Contributor Author

Perhaps no comment would be better, and future archeologists can follow the PR history.

Member

I'm wondering whether we should have a test at all that passes 99% of the time even when the fix isn't there,
but I do not feel strongly either way.

Contributor Author

It just becomes coverage for a codepath that wasn't really tested before. I've just expanded one of your existing tests to also exercise that code.

@@ -437,7 +437,8 @@ func (q *RequestQueue) WaitForRequestForQuerier(ctx context.Context, last Tenant
querierConnCtx: ctx,
querierID: QuerierID(querierID),
lastTenantIndex: last,
recvChan: make(chan requestForQuerier),
// recvChan is written to by the dispatcher loop and read below. It is never closed.
Member

Let's just move a comment about this down to send, where it is actually relevant.

Contributor Author

I'm actually dubious about adding this comment at all. I don't think it's surprising or notable to leave a channel open. Seeing a channel closed is notable, as I know I'll need to make sure writers and readers are handling the closed channel appropriately. How about I just delete the comment?

Member

deleting sounds good to me

Member

@francoposa francoposa left a comment

approved pending some minor docstring/comment cleanup

@@ -574,6 +574,16 @@ func TestRequestQueue_GetNextRequestForQuerier_ShouldReturnAfterContextCancelled
})

queue.SubmitRegisterQuerierConnection(querierID)

// Calling WaitForRequestForQuerier with a context that is already cancelled should fail immediately.
Member

Nice, I like this.

@seizethedave seizethedave merged commit 3410422 into main Jun 22, 2024
29 checks passed
@seizethedave seizethedave deleted the davidgrant/query-sched-panic branch June 22, 2024 20:12