-
Notifications
You must be signed in to change notification settings - Fork 576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query scheduler: Fix a panic race in request queue handling #8451
Conversation
pkg/scheduler/queue/queue.go
Outdated
if !ok { | ||
// Channel was closed without having a value written to it. This can happen if | ||
// the write context was canceled. | ||
return nil, last, ErrStopped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at this point we just think that a stop has caused this, returning ErrStopped may not be accurate.
ErrStopped has control-flow implications with the entire process so we don't want to return it if we are not sure - note the other situations where we send ErrStopped.
If we really are shutting down, not sending ErrStopped is still fine as context-canceled requests being dequeued is part of the normal flow handled while shutting down.
I think it might be better to fix this is in send
Right now send looks like this:
func (wqc *waitingQuerierConn) send(req requestForQuerier) bool {
defer close(wqc.recvChan)
select {
case wqc.recvChan <- req:
return true
case <-wqc.querierConnCtx.Done():
return false
}
}
but the select statement waiting on this end right below here is also holding a reference to the context and waiting for a Done()
. When it experiences this, it returns return nil, last, ctx.Err()
- I think it's preferable to hit that case and let the ctx.Err() get returned.
func (wqc *waitingQuerierConn) send(req requestForQuerier) bool {
select {
case wqc.recvChan <- req:
close(wqc.recvChan)
return true
case <-wqc.querierConnCtx.Done():
// do not close channel here;
// waiting goroutine in WaitForRequestForQuerier will receive the context cancel and handle it
// simply notify the caller in trySendNextRequestForQuerier that we did not successfully send a request
return false
}
}
I think the reason this has shown up so rarely is that precisely this, that the receiver here is also waiting on select case of context.Done() with the same context. Select branches are randomized so you ended up in a case rare case where both are true - the channel has been closed without anything written, and the context is canceled, but we handled the channel-close-read-nil case before we handled the context cancel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. I was going to ask you about that, as it wasn't clear what error to return. Does closing recvChan
still have value if we change send
so send+close are always paired?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I guess we don't really need to close it. still worth maybe including a comment about why we don't close it.
There's definitely a maybe cleaner alternate implementation here where instead of both sides checking the same context cancel, only the send
side checks it context, then it sends the ctx.Err()
to an error channel.
That way the receiving side can wait on either the recvChan or the errChan instead of checking either recvChan or context cancel. The reason we don't use an errChan right now is that we always want to send the lastTenantIndex
back no matter what - that might go away in the future but either way probably not worth fiddling with for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Let me know how that looks.
There's definitely a maybe cleaner alternate implementation
The way that WaitForRequestForQuerier
looks directly at ctx.Done()
and gives up when that is closed definitely has clarity going for it. Any time things are plumbed through multiple channels I have to start drawing pictures. But, I'm not very intimate with this particular code so you might be right.
The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase |
pkg/scheduler/queue/queue_test.go
Outdated
@@ -622,6 +622,48 @@ func TestRequestQueue_GetNextRequestForQuerier_ShouldReturnImmediatelyIfQuerierI | |||
require.EqualError(t, err, "querier has informed the scheduler it is shutting down") | |||
} | |||
|
|||
func TestRequestQueue_GetNextRequestForQuerier_ShouldReturnErrorIfCtxCanceled(t *testing.T) { | |||
// This tests a rare (~5 in 100K test runs) race where a canceled querier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand this - the test always passes now that the fix is in, but it almost always passes without the fix too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the old "add a failing test and fix it" dance. It's just a test that didn't fail 100% of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps no comment would be better, and future archeologists can follow the PR history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in wondering whether we should a have test at all that passes 99% of the time even when the fix isn’t there.
but I do not feel strongly either way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just becomes coverage for a codepath that wasn't quite tested before, I don't think. I've just expanded one of your existing tests to also exercise that code.
pkg/scheduler/queue/queue.go
Outdated
@@ -437,7 +437,8 @@ func (q *RequestQueue) WaitForRequestForQuerier(ctx context.Context, last Tenant | |||
querierConnCtx: ctx, | |||
querierID: QuerierID(querierID), | |||
lastTenantIndex: last, | |||
recvChan: make(chan requestForQuerier), | |||
// recvChan is written to by the dispatcher loop and read below. It is never closed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's just move a comment about this down to send
where it is actually relevant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am actually dubious of adding this comment at all. I don't think it's surprising or notable to leave a channel open. Seeing a channel closed is notable, as I know I'll need to make sure writers and readers are handling the closed channel appropriately. How about I just delete the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleting sounds good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved pending some minor docstring/comment cleanup
@@ -574,6 +574,16 @@ func TestRequestQueue_GetNextRequestForQuerier_ShouldReturnAfterContextCancelled | |||
}) | |||
|
|||
queue.SubmitRegisterQuerierConnection(querierID) | |||
|
|||
// Calling WaitForRequestForQuerier with a context that is already cancelled should fail immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice i like this
What this PR does
Fixes a rarely-seen panic:
This panic happened as the server was shutting down. The cause is send's select branches executing in arbitrary order when the context is canceled, so it sends no value and closes the channel, resulting in consumer code reading a zero-valued (nil) request. This PR just makes
send
not close the channel.I added a test that reproduces this a ~few times per 100K runs. Before:
After:
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.