release-23.2: rangefeed: fix scheduler catchup iterator race #114379
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Reviewed all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @blathers-crl[bot] and @erikgrinaker)
pkg/kv/kvserver/rangefeed/scheduled_processor.go
line 591 at r1 (raw file):
// Assert that we never process requests after stoppedC is closed. This is
// necessary to coordinate catchup iter ownership and avoid double-closing.
if buildutil.CrdbTestBuild {
This assertion is safe to make because it's run on the scheduler goroutine, right? So we know that `p.stoppedC` can't be closed concurrently with the assertion?
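To illustrate why such an assertion is race-free, here is a minimal sketch, not the CockroachDB code. The `worker` type, its fields, and its methods are hypothetical names chosen for this example. The only assumption carried over from the discussion is that a single goroutine both runs queued requests and closes `stoppedC`, so the check can never interleave with the close.

```go
package main

// worker is a hypothetical stand-in for the scheduler goroutine: it alone
// runs queued requests, and it alone closes stoppedC.
type worker struct {
	requests chan func()
	stop     chan struct{}
	stoppedC chan struct{}
}

func newWorker() *worker {
	return &worker{
		requests: make(chan func(), 8),
		stop:     make(chan struct{}),
		stoppedC: make(chan struct{}),
	}
}

// isStopped does a non-blocking check of whether stoppedC has been closed.
func (w *worker) isStopped() bool {
	select {
	case <-w.stoppedC:
		return true
	default:
		return false
	}
}

// run is the single goroutine that processes requests and performs the stop.
func (w *worker) run() {
	for {
		select {
		case req := <-w.requests:
			// Safe to assert here: only this goroutine closes stoppedC, and
			// it returns right after doing so, so no request can follow.
			if w.isStopped() {
				panic("request processed after stoppedC was closed")
			}
			req()
		case <-w.stop:
			close(w.stoppedC)
			return
		}
	}
}
```

Starting `run` in a goroutine, sending closures on `requests`, and then closing `stop` exercises both paths. The assertion would only become racy if some other goroutine were also allowed to close `stoppedC`.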
pkg/kv/kvserver/rangefeed/scheduled_processor.go
line 603 at r1 (raw file):
return r
case <-p.stoppedC:
// If the request was processed concurrently with a stop, there's a 50%
"processed concurrently"
Is this the right phrasing? The request is processed on the same thread of execution that closes `stoppedC`, right? So this isn't guarding against the case where the processing is concurrent with a stop, but rather, when the request is processed (and `result` signaled) and the scheduler is stopped (and `stoppedC` closed) in quick succession.
I guess that is concurrent from the perspective of external users of the scheduler, but since we're listening on internal channels in this code, we're not exactly an external user. Spelling this out might help readers though, or at least those who are just starting to familiarize themselves with this code.
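The "50%" in the quoted comment follows from how Go's `select` behaves when more than one case is ready: it picks one of the ready cases pseudo-randomly. The sketch below is a standalone illustration; `countBranches` is a hypothetical helper, and the channel names only mirror the ones discussed here.

```go
package main

// countBranches runs a two-way select n times with both channels already
// ready, and reports how often each branch was taken.
func countBranches(n int) (results, stops int) {
	stoppedC := make(chan struct{})
	close(stoppedC) // a closed channel is always ready to receive from

	for i := 0; i < n; i++ {
		result := make(chan bool, 1)
		result <- true // the result has already been signaled

		select {
		case <-result:
			results++
		case <-stoppedC:
			// Taken roughly half the time even though a result is waiting.
			stops++
		}
	}
	return results, stops
}
```

With a large `n`, both counters should come out non-zero and roughly equal. A caller that treats the `stoppedC` branch as "the request never ran" would therefore be wrong about half the time in this situation.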
It was possible for the scheduled processor to hand ownership of the catchup iterator over to the registration, but claim that it didn't by returning `false` from `Register()`.

This can happen if the registration request is queued concurrently with a processor shutdown, where the registration will execute the catchup scan and close the iterator, but the caller will think it wasn't registered and double-close the iterator.

This patch fixes the race, and also documents the necessary invariant along with a runtime assertion.

Epic: none
Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
pkg/kv/kvserver/rangefeed/scheduled_processor.go
line 591 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This assertion is safe to make because it's run on the scheduler goroutine, right? So we know that `p.stoppedC` can't be closed concurrently with the assertion?
That's right.
pkg/kv/kvserver/rangefeed/scheduled_processor.go
line 603 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"processed concurrently"
Is this the right phrasing? The request is processed on the same thread of execution that closes `stoppedC`, right? So this isn't guarding against the case where the processing is concurrent with a stop, but rather, when the request is processed (and `result` signaled) and the scheduler is stopped (and `stoppedC` closed) in quick succession.
I guess that is concurrent from the perspective of external users of the scheduler, but since we're listening on internal channels in this code, we're not exactly an external user. Spelling this out might help readers though, or at least those who are just starting to familiarize themselves with this code.
That's right. From the point of view of this `select`, they were processed concurrently -- it can observe both `result` and `stoppedC` becoming ready at the same time, and that's the condition for the race. But you're right that the actual processing and stopping did not overlap in real time.
Updated the comments here, submitted #114493 for `master`.
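One common way to close this kind of gap is to re-check the result channel without blocking once the stop branch has been taken. The sketch below shows that pattern in isolation and is not a copy of the patch. `awaitResult` is a hypothetical helper, and it assumes `result` is buffered and that any result is sent before `stoppedC` is closed, which is the invariant the runtime assertion is described as guarding.

```go
package main

// awaitResult waits for either a result or a stop. If the stop branch wins
// the select, it checks result once more without blocking, so a result that
// was signaled just before the stop is not lost.
//
// Assumption: any send on result happens before stoppedC is closed, and
// result is buffered so the sender never blocks.
func awaitResult[T any](result <-chan T, stoppedC <-chan struct{}) (r T, ok bool) {
	select {
	case r = <-result:
		return r, true
	case <-stoppedC:
		// Both channels may have been ready, in which case select chose
		// this branch at random. Look at result again before giving up.
		select {
		case r = <-result:
			return r, true
		default:
			return r, false
		}
	}
}
```

With this shape, the caller only sees `ok == false` when no result was ever produced, so the zero value it receives can be trusted to mean that the request did not run.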
114493: rangefeed: clarify `runRequest()` comments r=erikgrinaker a=erikgrinaker

See #114379 (review).

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Backport 1/1 commits from #114240 on behalf of @erikgrinaker.
/cc @cockroachdb/release
It was possible for the scheduled processor to hand ownership of the catchup iterator over to the registration, but claim that it didn't by returning `false` from `Register()`.
This can happen if the registration request is queued concurrently with a processor shutdown, where the registration will execute the catchup scan and close the iterator, but the caller will think it wasn't registered and double-close the iterator.
This patch fixes the race, and also documents the necessary invariant along with a runtime assertion.
Resolves #114192.
Epic: none
Release note: None
Release justification: fixes a potential panic.
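The ownership contract in the description can be sketched as follows. This is an illustrative model only: `iter`, `closeUnlessOwned`, `honestRegister`, and `misreportingRegister` are hypothetical names, not the rangefeed API. The model assumes that a `true` return means the callee took ownership of the iterator and closed it, and that a `false` return leaves the caller responsible for closing it.

```go
package main

// iter is a hypothetical iterator that panics when closed twice.
type iter struct{ closed bool }

func (it *iter) Close() {
	if it.closed {
		panic("iterator closed twice")
	}
	it.closed = true
}

// closeUnlessOwned models the caller side of the contract: it closes the
// iterator only when register reports that it did not take ownership.
func closeUnlessOwned(it *iter, register func(*iter) bool) bool {
	registered := register(it)
	if !registered {
		it.Close()
	}
	return registered
}

// honestRegister returns a register function that keeps the contract: it
// closes the iterator if and only if it returns true.
func honestRegister(accept bool) func(*iter) bool {
	return func(it *iter) bool {
		if !accept {
			return false
		}
		it.Close() // stand-in for running the catch-up scan, then closing
		return true
	}
}

// misreportingRegister models the bug described above: it takes ownership
// and closes the iterator, but still reports false.
func misreportingRegister(it *iter) bool {
	it.Close()
	return false
}
```

Passing `honestRegister(true)` or `honestRegister(false)` closes the iterator exactly once, while passing `misreportingRegister` makes `closeUnlessOwned` panic on the second `Close()`, which is the double-close the description refers to.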