Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.2: rangefeed: fix scheduler catchup iterator race #114379

Merged
merged 1 commit into from
Nov 15, 2023

Conversation

blathers-crl[bot]
Copy link

@blathers-crl blathers-crl bot commented Nov 13, 2023

Backport 1/1 commits from #114240 on behalf of @erikgrinaker.

/cc @cockroachdb/release


It was possible for the scheduled processor to hand ownership of the catchup iterator over to the registration, but claim that it didn't by returning false from Register().

This can happen if the registration request is queued concurrently with a processor shutdown, where the registration will execute the catchup scan and close the iterator, but the caller will think it wasn't registered and double-close the iterator.

This patch fixes the race, and also documents the necessary invariant along with a runtime assertion.

Resolves #114192.
Epic: none
Release note: None


Release justification: fixes a potential panic.

@blathers-crl blathers-crl bot requested a review from a team November 13, 2023 21:40
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.2-114240 branch from 6788041 to 8cf692e Compare November 13, 2023 21:40
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Nov 13, 2023
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.2-114240 branch from a1aff47 to cb5295b Compare November 13, 2023 21:40
Copy link
Author

blathers-crl bot commented Nov 13, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious
    issues
    or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined
    here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning areas TL and one additional
    TL. For more information as to how that review should be conducted, please consult the backport
    policy
    .
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Nov 13, 2023
Copy link
Author

blathers-crl bot commented Nov 13, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @blathers-crl[bot] and @erikgrinaker)


pkg/kv/kvserver/rangefeed/scheduled_processor.go line 591 at r1 (raw file):

		// Assert that we never process requests after stoppedC is closed. This is
		// necessary to coordinate catchup iter ownership and avoid double-closing.
		if buildutil.CrdbTestBuild {

This assertion is safe to make because it's run on the scheduler goroutine, right? So we know that p.stoppedC can't be closed concurrently with the assertion?


pkg/kv/kvserver/rangefeed/scheduled_processor.go line 603 at r1 (raw file):

		return r
	case <-p.stoppedC:
		// If the request was processed concurrently with a stop, there's a 50%

"processed concurrently"

Is this the right phrasing? The request is processed on the same thread of execution that closes stoppedC, right? So this isn't guarding against the case where the processing is concurrent with a stop, but rather, when the request is processed (and result signaled) and the scheduler is stopped (and stoppedC closed) in quick succession.

I guess that is concurrent from the perspective of external users of the scheduler, but since we're listening on internal channels in this code, we're not exactly an external user. Spelling this out might help readers though, or at least those who are just starting to familiarize themselves with this code.

It was possible for the scheduled processor to hand ownership of the
catchup iterator over to the registration, but claim that it didn't by
returning `false` from `Register()`.

This can happen if the registration request is queued concurrently with
a processor shutdown, where the registration will execute the catchup
scan and close the iterator, but the caller will think it wasn't
registered and double-close the iterator.

This patch fixes the race, and also documents the necessary invariant
along with a runtime assertion.

Epic: none
Release note: None
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


pkg/kv/kvserver/rangefeed/scheduled_processor.go line 591 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This assertion is safe to make because it's run on the scheduler goroutine, right? So we know that p.stoppedC can't be closed concurrently with the assertion?

That's right.


pkg/kv/kvserver/rangefeed/scheduled_processor.go line 603 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"processed concurrently"

Is this the right phrasing? The request is processed on the same thread of execution that closes stoppedC, right? So this isn't guarding against the case where the processing is concurrent with a stop, but rather, when the request is processed (and result signaled) and the scheduler is stopped (and stoppedC closed) in quick succession.

I guess that is concurrent from the perspective of external users of the scheduler, but since we're listening on internal channels in this code, we're not exactly an external user. Spelling this out might help readers though, or at least those who are just starting to familiarize themselves with this code.

That's right. From the point of view of this select, they were processed concurrently -- it can observe both result and stoppedC becoming ready at the same time, and that's the condition for the race. But you're right that the actual processing and stopping did not overlap in real time.

Updated the comments here, submitted #114493 for master.

@erikgrinaker erikgrinaker merged commit de8d3d0 into release-23.2 Nov 15, 2023
@erikgrinaker erikgrinaker deleted the blathers/backport-release-23.2-114240 branch November 15, 2023 16:24
craig bot pushed a commit that referenced this pull request Nov 29, 2023
114493: rangefeed: clarify `runRequest()` comments r=erikgrinaker a=erikgrinaker

See #114379 (review).

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants