Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvcoord: Correctly handle stuck rangefeeds #92582

Merged
merged 1 commit into from
Nov 29, 2022
Merged

Conversation

miretskiy
Copy link
Contributor

Fixes #92570

A watcher responsible for restarting stuck range feeds may incorrectly cancel rangefeed if the downstream event consumer is slow.

Release note (bug fix): Fix incorrect cancellation logic when attempting to detect stuck range feeds.

@miretskiy miretskiy requested a review from a team as a code owner November 28, 2022 15:47
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -71,11 +72,11 @@ func (w *stuckRangeFeedCanceler) stop() {

// ping notifies the canceler that the rangefeed has received an
// event, i.e. is making progress.
func (w *stuckRangeFeedCanceler) ping() {
func (w *stuckRangeFeedCanceler) do(fn func() error) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is currently structured, we don't need to return error from either of these functions. The closure can close over err and we can check it outside of do().

There are a couple of other approaches we can consider:

  • Have do() also do error handling, returning errRestartStuckRange when the rangefeed is stuck.

  • Return true if the rangefeed is stuck when the function returns.

These would allow us to get rid of the stuck() method and simplify the caller logic. I think I have a preference for returning errRestartStuckRange and encapsulating as much of the logic as possible inside do(), but it's not a strongly held view. The current approach is also fine.

Copy link
Contributor Author

@miretskiy miretskiy Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack -- I was thinking of doing that. Still, I think returning an error is better -- I prefer not to close over errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I can get rid of stuck() -- we call it in few places , and I think it just reads better when error handling is explicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm not sure why we bother with returning errStuckRangefeed from do() in that case, since none of the callers use it anyway. You call though, nbd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted return of errStuckRange.

Copy link
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if I can get rid of stuck() -- but other comments addressed.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor comments, the core logic seems fine. Preemptively approving once addressed.

We should also have a test, either for the DistSender or as a CDC roachtest, which makes sure a stalled sink won't cause undue restarts.

@@ -71,11 +72,11 @@ func (w *stuckRangeFeedCanceler) stop() {

// ping notifies the canceler that the rangefeed has received an
// event, i.e. is making progress.
func (w *stuckRangeFeedCanceler) ping() {
func (w *stuckRangeFeedCanceler) do(fn func() error) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm not sure why we bother with returning errStuckRangefeed from do() in that case, since none of the callers use it anyway. You call though, nbd.

@miretskiy
Copy link
Contributor Author

ely approving once addressed.

We should also have a test, either for the DistSender or as a CDC roachtest

Added canceller test.

Fixes cockroachdb#92570

A watcher responsible for restarting stuck range feeds
may incorrectly cancel rangefeed if the downstream event
consumer is slow.

Release note (bug fix): Fix incorrect cancellation logic
when attempting to detect stuck range feeds.
@miretskiy
Copy link
Contributor Author

bors r+

1 similar comment
@miretskiy
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 29, 2022

Already running a review

@craig craig bot merged commit edc6fda into cockroachdb:master Nov 29, 2022
@craig
Copy link
Contributor

craig bot commented Nov 29, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvcoord: stuckRangeFeedCanceler can fire during event processing
3 participants