Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvclient: fix an infinite loop in the DistSender #51085

Merged
merged 1 commit into from
Jul 8, 2020

Conversation

andreimatei
Copy link
Contributor

@andreimatei andreimatei commented Jul 7, 2020

This patch fixes some silly code which deals with the situation in which
sendToReplicas() needs to try another replica, but some of the replicas
with which it started are known to be stale. The code tries to skip the
stale replicas except that instead of skipping anything, it was just
looping endlessly.

This should fix recent timeouts of tests with stacktraces in
DistSender.sendToReplicas().

Fixes #51061

Release note: None

@andreimatei andreimatei requested a review from tbg July 7, 2020 19:35
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

cc @knz enjoy vacay

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Add a PR comment that this is a likely contributor to recent test failures in which TestClusters deadlocked during shutdown with a stack in sendTToReplicas.

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/kvclient/kvcoord/transport.go, line 87 at r1 (raw file):

	// - the replica that NextReplica() would return is skipped.
	//
	// Returns true if, after skipping the transport is not exhausted. Returns

The return value seems useless - we have IsExhausted() already. You're also not using it in production code, so I would prefer not returning anything here to keep things clean and simple.


pkg/kv/kvclient/kvcoord/transport.go, line 397 at r1 (raw file):

func (s *senderTransport) SkipReplica() bool {
	s.called = true

Can you add a comment on this field about when it is set? And then a comment here why we set it here.

This patch fixes some silly code which deals with the situation in which
sendToReplicas() needs to try another replica, but some of the replicas
with which it started are known to be stale. The code tries to skip the
stale replicas except that instead of skipping anything, it was just
looping endlessly.

This should fix recent timeouts of tests with stacktraces in
DistSender.sendToReplicas().

Fixes cockroachdb#51061

Release note: None
@andreimatei andreimatei force-pushed the fix-distsender-loop branch from f7a7ef5 to 8aff78b Compare July 8, 2020 15:21
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a PR comment that this is a likely contributor to recent test failures in which TestClusters deadlocked during shutdown with a stack in sendTToReplicas.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)


pkg/kv/kvclient/kvcoord/transport.go, line 87 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The return value seems useless - we have IsExhausted() already. You're also not using it in production code, so I would prefer not returning anything here to keep things clean and simple.

done


pkg/kv/kvclient/kvcoord/transport.go, line 397 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you add a comment on this field about when it is set? And then a comment here why we set it here.

added some words

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a PR comment that this is a likely contributor to recent test failures in which TestClusters deadlocked during shutdown with a stack in sendTToReplicas.

done

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)

@andreimatei
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 8, 2020

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

server: testcluster can deadlock on shutdown
3 participants