distsqlrun: Fix RowChannel race in outbox upon context cancellation. #17870
Please make sure to unskip the tests 😄
Do you think this fixes #17869 as well?
If so, mind adding it to your PR description?
c1cbef1 to 53715bc
After an offline chat with @andreimatei I decided to implement a specialized […]. This eliminates race conditions in […]. PTAL!
LGTM. PR and commit messages should be identical.
Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.
pkg/sql/distsql_running.go, line 228 at r2 (raw file):
comment that this is supposed to be set atomically
"has been async set" -> async wrt what? I'd drop it.
pkg/sql/distsqlrun/base.go, line 96 at r2 (raw file):
hint who implements this and say something about why cancellation has to be signaled this way to the DistSQLReceiver but not to other consumers.
pkg/sql/distsqlplan/aggregator_funcs_test.go, line 61 at r1 (raw file):
comment literals inline, here and elsewhere
Comments from Reviewable
The race was being caused in the `RunSyncFlow()` case only: when the flow's `syncFlowConsumer` is an outbox, not a distSQLReceiver. If `flow.cancel` runs after `outbox.ProducerDone()` has been called by a router/processor, and tries to push an error into it, the outbox panics since its `RowChannel` is already closed. This PR fixes this race by adding the ability to mark the `distSQLReceiver` on the gateway node as cancelled asynchronously of the `Push`/`ProducerDone` calls, and more importantly, doing nothing when the `syncFlowConsumer` is an outbox (or anything other than a distSQLReceiver). Fixes cockroachdb#17851, fixes cockroachdb#17864.
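The failure mode and the fix described above can be sketched in a few lines of Go. This is a minimal illustration, not the code from this PR: `rowReceiver`, `cancellable`, `chanOutbox`, `gatewayReceiver`, `unsafeCancel` and `safeCancel` are all hypothetical names, and the outbox's `RowChannel` is modelled as a plain Go channel. It assumes only what the commit message states: pushing into a consumer whose channel has already been closed by `ProducerDone()` panics, and the fix signals cancellation through an atomically set flag on the gateway receiver while doing nothing for any other consumer.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errCancelled = errors.New("query cancelled")

// rowReceiver is a hypothetical stand-in for the flow's consumer interface.
type rowReceiver interface {
	Push(err error)
	ProducerDone()
}

// cancellable is a hypothetical optional interface that only the gateway
// receiver implements.
type cancellable interface {
	SetCancelled()
}

// chanOutbox models an outbox whose channel is closed by ProducerDone.
type chanOutbox struct{ ch chan error }

func (o *chanOutbox) Push(err error) { o.ch <- err }
func (o *chanOutbox) ProducerDone()  { close(o.ch) }

// gatewayReceiver models the receiver on the gateway node. The cancelled
// flag is only ever read and written atomically.
type gatewayReceiver struct {
	cancelled int32
	pushed    int
}

func (r *gatewayReceiver) SetCancelled()   { atomic.StoreInt32(&r.cancelled, 1) }
func (r *gatewayReceiver) Cancelled() bool { return atomic.LoadInt32(&r.cancelled) == 1 }
func (r *gatewayReceiver) ProducerDone()   {}
func (r *gatewayReceiver) Push(err error) {
	if r.Cancelled() {
		return // drop input once cancellation has been signalled
	}
	r.pushed++
}

// unsafeCancel mimics the old behaviour: push an error into whatever the
// consumer is. It recovers the panic only so the demo can report it.
func unsafeCancel(c rowReceiver) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	c.Push(errCancelled)
	return false
}

// safeCancel mimics the fixed behaviour: set the flag if the consumer
// supports it, otherwise do nothing.
func safeCancel(c rowReceiver) (signalled bool) {
	if r, ok := c.(cancellable); ok {
		r.SetCancelled()
		return true
	}
	return false
}

func main() {
	ob := &chanOutbox{ch: make(chan error, 1)}
	ob.ProducerDone() // a router/processor finished before cancel ran

	fmt.Println("old cancel panicked on outbox:", unsafeCancel(ob))
	fmt.Println("new cancel touched outbox:", safeCancel(ob))

	recv := &gatewayReceiver{}
	fmt.Println("new cancel signalled receiver:", safeCancel(recv), recv.Cancelled())
}
```

The type assertion in `safeCancel` is one way to express "doing nothing when the consumer is anything other than the gateway receiver"; the actual PR may distinguish the two cases differently.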
53715bc to c92af32
Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions.
pkg/sql/distsql_running.go, line 228 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Done.
pkg/sql/distsqlrun/base.go, line 96 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Done.
pkg/sql/distsqlplan/aggregator_funcs_test.go, line 61 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Removed in latest revision.