Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsqlrun: Fix RowChannel race in outbox upon context cancellation. #17870

Merged
merged 1 commit into from
Aug 24, 2017

Conversation

itsbilal
Copy link
Contributor

@itsbilal itsbilal commented Aug 23, 2017

The race was being caused in the RunSyncFlow() case only: when the flow's syncFlowConsumer is an outbox, not a distSQLReceiver. If flow.cancel runs after outbox.ProducerDone() has been called by a router/processor, and tries to push an error into it, the outbox panics since its RowChannel is already closed.

This PR fixes this race by adding the ability to mark the distSQLReceiver on the gateway node as cancelled asynchronously of the Push/ProducerDone calls, and more importantly, doing nothing when the syncFlowConsumer is an outbox (or anything other than a distSQLReceiver).

Fixes #17851, fixes #17864.

@itsbilal itsbilal requested review from andreimatei and a team August 23, 2017 21:01
@itsbilal itsbilal requested a review from a team as a code owner August 23, 2017 21:01
@itsbilal itsbilal requested review from a team August 23, 2017 21:01
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg
Copy link
Member

tbg commented Aug 23, 2017

Please make sure to unskip the tests 😄

@tbg
Copy link
Member

tbg commented Aug 23, 2017

Do you think this fixes #17869 as well?

panic: send on closed channel [recovered]
	panic: send on closed channel

goroutine 5027 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc42160e090, 0x7f293e7e8060, 0xc422173080)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:182 +0xb2
panic(0x193ad60, 0xc4213e2710)
	/usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*RowChannel).Push(0xc422194c80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x279cce0, 0xc4221b4c60, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:320 +0x1d8

If so, mind adding it to your PR description?

@itsbilal itsbilal force-pushed the distsql-cancel-race-fix branch from c1cbef1 to 53715bc Compare August 24, 2017 15:41
@itsbilal
Copy link
Contributor Author

After an offline chat with @andreimatei I decided to implement a specialized RowReceiver called a CancellableRowReceiver that can be cancelled asynchronously of the Push/ProducerDone calls, and use it in the distSQLReceiver case. Push() calls after a SetCancelled return a ConsumerClosed and set the internal error to query execution cancelled.

This eliminates race conditions in distSQLReceiver around the err field, and eliminates the need to pass a closure to Wait.

PTAL!

@andreimatei
Copy link
Contributor

LGTM

PR and commit messages should be identical.
You might need separate "fixes" lines for this to actually close the issues, I'm not sure.


Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/sql/distsql_running.go, line 228 at r2 (raw file):

	err error

	// cancelled is 1 when this distSQLReceiver has been asynchronously set

comment that this is supposed to be set atomically

"has been async set" -> async wrt what? I'd drop it.


pkg/sql/distsqlrun/base.go, line 96 at r2 (raw file):

// CancellableRowReceiver is a special type of a RowReceiver that can be set to
// cancelled asynchronously (i.e. concurrently or after Push()es and ProducerDone()s).
// Once cancelled, subsequent Push()es return ConsumerClosed.

hint who implements this and say something about why cancellation has to be signaled this way to the DistSQLReceiver but not to other consumers.


pkg/sql/distsqlplan/aggregator_funcs_test.go, line 61 at r1 (raw file):

	}
	flow.Start(ctx, func() {})
	flow.Wait(func() {})

comment literals inline, here and elsewhere


Comments from Reviewable

The race was being caused in the `RunSyncFlow()` case only: when the
flow's `syncFlowConsumer` is an outbox, not a distSQLReceiver.  If
`flow.cancel` runs after `outbox.ProducerDone()` has been called by a
router/processor, and tries to push an error into it,
the outbox panics since its `RowChannel` is already closed.

This PR fixes this race by adding the ability to mark the
`distSQLReceiver` on the gateway node as cancelled asynchronously of the
`Push`/`ProducerDone` calls, and more importantly, doing nothing when
the `syncFlowConsumer` is an outbox (or anything other than a
distSQLReceiver).

Fixes cockroachdb#17851, fixes cockroachdb#17864.
@itsbilal itsbilal force-pushed the distsql-cancel-race-fix branch from 53715bc to c92af32 Compare August 24, 2017 17:01
@itsbilal
Copy link
Contributor Author

Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions.


pkg/sql/distsql_running.go, line 228 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment that this is supposed to be set atomically

"has been async set" -> async wrt what? I'd drop it.

Done.


pkg/sql/distsqlrun/base.go, line 96 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

hint who implements this and say something about why cancellation has to be signaled this way to the DistSQLReceiver but not to other consumers.

Done.


pkg/sql/distsqlplan/aggregator_funcs_test.go, line 61 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment literals inline, here and elsewhere

Removed in latest revision.


Comments from Reviewable

@itsbilal itsbilal merged commit 344c551 into cockroachdb:master Aug 24, 2017
@itsbilal itsbilal deleted the distsql-cancel-race-fix branch August 24, 2017 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants