Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/distsqlrun: drain the source if the dest requests draining #22832

Merged
merged 1 commit into from
Feb 20, 2018

Conversation

petermattis
Copy link
Collaborator

Fix a bug in Run where the source was not drainined and closed if the
producer requested draining.

Closes #22824
Fixes #22655
Fixes #22654
Fixes #22642

Release note: None

@petermattis petermattis requested review from a team February 20, 2018 13:57
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@petermattis
Copy link
Collaborator Author

Review status: 0 of 2 files reviewed at latest revision, all discussions resolved, all commit checks successful.


pkg/sql/run_control_test.go, line 271 at r1 (raw file):

		!sqlbase.IsQueryCanceledError(err) &&
		!strings.Contains(err.Error(), "canceled") && // TODO(andrei): this shouldn't be necessary
		!strings.Contains(err.Error(), "rpc error") { // TODO(andrei): this shouldn't be necessary

@andreimatei FYI, these seem to have appeared due to #22277. Seems like the new executor is no longer properly returning query canceled error in all cases.


Comments from Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@asubiotto
Copy link
Contributor

:lgtm: Thanks for fixing this


Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/distsqlrun/base.go, line 154 at r1 (raw file):

			case DrainRequested:
				DrainAndForwardMetadata(ctx, src, dst)
				dst.ProducerDone()

You could remove this ProducerDone and return here and below and let the execution path fall through to the ProducerDone and return at the end.


Comments from Reviewable

@rjnn
Copy link
Contributor

rjnn commented Feb 20, 2018

:lgtm:


Reviewed 2 of 2 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/distsqlrun/base.go, line 154 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

You could remove this ProducerDone and return here and below and let the execution path fall through to the ProducerDone and return at the end.

If you do that, you could do that for the ConsumerClosed case below as well.


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/run_control_test.go, line 271 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

@andreimatei FYI, these seem to have appeared due to #22277. Seems like the new executor is no longer properly returning query canceled error in all cases.

Let me see what differences are with what errors were generated before.
Please don't merge with such a TODO with my name :P


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/run_control_test.go, line 271 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Let me see what differences are with what errors were generated before.
Please don't merge with such a TODO with my name :P

One of the errors I see returned here is pq: could not cancel query 151514c864d130df0000000000000001: query ID 151514c864d130df0000000000000001 not found. So something's amiss.
But this test generally seems funky to me. Looking some more.


Comments from Reviewable

@petermattis
Copy link
Collaborator Author

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/run_control_test.go, line 271 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

One of the errors I see returned here is pq: could not cancel query 151514c864d130df0000000000000001: query ID 151514c864d130df0000000000000001 not found. So something's amiss.
But this test generally seems funky to me. Looking some more.

Yeah, either the test is funky or the new executor is. Let me try running with the old executor. I'll hold off on merging until you give this a thumbs-up.


pkg/sql/distsqlrun/base.go, line 154 at r1 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

If you do that, you could do that for the ConsumerClosed case below as well.

I tried making that change right now, and while it is correct, the control flow gets more confusing. I'd prefer to leave this as-is. I feel the original problem here was trying to get too clever with control flow, so perhaps I'm overreacting here.


Comments from Reviewable

@rjnn
Copy link
Contributor

rjnn commented Feb 20, 2018

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/distsqlrun/base.go, line 154 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I tried making that change right now, and while it is correct, the control flow gets more confusing. I'd prefer to leave this as-is. I feel the original problem here was trying to get too clever with control flow, so perhaps I'm overreacting here.

Leaving it as is seems fine to me. There's no benefit except reducing the LOC, which isn't really a worthwhile metric to optimize on.


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/run_control_test.go, line 271 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Yeah, either the test is funky or the new executor is. Let me try running with the old executor. I'll hold off on merging until you give this a thumbs-up.

I was fooling myself before - this test races a cancelation against a query and sometimes the query finishes before the cancelation does anything. That's what the error I pasted was about, and apparently that's expected and the test is cool with that.
Now trying to understand what error(s) should really be expected.


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/run_control_test.go, line 271 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I was fooling myself before - this test races a cancelation against a query and sometimes the query finishes before the cancelation does anything. That's what the error I pasted was about, and apparently that's expected and the test is cool with that.
Now trying to understand what error(s) should really be expected.

What should be expected I still have to figure out, but I know the difference between the old and new code. The Executor had this code that prepended a message to all errors containing the string "context canceled". I removed it because it seemed like a bad idea.

// If error contains a context cancellation error, wrap it with a

I'll figure something out.


Comments from Reviewable

@petermattis
Copy link
Collaborator Author

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/run_control_test.go, line 271 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

What should be expected I still have to figure out, but I know the difference between the old and new code. The Executor had this code that prepended a message to all errors containing the string "context canceled". I removed it because it seemed like a bad idea.

// If error contains a context cancellation error, wrap it with a

I'll figure something out.

Ack. Do you want me to merge this PR as-is? Hold off on merging? Revert the changes to this file?


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/run_control_test.go, line 271 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ack. Do you want me to merge this PR as-is? Hold off on merging? Revert the changes to this file?

Hold off for a bit pls.


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/sql/run_control_test.go, line 269 at r1 (raw file):

	// does not imply that the query was canceled.
	if err := <-errChan; err != nil &&
		!sqlbase.IsQueryCanceledError(err) &&

btw, this sqlbase.IsQueryCanceledError() check is LOL - it's a combination of a structured check that works for server-side errors and text matching that sometimes works for client-side errors.


Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/sql/run_control_test.go, line 271 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Hold off for a bit pls.

the crux of the issue here, I believe, is that, upon (DistSQL) query cancelation, a query's context gets canceled. Depending on how far along the flow setup has gone, I think one of the following things can happen:

  • the gateway node has a "sync" Flow. If that flow is blocked in its Wait() method, that method watches for this ctx cancelation and calls Flow.cancel(), which remembers an error that will be later returned to the client. This is, I think, the only code path that actually results in the correct pg error code being returned.
  • if, however, the upstream flows are still being setup when the ctx is canceled, the gateway's flow gets an RPC error from the SetupFlowRequests that it sends to its upstreams. If that's the case, this RPC errors ends up being the one returned.
  • if the upstream flows have been setup, I think it's also possible for an error to be generated in processInboundStreamHelper - that guy also checks for context cancelation. I think this guy doesn't strictly need to be checking for ctx cancelation, as a ConsumerClosed code would eventually be propagated from the downstream, but it's probably a good idea to have it still.

The race itself isn't a problem, but the fact that it can result in different errors is. I think there's two ways to go about it - either ensure that the query's error is set before we set off the race for observing context cancelation, and make sure that once it's set, it stays frozen or the opposite - let things race, but introduce a way to overwrite a query's error once things have settled down.
I'm leaning towards the former.

cc @abhimadan if any of this is helpful to the stuff he's looking at


Comments from Reviewable

Fix a bug in `Run` where the source was not drainined and closed if the
producer requested draining.

Fixes cockroachdb#22665
Fixes cockroachdb#22642
See cockroachdb#22654

Release note: None
@jordanlewis
Copy link
Member

Reviewed 1 of 2 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/sql/distsqlrun/base.go, line 154 at r1 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

Leaving it as is seems fine to me. There's no benefit except reducing the LOC, which isn't really a worthwhile metric to optimize on.

This logic is the cause of the panic in #22772, so I'm enlisting you to fix it. :-)

In the scenario where src.Next() returns an error in its meta (which implies that the src has closed itself), and dst.Push returns DrainRequested, we end up calling src.Next() on a processor that's already been closed because of the DrainAndForwardMetadata step.

My first thought is that calling DrainAndForwardMetadata here needs to be guarded by if meta.Err != nil but that might be broken for other reasons.

In general, the contract around this stuff is confusing and may have changed. The new contract seems to be "you may not call Next again if it returned an error in its metadata" but that contract is not written in the GoDoc for Next, leading me to believe this wasn't always the contract.


Comments from Reviewable

@petermattis
Copy link
Collaborator Author

Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/sql/distsqlrun/base.go, line 154 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

This logic is the cause of the panic in #22772, so I'm enlisting you to fix it. :-)

In the scenario where src.Next() returns an error in its meta (which implies that the src has closed itself), and dst.Push returns DrainRequested, we end up calling src.Next() on a processor that's already been closed because of the DrainAndForwardMetadata step.

My first thought is that calling DrainAndForwardMetadata here needs to be guarded by if meta.Err != nil but that might be broken for other reasons.

In general, the contract around this stuff is confusing and may have changed. The new contract seems to be "you may not call Next again if it returned an error in its metadata" but that contract is not written in the GoDoc for Next, leading me to believe this wasn't always the contract.

Ack. Pretty sure that #22772 is a different issue. I'll take a look.


Comments from Reviewable

@petermattis petermattis merged commit 809b8fe into cockroachdb:master Feb 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants