Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: intent_resolver_ir_batcher.sendBatch timed out after 30s #36953

Closed
vivekmenezes opened this issue Apr 19, 2019 · 5 comments
Closed

core: intent_resolver_ir_batcher.sendBatch timed out after 30s #36953

vivekmenezes opened this issue Apr 19, 2019 · 5 comments
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-investigation Further steps needed to qualify. C-label will change.

Comments

@vivekmenezes
Copy link
Contributor

E190418 14:30:01.900300 1 workload/cli/run.go:428  error in delivery: ERROR: operation "intent_resolver_ir_batcher.sendBatch" timed out after 30s (SQLSTATE XXUUU)

I'm seeing the above in https://teamcity.cockroachdb.com/viewLog.html?buildId=1247383&tab=buildLog

@vivekmenezes vivekmenezes added A-kv-transactions Relating to MVCC and the transactional model. C-investigation Further steps needed to qualify. C-label will change. labels Apr 19, 2019
@ajwerner
Copy link
Contributor

I just added back this timeout #36845.

I'm curious if the problem is that the returned error's cause is not context.DeadlineExceeded or if we would have hit this before in a sorely overloaded cluster. Should this failure be retried internally?

I'll investigate further to figure out how to error makes its way back to the client.

@ajwerner
Copy link
Contributor

Had a conversation with @nvanbenschoten and @andreimatei about this. It seems inconsistent to return an error to a client due to an internal timeout, we certainly don't do that for general range unavailability. This seems to be a long-running conversation.

Interestingly we had these timeouts on intent resolution for a good long while and I suspect if you were to run this overloaded situation in 2.1 (see https://github.com/ajwerner/cockroach/blame/295b6ae3142518f04a5771c79ce171043697ee1f/pkg/storage/intent_resolver.go#L991) you would see the same timeout.

One proposed solution is to only add a timeout to intent resolution batches we know to consist only of asynchronous requests. Another option is to revert the timeout change. We should ask ourselves why we care about hung intent resolution. On some level they represent a leak but we do have some mechanisms in the intent resolver to push back the total amount of async intent resolution. @jordanlewis you specifically were concerned about them in #36806 (comment). I'm happy to type up the first but want to reach consensus on the right course of action before making a PR.

@jordanlewis
Copy link
Member

I like the idea of adding the timeout only in the case of asynchronous intent resolution, but I don't have a serious horse in the race. I only was doing some drive by observation on those suspiciously hung goroutines.

@vivekmenezes
Copy link
Contributor Author

I agree that we shouldn't be returning an error to the client here. How about you discuss this with other core folks and come up with a solution. Thanks

ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 25, 2019
This commit respects the caller's deadline when sending intent resolution
batches. It also sets a timeout using the previous value applied to all
batches to asynchronous intent resolution calls.

Fixes cockroachdb#36953.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 25, 2019
This commit respects the caller's deadline when sending intent resolution
batches. It also sets a timeout using the previous value applied to all
batches to asynchronous intent resolution calls.

Fixes cockroachdb#36953.

Release note: None
@ajwerner
Copy link
Contributor

Closed by #37103.

ajwerner added a commit to ajwerner/cockroach that referenced this issue May 15, 2019
This commit respects the caller's deadline when sending intent resolution
batches. It also sets a timeout using the previous value applied to all
batches to asynchronous intent resolution calls.

Fixes cockroachdb#36953.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-investigation Further steps needed to qualify. C-label will change.
Projects
None yet
Development

No branches or pull requests

3 participants