core: intent_resolver_ir_batcher.sendBatch timed out after 30s #36953

vivekmenezes · 2019-04-19T13:29:04Z

E190418 14:30:01.900300 1 workload/cli/run.go:428  error in delivery: ERROR: operation "intent_resolver_ir_batcher.sendBatch" timed out after 30s (SQLSTATE XXUUU)

I'm seeing the above in https://teamcity.cockroachdb.com/viewLog.html?buildId=1247383&tab=buildLog

The text was updated successfully, but these errors were encountered:

ajwerner · 2019-04-19T18:10:40Z

I just added back this timeout #36845.

I'm curious if the problem is that the returned error's cause is not context.DeadlineExceeded or if we would have hit this before in a sorely overloaded cluster. Should this failure be retried internally?

I'll investigate further to figure out how to error makes its way back to the client.

ajwerner · 2019-04-19T19:31:53Z

Had a conversation with @nvanbenschoten and @andreimatei about this. It seems inconsistent to return an error to a client due to an internal timeout, we certainly don't do that for general range unavailability. This seems to be a long-running conversation.

Interestingly we had these timeouts on intent resolution for a good long while and I suspect if you were to run this overloaded situation in 2.1 (see https://github.com/ajwerner/cockroach/blame/295b6ae3142518f04a5771c79ce171043697ee1f/pkg/storage/intent_resolver.go#L991) you would see the same timeout.

One proposed solution is to only add a timeout to intent resolution batches we know to consist only of asynchronous requests. Another option is to revert the timeout change. We should ask ourselves why we care about hung intent resolution. On some level they represent a leak but we do have some mechanisms in the intent resolver to push back the total amount of async intent resolution. @jordanlewis you specifically were concerned about them in #36806 (comment). I'm happy to type up the first but want to reach consensus on the right course of action before making a PR.

jordanlewis · 2019-04-19T23:05:04Z

I like the idea of adding the timeout only in the case of asynchronous intent resolution, but I don't have a serious horse in the race. I only was doing some drive by observation on those suspiciously hung goroutines.

vivekmenezes · 2019-04-22T16:31:33Z

I agree that we shouldn't be returning an error to the client here. How about you discuss this with other core folks and come up with a solution. Thanks

This commit respects the caller's deadline when sending intent resolution batches. It also sets a timeout using the previous value applied to all batches to asynchronous intent resolution calls. Fixes cockroachdb#36953. Release note: None

ajwerner · 2019-04-26T20:29:10Z

Closed by #37103.

This commit respects the caller's deadline when sending intent resolution batches. It also sets a timeout using the previous value applied to all batches to asynchronous intent resolution calls. Fixes cockroachdb#36953. Release note: None

vivekmenezes assigned ajwerner Apr 19, 2019

vivekmenezes added A-kv-transactions Relating to MVCC and the transactional model. C-investigation Further steps needed to qualify. C-label will change. labels Apr 19, 2019

ajwerner closed this as completed Apr 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

core: intent_resolver_ir_batcher.sendBatch timed out after 30s #36953

core: intent_resolver_ir_batcher.sendBatch timed out after 30s #36953

vivekmenezes commented Apr 19, 2019

ajwerner commented Apr 19, 2019

ajwerner commented Apr 19, 2019

jordanlewis commented Apr 19, 2019

vivekmenezes commented Apr 22, 2019

ajwerner commented Apr 26, 2019

core: intent_resolver_ir_batcher.sendBatch timed out after 30s #36953

core: intent_resolver_ir_batcher.sendBatch timed out after 30s #36953

Comments

vivekmenezes commented Apr 19, 2019

ajwerner commented Apr 19, 2019

ajwerner commented Apr 19, 2019

jordanlewis commented Apr 19, 2019

vivekmenezes commented Apr 22, 2019

ajwerner commented Apr 26, 2019