-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: intent_resolver_ir_batcher.sendBatch timed out after 30s #36953
Comments
I just added back this timeout #36845. I'm curious if the problem is that the returned error's cause is not context.DeadlineExceeded or if we would have hit this before in a sorely overloaded cluster. Should this failure be retried internally? I'll investigate further to figure out how to error makes its way back to the client. |
Had a conversation with @nvanbenschoten and @andreimatei about this. It seems inconsistent to return an error to a client due to an internal timeout, we certainly don't do that for general range unavailability. This seems to be a long-running conversation. Interestingly we had these timeouts on intent resolution for a good long while and I suspect if you were to run this overloaded situation in 2.1 (see https://github.com/ajwerner/cockroach/blame/295b6ae3142518f04a5771c79ce171043697ee1f/pkg/storage/intent_resolver.go#L991) you would see the same timeout. One proposed solution is to only add a timeout to intent resolution batches we know to consist only of asynchronous requests. Another option is to revert the timeout change. We should ask ourselves why we care about hung intent resolution. On some level they represent a leak but we do have some mechanisms in the intent resolver to push back the total amount of async intent resolution. @jordanlewis you specifically were concerned about them in #36806 (comment). I'm happy to type up the first but want to reach consensus on the right course of action before making a PR. |
I like the idea of adding the timeout only in the case of asynchronous intent resolution, but I don't have a serious horse in the race. I only was doing some drive by observation on those suspiciously hung goroutines. |
I agree that we shouldn't be returning an error to the client here. How about you discuss this with other core folks and come up with a solution. Thanks |
This commit respects the caller's deadline when sending intent resolution batches. It also sets a timeout using the previous value applied to all batches to asynchronous intent resolution calls. Fixes cockroachdb#36953. Release note: None
This commit respects the caller's deadline when sending intent resolution batches. It also sets a timeout using the previous value applied to all batches to asynchronous intent resolution calls. Fixes cockroachdb#36953. Release note: None
Closed by #37103. |
This commit respects the caller's deadline when sending intent resolution batches. It also sets a timeout using the previous value applied to all batches to asynchronous intent resolution calls. Fixes cockroachdb#36953. Release note: None
I'm seeing the above in https://teamcity.cockroachdb.com/viewLog.html?buildId=1247383&tab=buildLog
The text was updated successfully, but these errors were encountered: