release-22.1: spanconfig/job: improve retry behaviour under failures #78220
Backport 1/1 commits from #78117 on behalf of @irfansharif.
/cc @cockroachdb/release
Previously, if the reconciliation job failed (say, with retryable buffer
overflow errors from the sqlwatcher[^1]), we relied on the jobs
subsystem's backoff mechanism to re-kick the reconciliation job. That
retry loop, however, is far too coarse: it has a max backoff of 24h,
far too long for the span config reconciliation job. Instead, we can
control the retry behavior directly within the reconciliation job,
which is what this PR does. We still want to bound the number of
internal retries, possibly bouncing the job to elsewhere in the
cluster afterwards. To do so, we now rely on the spanconfig.Manager's
periodic checks (every 10m per node) -- we avoid the jobs subsystem's
retry loop by marking every error as a permanent one.
Release justification: low risk, high benefit change
Release note: None
Footnotes

[^1]: In future PRs we'll introduce tests adding 100k-1M tables in large
batches; when sufficiently large, it's possible to blow past the
sqlwatcher's rangefeed buffer limits on incremental updates. In
these scenarios we want to gracefully fail and recover by restarting
the reconciler and re-running the initial scan.