release-22.1: spanconfig/job: improve retry behaviour under failures #78220
Backport 1/1 commits from #78117 on behalf of @irfansharif.
/cc @cockroachdb/release
Previously, if the reconciliation job failed (say, with retryable buffer
overflow errors from the sqlwatcher[^1]), we relied on the jobs
subsystem's backoff mechanism to re-kick the reconciliation job. That
retry loop, however, is far too coarse: it has a max backoff of 24h,
far too long for the span config reconciliation job. Instead, we can
control the retry behavior directly within the reconciliation job,
which is what this PR does. We still want to bound the number of
internal retries, possibly bouncing the job to elsewhere in the
cluster afterwards. To do so, we now rely on the spanconfig.Manager's
periodic checks (every 10m per node) -- we avoid the jobs subsystem's
retry loop by marking every error as a permanent one.
Release justification: low risk, high benefit change
Release note: None
Footnotes

[^1]: In future PRs we'll introduce tests adding 100k-1M tables in large
batches; when sufficiently large, it's possible to blow past the
sqlwatcher's rangefeed buffer limits on incremental updates. In
these scenarios we want to gracefully fail and recover by restarting
the reconciler and re-running the initial scan.