spanconfig: checkpoint the reconciliation job and retry eagerly when possible #73694

irfansharif · 2021-12-10T20:35:36Z

This is the tracking issue for follow-on work from #71994. Specifically we want to:

- Checkpoint the `spanconfig.Reconciler`'s incremental progress.
- Use the checkpoint to (possibly) avoid the work of reconciling from scratch if the job fails for any reason, including pod shutdown.
- Ensure that the reconciliation job opportunistically re-attempts reconciliation if it runs into the (unlikely) rangefeed errors surfaced in #73086 ("rangefeed: surface unrecoverable errors and don't hopelessly retry"). These errors indicate that we were attempting to establish a rangefeed, with diffs, at a timestamp that was already GC-ed. Bouncing the reconciler again immediately instead of failing the whole job seems like saner recovery behavior.
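A minimal Go sketch of the eager-retry idea, under stated assumptions: `errTimestampGCed`, `attemptFn`, and `reconcileWithRetries` are hypothetical names rather than the actual `spanconfig` API, and timestamps are simplified to `int64`. It also assumes that a retry after a GC-ed timestamp falls back to reconciling from scratch, which this issue does not spell out.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimestampGCed is a hypothetical sentinel for the unrecoverable rangefeed
// error: the requested start timestamp has already been garbage collected.
var errTimestampGCed = errors.New("rangefeed start timestamp already GC-ed")

// attemptFn is a hypothetical stand-in for one reconciliation attempt. It
// starts from checkpoint (0 means "from scratch") and returns the timestamp
// it reconciled up to.
type attemptFn func(checkpoint int64) (int64, error)

// reconcileWithRetries re-attempts reconciliation immediately when it sees
// errTimestampGCed instead of failing the whole job. Any other error is
// returned to the caller unchanged.
func reconcileWithRetries(attempt attemptFn, checkpoint int64, maxAttempts int) (int64, error) {
	for i := 0; i < maxAttempts; i++ {
		next, err := attempt(checkpoint)
		if err == nil {
			return next, nil // the new checkpoint to persist
		}
		if !errors.Is(err, errTimestampGCed) {
			return checkpoint, err
		}
		// Assumption: a GC-ed checkpoint cannot be resumed from, so the
		// next attempt starts from scratch.
		checkpoint = 0
	}
	return checkpoint, fmt.Errorf("no success after %d attempts: %w", maxAttempts, errTimestampGCed)
}

func main() {
	calls := 0
	attempt := func(checkpoint int64) (int64, error) {
		calls++
		if checkpoint != 0 {
			return 0, errTimestampGCed // the stale checkpoint is rejected
		}
		return 100, nil // the full reconciliation succeeds
	}
	ts, err := reconcileWithRetries(attempt, 42, 3)
	fmt.Println(ts, err, calls) // prints: 100 <nil> 2
}
```

Bounding the loop with `maxAttempts` is a design choice in this sketch only; it keeps a persistent failure from spinning forever while still letting the common case recover without bouncing the job.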

Fixes cockroachdb#75831, an annoying bug in the intersection between the span configs infrastructure + backup/restore. It's possible to observe mismatched descriptor types for the same ID post-RESTORE, violating an invariant the span configs infrastructure relies on. This PR simply papers over this mismatch, kicking off a full reconciliation process to recover if it occurs.

Doing something "better" is a lot more invasive, the options being:

- pausing the reconciliation job during restore (prototyped in cockroachdb#80339);
- observing a reconciler checkpoint in the restore job (which would work since we would have flushed out RESTORE's descriptor deletions and would separately handle RESTORE's descriptor additions -- them having different types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing schema objects.

While here, we add a bit of plumbing/testing to make the future work/testing for cockroachdb#73694 (using reconciler checkpoints on retries) easier. This PR also sets the stage for the following pattern around use of checkpoints:

1. We'll use checkpoints and incrementally reconcile during job-internal retries (added in cockroachdb#78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job itself is bounced around.

We do this because we need to fully reconcile across job restarts if the reason for the restart is due to RESTORE-induced errors. This is a bit unfortunate, and if we want to improve on (2), we'd have to persist job state (think "poison pill") that ensures that we ignore the persisted checkpoint. As of this PR, the only use of job-persisted checkpoints is in the migrations rolling out this infrastructure. That said, now we'll have a mechanism to force a full reconciliation attempt -- we can:

    -- get $job_id
    SELECT job_id FROM [SHOW AUTOMATIC JOBS] WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION';
    PAUSE JOB $job_id;
    RESUME JOB $job_id;

Release note: None
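The checkpoint pattern in the commit message above can be sketched roughly as follows. This is an illustration only: `retryKind`, `startingCheckpoint`, and the `int64` timestamps are hypothetical simplifications, not the job's actual code.

```go
package main

import "fmt"

// retryKind distinguishes the two (hypothetical) ways an attempt is re-run.
type retryKind int

const (
	jobInternalRetry retryKind = iota // retry loop inside the running job
	jobRestart                        // the job itself was bounced, e.g. PAUSE/RESUME
)

// startingCheckpoint returns the timestamp to reconcile from; 0 means a full
// reconciliation. Checkpoints are honored only for job-internal retries, and
// a mismatched-descriptor-types error always forces a full reconciliation.
func startingCheckpoint(kind retryKind, lastCheckpoint int64, mismatchedDescTypes bool) int64 {
	if kind == jobRestart || mismatchedDescTypes {
		return 0
	}
	return lastCheckpoint
}

func main() {
	fmt.Println(startingCheckpoint(jobInternalRetry, 42, false)) // 42: incremental
	fmt.Println(startingCheckpoint(jobInternalRetry, 42, true))  // 0: full, post-RESTORE mismatch
	fmt.Println(startingCheckpoint(jobRestart, 42, false))       // 0: full, job was bounced
}
```

Ignoring the checkpoint on every restart is what makes the PAUSE JOB / RESUME JOB sequence usable as a way to force a full reconciliation.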

79379: kvserver: avoid races where replication changes can get interrupted r=aayushshah15 a=aayushshah15

This commit adds a safeguard inside `Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing a learner replica _when we know_ that it is in the process of receiving its initial snapshot (as indicated by an in-memory lock on log truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange` calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes #57129
Relates to #79118

Release note: none

Jira issue: CRDB-14769

79853: changefeedccl: support a CSV format for changefeeds r=sherman-grewal a=sherman-grewal

In this PR, we introduce a new CSV format for changefeeds. Note that this format is only supported with the initial_scan='only' option. For instance, one can now execute: `CREATE CHANGEFEED FOR foo WITH format=csv, initial_scan='only';`

Release note (enterprise change): Support a CSV format for changefeeds. Only works with initial_scan='only', and does not work with diff/resolved options.

80397: spanconfig: handle mismatched desc types post-restore r=irfansharif a=irfansharif

Fixes #75831, an annoying bug in the intersection between the span configs infrastructure + backup/restore. It's possible to observe mismatched descriptor types for the same ID post-RESTORE, violating an invariant the span configs infrastructure relies on. This PR simply papers over this mismatch, kicking off a full reconciliation process to recover if it occurs.

Doing something "better" is a lot more invasive, the options being:

- pausing the reconciliation job during restore (prototyped in #80339);
- observing a reconciler checkpoint in the restore job (which would work since we would have flushed out RESTORE's descriptor deletions and would separately handle RESTORE's descriptor additions -- them having different types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing schema objects.

While here, we add a bit of plumbing/testing to make the future work/testing for #73694 (using reconciler checkpoints on retries) easier. This PR also sets the stage for the following pattern around use of checkpoints:

1. We'll use checkpoints and incrementally reconcile during job-internal retries (added in #78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job itself is bounced around.

We do this because we need to fully reconcile across job restarts if the reason for the restart is due to RESTORE-induced errors. This is a bit unfortunate, and if we want to improve on (2), we'd have to persist job state (think "poison pill") that ensures that we ignore the persisted checkpoint. As of this PR, the only use of job-persisted checkpoints is in the migrations rolling out this infrastructure. That said, now we'll have a mechanism to force a full reconciliation attempt -- we can:

```
-- get $job_id
SELECT job_id FROM [SHOW AUTOMATIC JOBS] WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION';
PAUSE JOB $job_id;
RESUME JOB $job_id;
```

Release note: None

80410: ui: display closed sessions, add username and session status filter r=gtr a=gtr

Fixes #67888, #79914.

Previously, the sessions page UI did not support displaying closed sessions and did not support the ability to filter by username or session status. This commit adds the "Closed" session status to closed sessions and adds the ability to filter by username and session status.

Session Status: https://user-images.githubusercontent.com/35943354/164794955-5a48d6c2-589d-4f05-b476-b30b114662ee.mov

Usernames: https://user-images.githubusercontent.com/35943354/164797165-f00f9760-7127-4f2a-96bd-88f691395693.mov

Release note (ui change): sessions overview and session details pages now display closed sessions; sessions overview page now has username and session status filters

Co-authored-by: Aayush Shah
Co-authored-by: Sherman Grewal
Co-authored-by: irfan sharif
Co-authored-by: Gerardo Torres

github-actions · 2023-11-20T11:05:39Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

irfansharif added the C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) and A-zone-configs labels Dec 10, 2021

irfansharif mentioned this issue Dec 15, 2021

spanconfig: harden infrastructure for v22.1 #73874

Closed

24 tasks

irfansharif self-assigned this Apr 20, 2022

irfansharif mentioned this issue Apr 20, 2022

spanconfig: assertion failure in sqlconfigwatcher.combine #75831

Closed

irfansharif mentioned this issue Apr 22, 2022

spanconfig: handle mismatched desc types post-restore #80397

Merged

blathers-crl bot mentioned this issue Apr 27, 2022

release-22.1: spanconfig: handle mismatched desc types post-restore #80603

Merged

irfansharif removed their assignment Apr 29, 2022

irfansharif mentioned this issue May 4, 2022

spanconfig: miscellaneous improvements/TODOs #81009

Open

13 tasks

jlinder added sync-me-3 and removed sync-me-3 labels May 24, 2022

ajwerner mentioned this issue Nov 1, 2022

spanconfigccl: full translation is O(Databases * Descriptors) #90655

Open

github-actions bot added the no-issue-activity label Nov 20, 2023

github-actions bot added the X-stale label Dec 4, 2023

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 4, 2023

exalate-issue-sync bot closed this as completed Dec 4, 2023
