spanconfig: checkpoint the reconciliation job and retry eagerly when possible #73694

Closed
1 of 3 tasks
Tracked by #81009 ...
irfansharif opened this issue Dec 10, 2021 · 1 comment
Labels
A-zone-configs C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity X-stale

Comments

irfansharif (Contributor) commented Dec 10, 2021

This is the tracking issue for follow-on work from #71994. Specifically we want to:

  • Checkpoint the spanconfig.Reconciler's incremental progress
  • Use the checkpoint to (possibly) avoid reconciling from scratch if the job fails for any reason, including pod shutdown
  • Ensure that the reconciliation job opportunistically re-attempts reconciliation if it runs into the (unlikely) rangefeed errors surfaced in "rangefeed: surface unrecoverable errors and don't hopelessly retry" (#73086). These errors indicate that we were attempting to establish a rangefeed, with diffs, at a timestamp that was already GC-ed. Bouncing the reconciler again immediately, instead of failing the whole job, seems like saner recovery behavior.

Jira issue: CRDB-11696

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-zone-configs labels Dec 10, 2021
@irfansharif irfansharif self-assigned this Apr 20, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 22, 2022
Fixes cockroachdb#75831, an annoying bug in the intersection between the span
configs infrastructure + backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, an invariant the span configs infrastructure relies on.
This PR simply papers over this mismatch, kicking off a full
reconciliation process to recover if it occurs. Doing something "better"
is a lot more invasive, the options being:
- pausing the reconciliation job during restore (prototyped in cockroachdb#80339);
- observing a reconciler checkpoint in the restore job (this would work
  since we would have flushed out RESTORE's descriptor deletions and
  separately handled the RESTORE's descriptor additions -- them having
  different types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing
  schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for cockroachdb#73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:
1. We'll use checkpoints and incrementally reconcile during job-internal
   retries (added in cockroachdb#78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
   itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
restart was caused by RESTORE-induced errors. This is a bit
unfortunate, and if we want to improve on (2), we'd have to persist job
state (think "poison pill") that ensures that we ignore the persisted
checkpoint. As of this PR, the only use of job-persisted checkpoints is in
the migrations rolling out this infrastructure. That said, now we'll
have a mechanism to force a full reconciliation attempt -- we can:

   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION';

   -- pause and resume the job to force a full reconciliation
   PAUSE JOB $job_id;
   RESUME JOB $job_id;

Release note: None
craig bot pushed a commit that referenced this issue Apr 26, 2022
79379: kvserver: avoid races where replication changes can get interrupted r=aayushshah15 a=aayushshah15

This commit adds a safeguard inside
`Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing
a learner replica _when we know_ that it is in the process of
receiving its initial snapshot (as indicated by an in-memory lock on log
truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange`
calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes #57129
Relates to #79118

Release note: none

Jira issue: CRDB-14769

79853: changefeedccl: support a CSV format for changefeeds r=sherman-grewal a=sherman-grewal

In this PR, we introduce a new CSV format for changefeeds.
Note that this format is only supported with the
initial_scan='only' option. For instance, one can now
execute:

CREATE CHANGEFEED FOR foo WITH format=csv, initial_scan='only';

Release note (enterprise change): Support a CSV format for
changefeeds. Only works with initial_scan='only', and
does not work with diff/resolved options.

80397: spanconfig: handle mismatched desc types post-restore r=irfansharif a=irfansharif

Fixes #75831, an annoying bug in the intersection between the span
configs infrastructure + backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, an invariant the span configs infrastructure relies on.
This PR simply papers over this mismatch, kicking off a full
reconciliation process to recover if it occurs. Doing something "better"
is a lot more invasive, the options being:
- pausing the reconciliation job during restore (prototyped in #80339);
- observing a reconciler checkpoint in the restore job (this would work
  since we would have flushed out RESTORE's descriptor deletions and
  separately handled the RESTORE's descriptor additions -- them having
  different types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing
  schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for #73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:
1. We'll use checkpoints and incrementally reconcile during job-internal
   retries (added in #78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
   itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
restart was caused by RESTORE-induced errors. This is a bit
unfortunate, and if we want to improve on (2), we'd have to persist job
state (think "poison pill") that ensures that we ignore the persisted
checkpoint. As of this PR, the only use of job-persisted checkpoints is in
the migrations rolling out this infrastructure. That said, now we'll
have a mechanism to force a full reconciliation attempt -- we can:

```
   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION';

   -- pause and resume the job to force a full reconciliation
   PAUSE JOB $job_id;
   RESUME JOB $job_id;
```

Release note: None

80410: ui: display closed sessions, add username and session status filter r=gtr a=gtr

Fixes #67888, #79914.

Previously, the sessions page UI did not support displaying closed
sessions and did not support the ability to filter by username or
session status. This commit adds the "Closed" session status to closed
sessions and adds the ability to filter by username and session status.

Session Status:
https://user-images.githubusercontent.com/35943354/164794955-5a48d6c2-589d-4f05-b476-b30b114662ee.mov

Usernames:
https://user-images.githubusercontent.com/35943354/164797165-f00f9760-7127-4f2a-96bd-88f691395693.mov

Release note (ui change): sessions overview and session details pages now
display closed sessions; sessions overview page now has username and session
status filters

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Sherman Grewal <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Gerardo Torres <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Apr 27, 2022
@irfansharif irfansharif removed their assignment Apr 29, 2022

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, or stale) Dec 4, 2023
2 participants