Stuck retrying Migration to 21.2-56: "populate RangeAppliedState.RaftAppliedIndexTerm for all ranges" #81961

wzrdtales · 2022-05-27T08:26:10Z

Describe the problem

After upgrading to 22.1, several migrations ran through fine. For the past 10 hours, however, it has been stuck on 21.2-56 and doesn't finish. (It is retrying for the 9th time now.)

To Reproduce

Upgrade to 22.1 on Kubernetes.

Jira issue: CRDB-16140

blathers-crl · 2022-05-27T08:26:13Z

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I was unable to automatically find someone to ping.

If we have not gotten back to your issue within a few business days, you can try the following:

Join our community slack channel and ask on #cockroachdb.
Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

wzrdtales · 2022-05-27T12:35:06Z

10th retry now...

wzrdtales · 2022-05-27T12:42:34Z

I went through the nodes and their logs and found this now on one node:

so this seems to be a display bug, but also the progress is very slow, not sure if intended. With that speed it will probably take around 3-4 days to finish.

irfansharif · 2022-06-07T14:35:56Z

With that speed it will probably take around 3-4 days to finish.

@wzrdtales How many total ranges do you have in this cluster? Judging from that screenshot we're churning through ~3800 ranges per minute. With the estimate of 3-4 days, are you saying this cluster has 16,416,000+ ranges?
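The arithmetic behind that figure can be sketched as follows. This is only an illustration: the ~3800 ranges per minute rate and the 3-4 day estimate come from the thread, while the function name and the assumption of a constant rate are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def ranges_processed(ranges_per_minute: int, days: float) -> int:
    """Ranges covered in `days`, assuming a constant per-minute rate.

    The constant-rate assumption is a simplification for illustration;
    the real migration may not progress evenly.
    """
    return round(ranges_per_minute * MINUTES_PER_DAY * days)


if __name__ == "__main__":
    # Rate read off the screenshot (~3800 ranges/min), over 3 and 4 days.
    print(f"{ranges_processed(3800, 3):,}")  # 16,416,000
    print(f"{ranges_processed(3800, 4):,}")  # 21,888,000
```

The lower bound of the estimate (3 days) gives the 16,416,000 figure quoted above.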

I'm more interested in the retry loop itself rather than how long each attempt takes. Do you happen to have logs around where these retries occur? If it's because of timeouts in applying the Migrate request, I wonder if we should bump kv.migration.migrate_application_timeout to give it ample time to proceed. Say to something like 5m?

but also the progress is very slow, not sure if intended

It is partially intended, though the retry behavior not so much. The pace is deliberately slow so that the internal migrations are non-disruptive to foreground traffic. We want to avoid a thundering herd of work on upgrades.

wzrdtales · 2022-06-07T15:19:29Z

I was calculating the whole runtime; it ended up needing around 24 hours in total, I think. It resolved on its own, but the weird status in the meantime left us worried (as you noticed :))

With the estimate of 3-4 days, are you saying this cluster has 16,416,000+ ranges?

Not yet that big, "only" 300k+ ranges.
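For comparison, a rough sketch of how long a single pass over 300k ranges would take at the rate mentioned earlier in the thread. The function name is hypothetical, and the calculation assumes the ~3800 ranges per minute rate held for the whole pass, which the thread does not confirm.

```python
def minutes_for_pass(total_ranges: int, ranges_per_minute: int) -> float:
    """Minutes needed to visit every range once at a constant rate."""
    if ranges_per_minute <= 0:
        raise ValueError("ranges_per_minute must be positive")
    return total_ranges / ranges_per_minute


if __name__ == "__main__":
    # 300k ranges (from this comment) at ~3800 ranges/min (from the screenshot).
    minutes = minutes_for_pass(300_000, 3800)
    print(f"about {minutes:.0f} minutes per pass")
```

Under that assumption a single uninterrupted pass is on the order of an hour or two, so the roughly 24 hours of total runtime would be dominated by the repeated attempts rather than by any one pass.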

irfansharif · 2022-06-07T15:34:44Z

I'm glad it resolved, so I'll close this issue. I'll add a suggestion to #72931 to let operators control the pacing here more directly, in case they want to increase the migrate parallelism for clusters with large range counts while keeping an eye on foreground impact. I'll also note that some of these migrations are intended to be long-running and to operate over the timescales you're observing: #48843.

wzrdtales added the C-bug label (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) May 27, 2022

blathers-crl bot added the O-community (Originated from the community) and X-blathers-untriaged (blathers was unable to find an owner) labels May 27, 2022

blathers-crl bot added the T-kv (KV Team) label May 27, 2022

yuzefovich removed the X-blathers-untriaged (blathers was unable to find an owner) label May 27, 2022

irfansharif self-assigned this Jun 1, 2022

irfansharif closed this as completed Jun 7, 2022
