Stuck retrying Migration to 21.2-56: "populate RangeAppliedState.RaftAppliedIndexTerm for all ranges" #81961

Closed
wzrdtales opened this issue May 27, 2022 · 6 comments
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-kv KV Team

Comments

@wzrdtales

wzrdtales commented May 27, 2022

Describe the problem

After upgrading to 22.1, several migrations ran through fine. For the past 10 hours, however, the upgrade has been stuck on 21.2-56 and does not finish it. (It is now retrying for the 9th time.)

To Reproduce

Upgrade to 22.1

On Kubernetes.

Jira issue: CRDB-16140

@wzrdtales wzrdtales added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 27, 2022
@blathers-crl

blathers-crl bot commented May 27, 2022

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I was unable to automatically find someone to ping.

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-untriaged blathers was unable to find an owner labels May 27, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 27, 2022
@wzrdtales
Author

10th retry now...

@wzrdtales
Author

I went through the nodes and their logs and found this now on one node:

[image: screenshot from the node's logs]

so this seems to be a display bug, but also the progress is very slow, not sure if intended. With that speed it will probably take around 3-4 days to finish.

@yuzefovich yuzefovich removed the X-blathers-untriaged blathers was unable to find an owner label May 27, 2022
@irfansharif irfansharif self-assigned this Jun 1, 2022
@irfansharif
Contributor

With that speed it will probably take around 3-4 days to finish.

@wzrdtales How many total ranges do you have in this cluster? Judging from that screenshot we're churning through ~3800 ranges per minute. With the estimate of 3-4 days, are you saying this cluster has 16,416,000+ ranges?
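
For reference, the back-of-the-envelope arithmetic behind that figure can be sketched as below. The function name is illustrative only; the inputs are the ~3800 ranges per minute read off the screenshot and the lower bound (3 days) of the estimate, and the rate is assumed to stay constant.

```python
MINUTES_PER_DAY = 24 * 60

def ranges_processed(ranges_per_minute: int, days: int) -> int:
    """Ranges covered at a constant rate over a whole number of days."""
    return ranges_per_minute * MINUTES_PER_DAY * days

# ~3800 ranges/minute sustained for 3 days
print(ranges_processed(3800, 3))  # 16416000
```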

I'm more interested in the retry loop itself than in how long each attempt takes. Do you happen to have logs from around the time these retries occur? If they're caused by timeouts in applying the Migrate request, I wonder if we should bump kv.migration.migrate_application_timeout to give it ample time to proceed, say to something like 5m?

but also the progress is very slow, not sure if intended

It is partially intended, though the retry behavior not so much. The slow pace is deliberate: it throttles the internal migrations so that they're non-disruptive to foreground traffic. We want to avoid a thundering herd of work on upgrades.
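
To illustrate the general idea of pacing (this is a minimal sketch, not CockroachDB's actual implementation), the work can be split into small batches with a pause between them so other traffic gets room to run. The function name, batch size, and pause length here are all hypothetical.

```python
import time
from typing import Callable, List

def paced_batches(items: List[int], batch_size: int, pause_seconds: float,
                  apply: Callable[[List[int]], None],
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Apply `apply` to fixed-size batches, pausing between consecutive batches.

    Returns the number of batches processed.
    """
    batches = 0
    for start in range(0, len(items), batch_size):
        apply(items[start:start + batch_size])
        batches += 1
        # Pause only if more work remains, so the last batch returns promptly.
        if start + batch_size < len(items):
            sleep(pause_seconds)
    return batches

# Hypothetical usage: 10 "ranges", 4 per batch, a short pause in between.
seen: List[List[int]] = []
count = paced_batches(list(range(10)), batch_size=4, pause_seconds=0.01, apply=seen.append)
print(count, seen)  # 3 [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A smaller batch or a longer pause lowers the impact on foreground traffic at the cost of a longer total runtime, which is the trade-off described above.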

@wzrdtales
Author

I was estimating the whole runtime; it ended up needing around 24 hours in total, I think. It resolved on its own, but the weird status it was in left us worried (as you noticed :))

With the estimate of 3-4 days, are you saying this cluster has 16,416,000+ ranges?

Not yet that big, "only" 300k+ ranges.
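
As a rough cross-check, here is what a single pass over the ranges would take using the two numbers mentioned in this thread. This is a sketch under assumptions the thread does not confirm: that ~3800 ranges per minute is a constant, cluster-wide rate, that the cluster has exactly 300,000 ranges, and that no retries occur. The function name is illustrative.

```python
def minutes_for_one_pass(total_ranges: int, ranges_per_minute: int) -> float:
    """Minutes needed to visit every range once at a constant rate."""
    return total_ranges / ranges_per_minute

estimate = minutes_for_one_pass(300_000, 3800)
print(f"{estimate:.0f} minutes")  # 79 minutes
```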

@irfansharif
Contributor

I'm glad it resolved, so I'll close this issue. I'll add a suggestion to #72931 to let operators control the pacing here more directly, in case they want to increase the migrate parallelism for clusters with large range counts while keeping an eye on foreground impact. I'll also note that some of these migrations are intended to be long-running and to operate over the timescales you're observing: #48843.


3 participants