Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: test flakes with cannot remove learner while snapshot is in flight #98672

Closed
4 tasks done
kvoli opened this issue Mar 15, 2023 · 4 comments
Closed
4 tasks done
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@kvoli
Copy link
Collaborator

kvoli commented Mar 15, 2023

There are multiple tests flaking due to this error:

cannot remove learner while snapshot is in flight

The error is created when an admin command runs into learners, attempts to remove them and discovers there is a snapshot in flight to the learner.

This error commonly occurs when there are concurrent range modification commands being issued, such as by the replicate queue, store rebalancer and possibly the test itself.

This issue is to track these test flakes and determine a solution which will make tests more robust to this problem; either by updating the code itself or tests.

Flakes:

Jira issue: CRDB-25477

@kvoli kvoli added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 15, 2023
@kvoli kvoli self-assigned this Mar 15, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 15, 2023
@andrewbaptist
Copy link
Contributor

I'm looking through the code for this, and the method maybeLeaveAtomicChangeReplicasAndRemoveLearners is a little strange since it is trying to both exit a joint config and remove learners. The exiting joint config makes sense, but removing learners is a little "harsh". This is used a few different places.

  1. The merge queue - it wants to merge ranges, so it needs to have the range be in a stable state. I'm not sure removing learners is the right option here
  2. AdminRelocateRange - this is mostly called by end users, but also used by the store rebalancer and the merge queue (so the merge queue calls this twice in sequence)
  3. changeReplicasImpl - this is used in the case that we are demoting voters to non-voters. This seems is probably an appropriate usage since we just demoted voters, but it could be overkill if there is another change occurring
  4. execReplicationChangesForVoters - this is only called by changeReplicasImpl in the case of either adding or removing voters.

There is a risk that if there was a concurrent descriptor change, we shouldn’t proceed with our removal of learners. There is a function validateReplicationChanges that we should probably run to verify this if we should still make a change here.

That said, this code is a little bit of a mess, with many repeated calls to the same checks. An AdminRelocateRange call might end up calling this multiple times before it finishes. Many of them are no-ops, but it probably doesn't need to be called as often.

@kvoli we should probably do a walkthrough of this code at some point for 23.2 and figure out how to best clean this up.

@erikgrinaker
Copy link
Contributor

@tbg has a possible mitigation over in #99099.

@tbg
Copy link
Member

tbg commented Mar 22, 2023

#99099 is now in the bors queue. I will leave this issue alone, but we could entertain rolling back some of the ad-hoc mitigations that have occurred so far before closing.

@kvoli kvoli removed their assignment Aug 3, 2023
@kvoli
Copy link
Collaborator Author

kvoli commented Aug 3, 2023

Going to close this out, we've implemented the changes we wanted.

@kvoli kvoli closed this as completed Aug 3, 2023
@kvoli kvoli self-assigned this Aug 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

4 participants