auto_expand_replicas might lead to premature shard deletion #21717
Labels
>bug
:Distributed Indexing/Distributed
help wanted
adoptme
Assume a 2-node cluster with an index that has 1 primary and 1 replica, and `auto_expand_replicas` set to `0-all`. When the second node drops, the first node automatically resets `number_of_replicas` to 0.

This reset is triggered with a delay: `MetaDataUpdateSettingsService` listens for cluster state changed events and, when it detects that the configured `number_of_replicas` no longer matches what the current number of data nodes allows (too many or too few replicas), submits a cluster state update task to adjust it. Assume this update completes successfully. When the second node rejoins, it sees a shard routing table in which all shards are active (i.e., only the primary) and starts deleting its local shard copy. Shortly thereafter (perhaps a few milliseconds later), the first node updates the cluster state again, auto-expanding the number of replicas back to 1. By then, however, the second node has already deleted its data and must resync the complete shard.
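The race can be replayed as a sequence of published cluster states. The sketch below is a simplified, hypothetical model, not Elasticsearch code: `expanded_replicas`, `ClusterState` and `simulate_rejoin` are illustrative names, and the expansion rule (data nodes minus one, clamped to the configured range) is an assumption about how `0-all` resolves. It shows that node-2 makes its deletion decision on the state published by its own join, one state before the replica count is expanded again.

```python
from dataclasses import dataclass


def expanded_replicas(setting: str, data_nodes: int) -> int:
    """Replica count implied by an auto_expand_replicas range like "0-all" or "0-2".

    Assumed rule: one copy per data node besides the primary, clamped to the range.
    """
    low, high = setting.split("-")
    target = data_nodes - 1
    if high != "all":
        target = min(target, int(high))
    return max(target, int(low))


@dataclass
class ClusterState:
    # Hypothetical, heavily reduced view of a cluster state for a single index.
    version: int
    data_nodes: int
    number_of_replicas: int


def simulate_rejoin(setting: str = "0-all") -> tuple[list[ClusterState], bool]:
    """Replay the issue's sequence for one index on a 2-node cluster.

    Hypothetical model: the primary lives on node-1, node-2 holds the replica on disk.
    Returns the published states and whether node-2 ends up needing a full resync.
    """
    history = [ClusterState(1, 2, expanded_replicas(setting, 2))]
    node2_has_copy = True

    def publish(data_nodes: int, replicas: int) -> ClusterState:
        state = ClusterState(history[-1].version + 1, data_nodes, replicas)
        history.append(state)
        return state

    # node-2 drops out: the node count changes first ...
    publish(1, history[-1].number_of_replicas)
    # ... and only afterwards does the settings listener shrink the replica count.
    publish(1, expanded_replicas(setting, 1))

    # node-2 rejoins: the join is published with the still-shrunk replica count.
    rejoined = publish(2, history[-1].number_of_replicas)
    if rejoined.number_of_replicas == 0:
        # All configured copies (primary only) are active, so node-2 treats
        # its on-disk copy as unneeded and deletes it.
        node2_has_copy = False

    # The listener reacts to the join a moment later and expands again.
    final = publish(2, expanded_replicas(setting, 2))
    needs_full_resync = final.number_of_replicas > 0 and not node2_has_copy
    return history, needs_full_resync
```

Calling `simulate_rejoin()` yields replica counts `[1, 1, 0, 0, 1]` across versions 1 to 5 and reports that a full resync is needed: the deletion happens at version 4, while the expansion that would have kept the copy only arrives at version 5.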