Exponential backoff of failed allocation #24530

Closed
clintongormley opened this issue May 6, 2017 · 15 comments
Labels
:Distributed Coordination/Allocation · >enhancement · Team:Distributed (Obsolete)

Comments

@clintongormley
Contributor

In #18467 we solved the problem where the failed allocation of a shard is retried in a tight loop, filling up the log file with exceptions. Now, after five failures, the allocation is no longer attempted until the user triggers it.

The downside of this approach is that it requires user intervention.

Would it be possible to add some kind of exponential backoff so that allocation attempts continue to be made, but less frequently? That way we would still avoid flooding the logs, but if the situation resolves itself, the shard would be allocated automatically.
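
For reference, this is roughly the manual intervention I mean; a minimal sketch assuming the default `index.allocation.max_retries` of 5 and the reroute API's `retry_failed` flag:

```sh
# After index.allocation.max_retries failures (5 by default) the master stops
# attempting the allocation. Today an operator has to ask for another round of
# attempts explicitly, for example:
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
```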

@clintongormley
Contributor Author

CC @ywelsch, @bleskes

@bleskes
Contributor

bleskes commented May 8, 2017

I think it's good to explore this. We can still keep the hard limit (and maybe increase it) - we built the feature for configuration mistakes - but slow down the rate of re-assignment.

@clintongormley did you run into a specific issue that triggered this?

@clintongormley
Contributor Author

@bleskes Just from user feedback

@bleskes added the help wanted and adoptme labels and removed discuss May 12, 2017
@dnhatn self-assigned this Oct 11, 2017
@dnhatn removed the help wanted and adoptme labels Oct 12, 2017
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 23, 2017
Previously, a failed allocation was retried in a tight loop that filled
up log files and caused the cluster to become unstable. We solved this
problem by limiting the number of retries. However, that solution requires
manual intervention once the environment has been fixed. This PR aims to
reduce user intervention by increasing the number of retries and adding
exponential backoff delays between retries.

Closes elastic#24530
@lcawl added the :Distributed Indexing/Distributed label and removed :Allocation Feb 13, 2018
@DaveCTurner added the :Distributed Coordination/Allocation label and removed :Distributed Indexing/Distributed Mar 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

FWIW I think we should lose the limit and just keep trying, at a sufficiently low frequency for it not to be disruptive (e.g. back off to at most once per hour).
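
To illustrate, a rough client-side sketch of that kind of capped backoff (not a server-side implementation; the endpoint and `retry_failed` flag are the existing reroute API, the delays are arbitrary choices for the example):

```sh
#!/usr/bin/env bash
# Sketch: keep asking the master to retry failed allocations, doubling the
# wait between attempts up to a one-hour ceiling.
delay=5        # initial backoff in seconds (arbitrary for this example)
max_delay=3600 # back off to at most once per hour

while true; do
  # retry_failed=true retries shards whose allocation was abandoned after
  # too many failures.
  curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true' > /dev/null
  sleep "$delay"
  delay=$(( delay * 2 ))                                         # exponential backoff
  if [ "$delay" -gt "$max_delay" ]; then delay="$max_delay"; fi  # with a ceiling
done
```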

@original-brownbear self-assigned this Jan 8, 2019
@original-brownbear removed their assignment Oct 3, 2019
@DaveCTurner self-assigned this Oct 16, 2019
@dhwanilpatel

Hello,
If nobody is working on it, I would like to pick it up.
Any initial thoughts are welcome. 🙂

@DaveCTurner removed the help wanted and adoptme labels Jan 17, 2020
@DaveCTurner
Contributor

Thanks @dhwanilpatel. I've already started working on this. I've removed the misleading help wanted label.

@DaveCTurner
Contributor

As far as I've been able to tell, the only case where we need indefinite retries is where the allocation repeatedly fails due to a ShardLockObtainFailedException because the shard is already open on the node thanks to an earlier allocation, and although it's in the process of shutting down it does not do so quickly enough. Frequently this is due to a temporarily flaky network resulting in a node leaving and rejoining the cluster a few times. By default, we wait for 5 seconds and retry 5 times, but it's definitely possible today for a shard to take more than 25 seconds to shut down.

The effect of the proposal here would be to keep retrying until the shard eventually shuts down, no matter how long that takes. I would prefer that we address the underlying causes of slow shard shutdowns, because this will bring the cluster back to health much more quickly and will result in fewer full-shard recoveries after a network wobble.

@DaveCTurner
Contributor

Another cause of failed allocations that would eventually succeed is CircuitBreakingExceptions; we are discussing making recoveries more resilient to memory pressure in #44484.

A related point is that we typically only retry allocation repeatedly on one or two nodes, because we only avoid the very last failed node in the ReplicaShardAllocator. Since #48265 we keep track of the nodes behind all failed allocations, so we could make use of this to try more nodes.

@rjernst added the Team:Distributed (Obsolete) label May 4, 2020
@amathur1893

Hi,

Any update on this? Has this been picked up yet?

@DaveCTurner
Contributor

Work continues on making it so we no longer need this feature, yes.

@xiankaing

xiankaing commented Nov 20, 2020

I doubt it would be possible to avoid every scenario that needs exponential backoff on retries.
Flaky networks are probably here to stay, at least some of the time.

If we're bothering to retry shard allocation anyway, why not do it right and have a backoff system?
It's an easy win (versus fixing the infinite set of possible issues).

By the way, on AWS's ES service in CN-Northeast, shards end up unassigned after N retries more often than seems reasonable - something like once a month.
But if I just bump the retry count by 1 (so that it retries once more) some human amount of time after the cluster status turns red, it works.
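
For reference, a rough sketch of that workaround (the index name `my-index` is just a placeholder; `index.allocation.max_retries` is the dynamic index setting behind the retry limit, and a managed service may restrict some of these APIs):

```sh
# Raise the per-index retry budget by one so the master attempts the allocation again.
curl -X PUT 'http://localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.allocation.max_retries": 6}'

# Or trigger a fresh round of attempts without changing the limit:
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
```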

@jamshid

jamshid commented Aug 23, 2022

Seeing this on 7.5.2. I guess the nodes were out of space, but it stopped retrying? That seems like something that should be retried indefinitely rather than requiring a manual curl POST.

{
  "index" : "index_X0",
  "shard" : 3,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2022-08-09T10:16:09.963Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [Y]: failed recovery, failure RecoveryFailedException[[index_X][3]: Recovery failed from {Z}{A}{j-B}{10.x.x.x}{10.x.x.x:9300}{dilm}{ml.machine_memory=16637530112, ml.max_open_jobs=20, xpack.installed=true} into {X}{Y}{Z}{10.x.x.x}{10.x.x.x:9300}{dilm}{ml.machine_memory=16637530112, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[Y][10.x.x.x:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[Y][10.x.x.x:9300][internal:index/shard/recovery/file_chunk]]; nested: IOException[No space left on device]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",

@DaveCTurner removed their assignment Jan 16, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 20, 2023
Today when applying a new cluster state we block the cluster applier
thread for up to 5s while waiting to acquire each shard lock. Failure to
acquire the shard lock is treated as an allocation failure, so after 5
retries (by default) we give up on the allocation.

The shard lock may be held by some other actor, typically the previous
incarnation of the shard which is still shutting down, but it will
eventually be released. Yet, 5 retries of 5s each is sometimes not
enough time to wait. Instead it makes more sense to wait indefinitely.

Moreover there's no reason why we have to create the `IndexShard` while
applying the cluster state, because the shard remains in state
`INITIALIZING`, and therefore unused, while it coordinates its own
recovery.

With this commit we try and acquire the shard lock during cluster state
application, but do not wait if the lock is unavailable. Instead, we
schedule a retry (also executed on the cluster state applier thread) and
proceed with the rest of the cluster state application process.

Relates elastic#24530
DaveCTurner added a commit that referenced this issue Mar 23, 2023
Today when applying a new cluster state we block the cluster applier thread for
up to 5s while waiting to acquire each shard lock. Failure to acquire the shard
lock is treated as an allocation failure, so after 5 retries (by default) we
give up on the allocation.

The shard lock may be held by some other actor, typically the previous
incarnation of the shard which is still shutting down, but it will eventually
be released. Yet, 5 retries of 5s each is sometimes not enough time to wait.
Knowing that the shard lock will eventually be released, we can retry much more
tenaciously.

Moreover there's no reason why we have to create the `IndexShard` while
applying the cluster state, because the shard remains in state `INITIALIZING`,
and therefore unused, while it coordinates its own recovery.

With this commit we try and acquire the shard lock during cluster state
application, but do not wait if the lock is unavailable. Instead, we schedule a
retry (also executed on the cluster state applier thread) and proceed with the
rest of the cluster state application process.

Relates #24530
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 11, 2023
Relates elastic#24530
Backport of elastic#94545 and elastic#94623 (and a little bit of elastic#94417) to 8.7
DaveCTurner added a commit that referenced this issue Apr 11, 2023
Relates #24530
Backport of #94545 and #94623 (and a little bit of #94417) to 8.7
@DaveCTurner
Contributor

Recent changes such as #95121 and #108145 have greatly diminished the failure rate for shard allocation due to unavailable shard locks, and other miscellaneous changes have made it less susceptible to memory pressure too. We'll continue to address other reasons for failed allocations, but as a general rule we'd rather make the recovery process resilient to failure at lower levels and avoid retrying the top-level allocation completely. Therefore I'm closing this.

@DaveCTurner closed this as not planned May 30, 2024