Exponential backoff of failed allocation #24530

Closed
clintongormley opened this issue May 6, 2017 · 15 comments
Labels
:Distributed Coordination/Allocation · >enhancement · Team:Distributed (Obsolete)

Comments

@clintongormley
Contributor

In #18467 we solved the problem where the failed allocation of a shard is retried in a tight loop, filling up the log file with exceptions. Now, after five failures, the allocation is no longer attempted until the user triggers it.

The downside of this approach is that it requires user intervention.

Would it be possible to add some kind of exponential backoff so that allocation attempts continue to be made, but less frequently? That way we would still avoid flooding the logs, but if the situation resolves itself, the shard would be allocated automatically.
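
For reference, this is roughly the manual intervention I mean; a minimal sketch assuming the default `index.allocation.max_retries` of 5 and the reroute API's `retry_failed` flag:

```sh
# After index.allocation.max_retries failures (5 by default) the master stops
# attempting the allocation. Today an operator has to ask for another round of
# attempts explicitly, for example:
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
```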

@clintongormley
Contributor Author

CC @ywelsch, @bleskes

@bleskes
Contributor

bleskes commented May 8, 2017

I think it's good to explore this. We can still keep the hard limit (and maybe increase it) - we built the feature for configuration mistakes - but slow down the rate of re-assignment.

@clintongormley did you run into a specific issue that triggered this?

@clintongormley
Contributor Author

@bleskes Just from user feedback

@bleskes added the help wanted and adoptme labels and removed discuss May 12, 2017
@dnhatn self-assigned this Oct 11, 2017
@dnhatn removed the help wanted and adoptme labels Oct 12, 2017
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 23, 2017
Previously, a failed allocation was retried in a tight loop that filled
up log files and caused the cluster to become unstable. We solved this
problem by limiting the number of retries. However, that solution requires
manual intervention once the environment has been fixed. This PR aims to
reduce user intervention by increasing the number of retries and adding
exponential backoff delays between retries.

Closes elastic#24530
@lcawl added the :Distributed Indexing/Distributed label and removed :Allocation Feb 13, 2018
@DaveCTurner added the :Distributed Coordination/Allocation label and removed :Distributed Indexing/Distributed Mar 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

FWIW I think we should lose the limit and just keep trying, at a sufficiently low frequency for it not to be disruptive (e.g. back off to at most once per hour).
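
To illustrate, a rough client-side sketch of that kind of capped backoff (not a server-side implementation; the endpoint and `retry_failed` flag are the existing reroute API, the delays are arbitrary choices for the example):

```sh
#!/usr/bin/env bash
# Sketch: keep asking the master to retry failed allocations, doubling the
# wait between attempts up to a one-hour ceiling.
delay=5        # initial backoff in seconds (arbitrary for this example)
max_delay=3600 # back off to at most once per hour

while true; do
  # retry_failed=true retries shards whose allocation was abandoned after
  # too many failures.
  curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true' > /dev/null
  sleep "$delay"
  delay=$(( delay * 2 ))                                         # exponential backoff
  if [ "$delay" -gt "$max_delay" ]; then delay="$max_delay"; fi  # with a ceiling
done
```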

@original-brownbear self-assigned this Jan 8, 2019
@original-brownbear removed their assignment Oct 3, 2019
@DaveCTurner self-assigned this Oct 16, 2019
@dhwanilpatel

Hello,
If nobody is working on it, I would like to pick it up.
Any initial thoughts are welcome. 🙂

@DaveCTurner removed the help wanted and adoptme labels Jan 17, 2020
@DaveCTurner
Contributor

Thanks @dhwanilpatel. I've already started working on this. I've removed the misleading help wanted label.

@DaveCTurner
Contributor

As far as I've been able to tell, the only case where we need indefinite retries is where the allocation repeatedly fails due to a ShardLockObtainFailedException because the shard is already open on the node thanks to an earlier allocation, and although it's in the process of shutting down it does not do so quickly enough. Frequently this is due to a temporarily flaky network resulting in a node leaving and rejoining the cluster a few times. By default, we wait for 5 seconds and retry 5 times, but it's definitely possible today for a shard to take more than 25 seconds to shut down.

The effect of the proposal here would be to keep retrying until the shard eventually shuts down, no matter how long that takes. I would prefer that we address the underlying causes of slow shard shutdowns, because this will bring the cluster back to health much more quickly and will result in fewer full-shard recoveries after a network wobble.

@DaveCTurner
Contributor

Another cause of failed allocations that would eventually succeed is CircuitBreakingExceptions; we are discussing making recoveries more resilient to memory pressure in #44484.

A related point is that we typically only retry allocation repeatedly on one or two nodes, because we only avoid the very last failed node in the ReplicaShardAllocator. Since #48265 we keep track of the nodes behind all failed allocations, so we could make use of this to try more nodes.

@rjernst added the Team:Distributed (Obsolete) label May 4, 2020
@amathur1893

Hi,

Any update on this? Has this been picked up yet?

@DaveCTurner
Contributor

Work continues on making it so we no longer need this feature, yes.

@xiankaing

xiankaing commented Nov 20, 2020

I doubt it would be possible to avoid every scenario that needs exponential backoff on retries.
Flaky networks are probably here to stay, at least some of the time.

If we're bothering to retry shard allocation anyway, why not do it right and have a backoff system?
It's an easy win (versus fixing the infinite set of possible issues).

By the way, on AWS's ES service in CN-Northeast, shards end up unassigned after N retries more often than seems reasonable - something like once a month.
But if I just bump the retry count by 1 (so that it retries once more) some human amount of time after the cluster status turns red, it works.
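
For reference, a rough sketch of that workaround (the index name `my-index` is just a placeholder; `index.allocation.max_retries` is the dynamic index setting behind the retry limit, and a managed service may restrict some of these APIs):

```sh
# Raise the per-index retry budget by one so the master attempts the allocation again.
curl -X PUT 'http://localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.allocation.max_retries": 6}'

# Or trigger a fresh round of attempts without changing the limit:
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
```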

@jamshid

jamshid commented Aug 23, 2022

Seeing this on 7.5.2. I guess the nodes were out of space, but it stopped retrying? That seems like something that should be retried indefinitely rather than requiring a manual curl POST.

{
  "index" : "index_X0",
  "shard" : 3,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2022-08-09T10:16:09.963Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [Y]: failed recovery, failure RecoveryFailedException[[index_X][3]: Recovery failed from {Z}{A}{j-B}{10.x.x.x}{10.x.x.x:9300}{dilm}{ml.machine_memory=16637530112, ml.max_open_jobs=20, xpack.installed=true} into {X}{Y}{Z}{10.x.x.x}{10.x.x.x:9300}{dilm}{ml.machine_memory=16637530112, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[Y][10.x.x.x:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[Y][10.x.x.x:9300][internal:index/shard/recovery/file_chunk]]; nested: IOException[No space left on device]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",

@DaveCTurner removed their assignment Jan 16, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 20, 2023
Today when applying a new cluster state we block the cluster applier
thread for up to 5s while waiting to acquire each shard lock. Failure to
acquire the shard lock is treated as an allocation failure, so after 5
retries (by default) we give up on the allocation.

The shard lock may be held by some other actor, typically the previous
incarnation of the shard which is still shutting down, but it will
eventually be released. Yet, 5 retries of 5s each is sometimes not
enough time to wait. Instead it makes more sense to wait indefinitely.

Moreover there's no reason why we have to create the `IndexShard` while
applying the cluster state, because the shard remains in state
`INITIALIZING`, and therefore unused, while it coordinates its own
recovery.

With this commit we try and acquire the shard lock during cluster state
application, but do not wait if the lock is unavailable. Instead, we
schedule a retry (also executed on the cluster state applier thread) and
proceed with the rest of the cluster state application process.

Relates elastic#24530
DaveCTurner added a commit that referenced this issue Mar 23, 2023
Today when applying a new cluster state we block the cluster applier thread for
up to 5s while waiting to acquire each shard lock. Failure to acquire the shard
lock is treated as an allocation failure, so after 5 retries (by default) we
give up on the allocation.

The shard lock may be held by some other actor, typically the previous
incarnation of the shard which is still shutting down, but it will eventually
be released. Yet, 5 retries of 5s each is sometimes not enough time to wait.
Knowing that the shard lock will eventually be released, we can retry much more
tenaciously.

Moreover there's no reason why we have to create the `IndexShard` while
applying the cluster state, because the shard remains in state `INITIALIZING`,
and therefore unused, while it coordinates its own recovery.

With this commit we try and acquire the shard lock during cluster state
application, but do not wait if the lock is unavailable. Instead, we schedule a
retry (also executed on the cluster state applier thread) and proceed with the
rest of the cluster state application process.

Relates #24530
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 11, 2023
Relates elastic#24530
Backport of elastic#94545 and elastic#94623 (and a little bit of elastic#94417) to 8.7
DaveCTurner added a commit that referenced this issue Apr 11, 2023
Relates #24530
Backport of #94545 and #94623 (and a little bit of #94417) to 8.7
@DaveCTurner
Contributor

Recent changes such as #95121 and #108145 have greatly diminished the failure rate for shard allocation due to unavailable shard locks, and other miscellaneous changes have made it less susceptible to memory pressure too. We'll continue to address other reasons for failed allocations, but as a general rule we'd rather make the recovery process resilient to failure at lower levels and avoid retrying the top-level allocation completely. Therefore I'm closing this.

@DaveCTurner closed this as not planned May 30, 2024