
Async creation of IndexShard instances #94545

Merged (21 commits), Mar 23, 2023

Conversation

@DaveCTurner (Contributor) commented Mar 20, 2023

Today when applying a new cluster state we block the cluster applier thread for up to 5s while waiting to acquire each shard lock. Failure to acquire the shard lock is treated as an allocation failure, so after 5 retries (by default) we give up on the allocation.

The shard lock may be held by some other actor, typically the previous incarnation of the shard which is still shutting down, but it will eventually be released. Yet, 5 retries of 5s each is sometimes not enough time to wait. Knowing that the shard lock will eventually be released, we can retry much more tenaciously.

Moreover there's no reason why we have to create the IndexShard while applying the cluster state, because the shard remains in state INITIALIZING, and therefore unused, while it coordinates its own recovery.

With this commit we try to acquire the shard lock during cluster state application, but do not wait if the lock is unavailable. Instead, we schedule a retry (also executed on the cluster state applier thread) and proceed with the rest of the cluster state application process.

Relates #24530
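For illustration, here is a hedged sketch of the approach described above. It is not the Elasticsearch implementation: the ReentrantLock and scheduler stand in for the real ShardLock and cluster applier thread, and only the method name createShardWhenLockAvailable mirrors the PR. The point is the shape of the logic: try the lock without blocking and, if it is still held, schedule another attempt rather than blocking the applier thread or failing the allocation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class NonBlockingShardCreation {
    private final ReentrantLock shardLock = new ReentrantLock();   // stands in for the real per-shard ShardLock
    private final ScheduledExecutorService applierThread = Executors.newSingleThreadScheduledExecutor();
    private final long retryIntervalMillis = 1_000L;               // cf. the indices.store.shard_lock_retry_interval setting

    void createShardWhenLockAvailable(Runnable createShard) {
        if (shardLock.tryLock()) {                                 // no 5s wait, no allocation failure
            try {
                createShard.run();                                 // shard stays INITIALIZING until its recovery completes
            } finally {
                shardLock.unlock();                                // released here only to keep the toy self-contained;
            }                                                      // the real ShardLock is retained by the IndexShard
        } else {
            // Lock still held, e.g. by the previous incarnation of the shard that is shutting down:
            // retry later instead of blocking the applier thread.
            applierThread.schedule(() -> createShardWhenLockAvailable(createShard), retryIntervalMillis, TimeUnit.MILLISECONDS);
        }
    }
}
```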

@DaveCTurner added the >bug, WIP, :Distributed Indexing/Recovery, and v8.8.0 labels Mar 20, 2023
@DaveCTurner marked this pull request as ready for review March 20, 2023 13:52
@DaveCTurner (Contributor Author)

It's worth considering whether to backport this to 8.7 too.

@henningandersen (Contributor) left a comment

Did an initial read and left a few comments. I think we should still bound the wait time to ensure that, for extraordinarily long waits, we still progress on other recoveries.

@@ -434,7 +433,7 @@ public synchronized IndexShard createShard(
ShardLock lock = null;
eventListener.beforeIndexShardCreated(routing, indexSettings);
try {
-        lock = nodeEnv.shardLock(shardId, "starting shard", TimeUnit.SECONDS.toMillis(5));
+        lock = nodeEnv.shardLock(shardId, "starting shard", 0L);
Contributor

Can we ensure that shardLock does not throw an exception due to an interrupted exception somehow? It would mostly make this a lot easier to reason about; I don't think there is a problem in it as is.

Contributor Author

I don't see an easy way to do this, at least not without reworking the whole shard-lock mechanism. Which I'd love to do at some point, but that's a task for another day. I added a little extra protection in b722143 anyway.

@@ -86,6 +89,12 @@
public class IndicesClusterStateService extends AbstractLifecycleComponent implements ClusterStateApplier {
private static final Logger logger = LogManager.getLogger(IndicesClusterStateService.class);

public static final Setting<TimeValue> SHARD_LOCK_RETRY_INTERVAL_SETTING = Setting.timeSetting(
"indices.store.shard_lock_retry_interval",
TimeValue.timeValueSeconds(5),
Contributor

This seems a bit high to me, considering that this was the expected max wait time in the past. I'd suggest 1s instead.

Contributor Author

Ok, I did that in 1379d8c and made it so that we don't emit a WARN log on every retry.

failAndRemoveShard(shardRouting, true, "failed to create shard", e, state);
listener.onResponse(true);
} catch (ShardLockObtainFailedException e) {
logger.warn("shard lock currently unavailable for [{}], retrying in [{}]", shardRouting, shardLockRetryInterval);
Contributor

Can we include information on the wait-time so far in the message?

Contributor Author

Not easily as things stand, because we stop retrying (and start a new retry loop) on each cluster state update. But possibly; I think that to make the wait bounded we'll need to know this information.

} catch (ShardLockObtainFailedException e) {
logger.warn("shard lock currently unavailable for [{}], retrying in [{}]", shardRouting, shardLockRetryInterval);
// TODO could we instead subscribe to the shard lock and trigger the retry exactly when it is released rather than polling?
threadPool.scheduleUnlessShuttingDown(
Contributor

Can we make the time we wait still bounded and configurable? Something like a minute by default seems appropriate to me: it greatly reduces the risk of seeing shard lock obtain exceptions for nodes that are simply under load, but does not wait so long that we delay the allocation of other shards on the node.

Contributor Author

I'm still thinking about how to do this. It's easy enough if there are no cluster state updates for a minute, but if we're constantly updating the cluster state then this implementation will restart the retry loop each time.
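As a side note on tracking the wait across cluster state updates, here is a minimal sketch of one way to do it, with illustrative names only (it is not the PR's code, though it mirrors the PendingShardCreation record that appears later in this review): keep the original start time when a new cluster state restarts the retry loop, so the elapsed wait can be logged or bounded from the first attempt rather than from the latest state.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PendingShardCreations {
    record Pending(String clusterStateUUID, long startTimeMillis) {}

    private final Map<Integer, Pending> pendingByShard = new ConcurrentHashMap<>();

    /** Record the first attempt for a shard; on retries, carry the original start time forward. */
    Pending track(int shardId, String clusterStateUUID, long nowMillis) {
        return pendingByShard.compute(
            shardId,
            (id, existing) -> existing == null
                ? new Pending(clusterStateUUID, nowMillis)
                : new Pending(clusterStateUUID, existing.startTimeMillis())   // new state, same start time
        );
    }

    /** How long this shard has been waiting for its lock, measured from the first attempt. */
    long elapsedMillis(int shardId, long nowMillis) {
        final Pending pending = pendingByShard.get(shardId);
        return pending == null ? 0L : nowMillis - pending.startTimeMillis();
    }

    /** Forget the shard once it is created, failed, or removed from the node. */
    void clear(int shardId) {
        pendingByShard.remove(shardId);
    }
}
```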


final var indexService = indicesService.indexService(shardRouting.index());
if (indexService == null) {
final var message = "index service unexpectedly found for " + shardRouting;
Contributor

Suggested change:
- final var message = "index service unexpectedly found for " + shardRouting;
+ final var message = "index service unexpectedly not found for " + shardRouting;

Contributor Author

Done in 52c7d8f.

updateIndexSettings(Settings.builder().putNull(IndexMetadata.INDEX_ROUTING_EXCLUDE_GROUP_PREFIX + "._name"), indexName);
ensureYellow(indexName);
assertBusy(mockLogAppender::assertAllExpectationsMatched);
}
Contributor

Can we assert that the index is not green? I know timing may not allow it to fail, but still seems good to assert.

Contributor Author

Yep, see 92f31b0

@DaveCTurner removed the WIP label Mar 22, 2023
@elasticsearchmachine added the Team:Distributed label Mar 22, 2023
@elasticsearchmachine (Collaborator)

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

protected Settings nodeSettings(int nodeOrdinal, Settings otherSettings) {
return Settings.builder()
.put(super.nodeSettings(nodeOrdinal, otherSettings))
.put(IndicesClusterStateService.SHARD_LOCK_RETRY_INTERVAL_SETTING.getKey(), TimeValue.timeValueMillis(10))
Contributor

Is this low timeout what prevents acquiring the shard lock?

Contributor Author

Contributor

ah right! thanks for the clarification

}
logger.log(
(iteration + 25) % 30 == 0 ? Level.WARN : Level.DEBUG,
"shard lock currently unavailable for [{}], retrying in [{}]: [{}]",
Member

I think it would help to have the state UUID/version of the cluster state here for debugging.

Contributor Author

++ see e619a47.
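As a side note on the throttling in the hunk above (a standalone check, not the PR's code): the condition (iteration + 25) % 30 == 0 logs at WARN on the 5th retry and then on every 30th retry after that, with everything else at DEBUG; at the 1s interval suggested earlier, that is a first warning after roughly five seconds and then about twice a minute.

```java
public class LogEscalationCheck {
    public static void main(String[] args) {
        for (int iteration = 1; iteration <= 100; iteration++) {
            if ((iteration + 25) % 30 == 0) {
                System.out.println("WARN on retry " + iteration);   // prints retries 5, 35, 65, 95
            }
        }
    }
}
```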

@@ -434,7 +433,7 @@ public synchronized IndexShard createShard(
ShardLock lock = null;
eventListener.beforeIndexShardCreated(routing, indexSettings);
try {
-        lock = nodeEnv.shardLock(shardId, "starting shard", TimeUnit.SECONDS.toMillis(5));
+        lock = nodeEnv.shardLock(shardId, "starting shard", 0L);
Member

We can use shardLock(ShardId id, final String details) here. I think a comment describing that the shard creation is retried would be useful too

Contributor Author

++ see e4e23f0

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 22, 2023
Removes some unnecessary parameters from internal methods. Extracted
from elastic#94545 to reduce noise.
elasticsearchmachine pushed a commit that referenced this pull request Mar 22, 2023
Removes some unnecessary parameters from internal methods. Extracted
from #94545 to reduce noise.
@henningandersen (Contributor) left a comment

This looks good to me. Can we provoke a few runs of the full test suite to ensure at least all the disruptive tests get a chance to run a few times?

@@ -662,6 +832,8 @@ private static DiscoveryNode findSourceNodeForPeerRecovery(RoutingTable routingT
return sourceNode;
}

private record PendingShardCreation(String clusterStateUuid, long startTimeMillis) {}
Contributor

nit: I think we (and java) always use UUID, not Uuid.

pendingShardCreations.remove(shardId, pendingShardCreation);
})
);
} catch (Exception e) {
Contributor

Do we need to remove from pendingShardCreations here? I wonder if we should catch Exception in createShardWhenLockAvailable and let it call listener.onFailure? I prefer to have methods that either invoke a listener or throw, not both.

Contributor Author

Ah yes good catch. Changed to catching all exceptions within the method in d6188af.
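For readers skimming the thread, a hedged sketch of the listener discipline being discussed, with illustrative names and a stand-in listener interface rather than Elasticsearch's ActionListener: the method that accepts a listener reports every outcome, including its own unexpected exceptions, through that listener, so it never both throws and notifies, and callers need no surrounding try/catch.

```java
class ListenerDiscipline {
    // Minimal stand-in for a response/failure listener.
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    static void createShardWhenLockAvailable(Runnable createShard, Listener<Boolean> listener) {
        boolean created;
        try {
            createShard.run();         // may fail for reasons other than the shard lock
            created = true;
        } catch (Exception e) {
            listener.onFailure(e);     // funnel failures into the listener rather than rethrowing
            return;
        }
        listener.onResponse(created);  // exactly one notification per call
    }
}
```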

@tlrx (Member) left a comment

LGTM

);
ensureGreen(indexName);

final var shardId = client().admin()
Member

We can save a few lines if you want (here and in the other test too):

Suggested change:
- final var shardId = client().admin()
+ final var shardId = new ShardId(resolveIndex(index), 0)

Contributor Author

TIL, thanks 😄

@DaveCTurner (Contributor Author)

https://gradle-enterprise.elastic.co/s/4t6cp2obsipzy was a test failure that looked related to this change (at least it pertains to shard locks and node locks), but I can reproduce it in main (see #94672), and on a deeper look I don't think this change even makes it more likely to occur. Is it just very unlucky that the first time it has failed like this in years happens to be on a PR touching related code? Seems suspicious, but then stranger things have happened too.

@elasticmachine please run elasticsearch-ci/part-1

@DaveCTurner (Contributor Author)

> Can we provoke a few runs of the full test suite to ensure at least all the disruptive tests get a chance to run a few times?

I've had ./gradlew :server:test :server:internalclustertest running on my CI machine in a loop for a few hours without any problems. Do you want a few full CI runs too or is that enough?

@henningandersen (Contributor) left a comment

LGTM.

> I've had ./gradlew :server:test :server:internalclustertest running on my CI machine in a loop for a few hours without any problems. Do you want a few full CI runs too or is that enough?

Thanks, that seems adequate.

> It's worth considering whether to backport this to 8.7 too.

I'd prefer to let it burn in for a couple of days first, but otherwise it does seem like it could be worth backporting.

@DaveCTurner merged commit eb82fa2 into elastic:main Mar 23, 2023
@DaveCTurner deleted the 2023-03-20-create-shard-async branch March 23, 2023 14:58
@DaveCTurner (Contributor Author)

Thanks all!

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 11, 2023
Backport of elastic#94545 and elastic#94623 (and a little bit of elastic#94417) to 8.7
DaveCTurner added a commit that referenced this pull request Apr 11, 2023
Backport of #94545 and #94623 (and a little bit of #94417) to 8.7
@DaveCTurner (Contributor Author)

Backported to 8.7 in #95121
