
Async creation of IndexShard instances #94545

Merged (21 commits), Mar 23, 2023

Conversation

@DaveCTurner (Contributor) commented Mar 20, 2023

Today when applying a new cluster state we block the cluster applier thread for up to 5s while waiting to acquire each shard lock. Failure to acquire the shard lock is treated as an allocation failure, so after 5 retries (by default) we give up on the allocation.

The shard lock may be held by some other actor, typically the previous incarnation of the shard which is still shutting down, but it will eventually be released. Yet, 5 retries of 5s each is sometimes not enough time to wait. Knowing that the shard lock will eventually be released, we can retry much more tenaciously.

Moreover there's no reason why we have to create the IndexShard while applying the cluster state, because the shard remains in state INITIALIZING, and therefore unused, while it coordinates its own recovery.

With this commit we try to acquire the shard lock during cluster state application, but do not wait if the lock is unavailable. Instead, we schedule a retry (also executed on the cluster state applier thread) and proceed with the rest of the cluster state application process.

Relates #24530
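For illustration, here is a hedged sketch of the approach described above. It is not the Elasticsearch implementation: the ReentrantLock and scheduler stand in for the real ShardLock and cluster applier thread, and only the method name createShardWhenLockAvailable mirrors the PR. The point is the shape of the logic: try the lock without blocking and, if it is still held, schedule another attempt rather than blocking the applier thread or failing the allocation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class NonBlockingShardCreation {
    private final ReentrantLock shardLock = new ReentrantLock();   // stands in for the real per-shard ShardLock
    private final ScheduledExecutorService applierThread = Executors.newSingleThreadScheduledExecutor();
    private final long retryIntervalMillis = 1_000L;               // cf. the indices.store.shard_lock_retry_interval setting

    void createShardWhenLockAvailable(Runnable createShard) {
        if (shardLock.tryLock()) {                                 // no 5s wait, no allocation failure
            try {
                createShard.run();                                 // shard stays INITIALIZING until its recovery completes
            } finally {
                shardLock.unlock();                                // released here only to keep the toy self-contained;
            }                                                      // the real ShardLock is retained by the IndexShard
        } else {
            // Lock still held, e.g. by the previous incarnation of the shard that is shutting down:
            // retry later instead of blocking the applier thread.
            applierThread.schedule(() -> createShardWhenLockAvailable(createShard), retryIntervalMillis, TimeUnit.MILLISECONDS);
        }
    }
}
```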

@DaveCTurner added the >bug, WIP, :Distributed Indexing/Recovery, and v8.8.0 labels Mar 20, 2023
@DaveCTurner marked this pull request as ready for review March 20, 2023 13:52
@DaveCTurner (Contributor Author)

It's worth considering whether to backport this to 8.7 too.

@henningandersen (Contributor) left a comment

Did an initial read and left a few comments. I think we should still bound the wait time to ensure that, for extraordinarily long waits, we still progress on other recoveries.

@@ -434,7 +433,7 @@ public synchronized IndexShard createShard(
ShardLock lock = null;
eventListener.beforeIndexShardCreated(routing, indexSettings);
try {
-        lock = nodeEnv.shardLock(shardId, "starting shard", TimeUnit.SECONDS.toMillis(5));
+        lock = nodeEnv.shardLock(shardId, "starting shard", 0L);
Contributor

Can we ensure that shardLock does not throw an exception due to an interrupted exception somehow? It would mostly make this a lot easier to reason about; I don't think there is a problem in it as is.

Contributor Author

I don't see an easy way to do this, at least not without reworking the whole shard-lock mechanism. Which I'd love to do at some point, but that's a task for another day. I added a little extra protection in b722143 anyway.

@@ -86,6 +89,12 @@
public class IndicesClusterStateService extends AbstractLifecycleComponent implements ClusterStateApplier {
private static final Logger logger = LogManager.getLogger(IndicesClusterStateService.class);

public static final Setting<TimeValue> SHARD_LOCK_RETRY_INTERVAL_SETTING = Setting.timeSetting(
"indices.store.shard_lock_retry_interval",
TimeValue.timeValueSeconds(5),
Contributor

This seems a bit high to me, considering that this was the expected max wait time in the past. I'd suggest 1s instead.

Contributor Author

Ok, I did that in 1379d8c and made it so that we don't emit a WARN log on every retry.

failAndRemoveShard(shardRouting, true, "failed to create shard", e, state);
listener.onResponse(true);
} catch (ShardLockObtainFailedException e) {
logger.warn("shard lock currently unavailable for [{}], retrying in [{}]", shardRouting, shardLockRetryInterval);
Contributor

Can we include information on the wait-time so far in the message?

Contributor Author

Not easily as things stand, because we stop retrying (and start a new retry loop) on each cluster state update. But possibly; I think that to make the wait bounded we'll need to know this information.

} catch (ShardLockObtainFailedException e) {
logger.warn("shard lock currently unavailable for [{}], retrying in [{}]", shardRouting, shardLockRetryInterval);
// TODO could we instead subscribe to the shard lock and trigger the retry exactly when it is released rather than polling?
threadPool.scheduleUnlessShuttingDown(
Contributor

Can we make the time we wait still bounded and configurable? Something like a minute by default seems appropriate to me: it greatly reduces the risk of seeing shard lock obtain exceptions for nodes that are simply under load, but does not wait so long that we delay the allocation of other shards on the node.

Contributor Author

I'm still thinking about how to do this. It's easy enough if there are no cluster state updates for a minute, but if we're constantly updating the cluster state then this implementation will restart the retry loop each time.
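As a side note on tracking the wait across cluster state updates, here is a minimal sketch of one way to do it, with illustrative names only (it is not the PR's code, though it mirrors the PendingShardCreation record that appears later in this review): keep the original start time when a new cluster state restarts the retry loop, so the elapsed wait can be logged or bounded from the first attempt rather than from the latest state.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PendingShardCreations {
    record Pending(String clusterStateUUID, long startTimeMillis) {}

    private final Map<Integer, Pending> pendingByShard = new ConcurrentHashMap<>();

    /** Record the first attempt for a shard; on retries, carry the original start time forward. */
    Pending track(int shardId, String clusterStateUUID, long nowMillis) {
        return pendingByShard.compute(
            shardId,
            (id, existing) -> existing == null
                ? new Pending(clusterStateUUID, nowMillis)
                : new Pending(clusterStateUUID, existing.startTimeMillis())   // new state, same start time
        );
    }

    /** How long this shard has been waiting for its lock, measured from the first attempt. */
    long elapsedMillis(int shardId, long nowMillis) {
        final Pending pending = pendingByShard.get(shardId);
        return pending == null ? 0L : nowMillis - pending.startTimeMillis();
    }

    /** Forget the shard once it is created, failed, or removed from the node. */
    void clear(int shardId) {
        pendingByShard.remove(shardId);
    }
}
```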


final var indexService = indicesService.indexService(shardRouting.index());
if (indexService == null) {
final var message = "index service unexpectedly found for " + shardRouting;
Contributor

Suggested change:
- final var message = "index service unexpectedly found for " + shardRouting;
+ final var message = "index service unexpectedly not found for " + shardRouting;

Contributor Author

Done in 52c7d8f.

updateIndexSettings(Settings.builder().putNull(IndexMetadata.INDEX_ROUTING_EXCLUDE_GROUP_PREFIX + "._name"), indexName);
ensureYellow(indexName);
assertBusy(mockLogAppender::assertAllExpectationsMatched);
}
Contributor

Can we assert that the index is not green? I know timing may not allow it to fail, but still seems good to assert.

Contributor Author

Yep, see 92f31b0

@DaveCTurner removed the WIP label Mar 22, 2023
@elasticsearchmachine added the Team:Distributed label Mar 22, 2023
@elasticsearchmachine (Collaborator)

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

protected Settings nodeSettings(int nodeOrdinal, Settings otherSettings) {
return Settings.builder()
.put(super.nodeSettings(nodeOrdinal, otherSettings))
.put(IndicesClusterStateService.SHARD_LOCK_RETRY_INTERVAL_SETTING.getKey(), TimeValue.timeValueMillis(10))
Contributor

Is this low timeout what prevents acquiring the shard lock?

Contributor Author

Contributor

ah right! thanks for the clarification

}
logger.log(
(iteration + 25) % 30 == 0 ? Level.WARN : Level.DEBUG,
"shard lock currently unavailable for [{}], retrying in [{}]: [{}]",
Member

I think it would help to have the state UUID/version of the cluster state here for debugging.

Contributor Author

++ see e619a47.
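As a side note on the throttling in the hunk above (a standalone check, not the PR's code): the condition (iteration + 25) % 30 == 0 logs at WARN on the 5th retry and then on every 30th retry after that, with everything else at DEBUG; at the 1s interval suggested earlier, that is a first warning after roughly five seconds and then about twice a minute.

```java
public class LogEscalationCheck {
    public static void main(String[] args) {
        for (int iteration = 1; iteration <= 100; iteration++) {
            if ((iteration + 25) % 30 == 0) {
                System.out.println("WARN on retry " + iteration);   // prints retries 5, 35, 65, 95
            }
        }
    }
}
```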

@@ -434,7 +433,7 @@ public synchronized IndexShard createShard(
ShardLock lock = null;
eventListener.beforeIndexShardCreated(routing, indexSettings);
try {
-        lock = nodeEnv.shardLock(shardId, "starting shard", TimeUnit.SECONDS.toMillis(5));
+        lock = nodeEnv.shardLock(shardId, "starting shard", 0L);
Member

We can use shardLock(ShardId id, final String details) here. I think a comment describing that the shard creation is retried would be useful too

Contributor Author

++ see e4e23f0

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 22, 2023
Removes some unnecessary parameters from internal methods. Extracted
from elastic#94545 to reduce noise.
elasticsearchmachine pushed a commit that referenced this pull request Mar 22, 2023
Removes some unnecessary parameters from internal methods. Extracted
from #94545 to reduce noise.
@henningandersen (Contributor) left a comment

This looks good to me. Can we provoke a few runs of the full test suite to ensure at least all the disruptive tests get a chance to run a few times?

@@ -662,6 +832,8 @@ private static DiscoveryNode findSourceNodeForPeerRecovery(RoutingTable routingT
return sourceNode;
}

private record PendingShardCreation(String clusterStateUuid, long startTimeMillis) {}
Contributor

nit: I think we (and java) always use UUID, not Uuid.

pendingShardCreations.remove(shardId, pendingShardCreation);
})
);
} catch (Exception e) {
Contributor

Do we need to remove from pendingShardCreations here? I wonder if we should catch Exception in createShardWhenLockAvailable and let it call listener.onFailure? I prefer to have methods that either invoke a listener or throw, not both.

Contributor Author

Ah yes good catch. Changed to catching all exceptions within the method in d6188af.
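For readers skimming the thread, a hedged sketch of the listener discipline being discussed, with illustrative names and a stand-in listener interface rather than Elasticsearch's ActionListener: the method that accepts a listener reports every outcome, including its own unexpected exceptions, through that listener, so it never both throws and notifies, and callers need no surrounding try/catch.

```java
class ListenerDiscipline {
    // Minimal stand-in for a response/failure listener.
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    static void createShardWhenLockAvailable(Runnable createShard, Listener<Boolean> listener) {
        boolean created;
        try {
            createShard.run();         // may fail for reasons other than the shard lock
            created = true;
        } catch (Exception e) {
            listener.onFailure(e);     // funnel failures into the listener rather than rethrowing
            return;
        }
        listener.onResponse(created);  // exactly one notification per call
    }
}
```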

@tlrx (Member) left a comment

LGTM

);
ensureGreen(indexName);

final var shardId = client().admin()
Member

We can save a few lines if you want (here and in the other test too):

Suggested change:
- final var shardId = client().admin()
+ final var shardId = new ShardId(resolveIndex(index), 0)

Contributor Author

TIL, thanks 😄

@DaveCTurner (Contributor Author)

https://gradle-enterprise.elastic.co/s/4t6cp2obsipzy was a test failure that looked related to this change (at least it pertains to shard locks and node locks), but I can reproduce it in main (see #94672), and on a deeper look I don't think this change even makes it more likely to occur. Is it just very unlucky that the first time it has failed like this in years happens to be on a PR touching related code? Seems suspicious, but then stranger things have happened too.

@elasticmachine please run elasticsearch-ci/part-1

@DaveCTurner (Contributor Author)

> Can we provoke a few runs of the full test suite to ensure at least all the disruptive tests get a chance to run a few times?

I've had ./gradlew :server:test :server:internalclustertest running on my CI machine in a loop for a few hours without any problems. Do you want a few full CI runs too or is that enough?

@henningandersen (Contributor) left a comment

LGTM.

> I've had ./gradlew :server:test :server:internalclustertest running on my CI machine in a loop for a few hours without any problems. Do you want a few full CI runs too or is that enough?

Thanks, that seems adequate.

> It's worth considering whether to backport this to 8.7 too.

I'd prefer to let it burn in for a couple of days first, but otherwise it does seem like it could be worth backporting.

@DaveCTurner merged commit eb82fa2 into elastic:main Mar 23, 2023
@DaveCTurner deleted the 2023-03-20-create-shard-async branch March 23, 2023 14:58
@DaveCTurner (Contributor Author)

Thanks all!

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 11, 2023
Backport of elastic#94545 and elastic#94623 (and a little bit of elastic#94417) to 8.7
DaveCTurner added a commit that referenced this pull request Apr 11, 2023
Backport of #94545 and #94623 (and a little bit of #94417) to 8.7
@DaveCTurner (Contributor Author)

Backported to 8.7 in #95121
