Async creation of IndexShard instances #94545
Conversation
Today when applying a new cluster state we block the cluster applier thread for up to 5s while waiting to acquire each shard lock. Failure to acquire the shard lock is treated as an allocation failure, so after 5 retries (by default) we give up on the allocation.

The shard lock may be held by some other actor, typically the previous incarnation of the shard which is still shutting down, but it will eventually be released. Yet, 5 retries of 5s each is sometimes not enough time to wait. Instead it makes more sense to wait indefinitely.

Moreover there's no reason why we have to create the `IndexShard` while applying the cluster state, because the shard remains in state `INITIALIZING`, and therefore unused, while it coordinates its own recovery.

With this commit we try and acquire the shard lock during cluster state application, but do not wait if the lock is unavailable. Instead, we schedule a retry (also executed on the cluster state applier thread) and proceed with the rest of the cluster state application process.

Relates elastic#24530
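As a rough illustration of the approach described above, here is a minimal, self-contained sketch (not the PR's actual code) of trying a shard lock with zero wait and scheduling a retry instead of blocking the applier thread or failing the allocation. The class, field, and method names are hypothetical stand-ins for the real `IndicesClusterStateService` machinery.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class NonBlockingShardCreation {
    private final ScheduledExecutorService applierThread = Executors.newSingleThreadScheduledExecutor();
    private final ReentrantLock shardLock = new ReentrantLock(); // stand-in for the per-shard lock
    private final long retryIntervalMillis = 5_000;              // mirrors the 5s retry interval discussed below

    void createShardWhenLockAvailable(String shardId) {
        if (shardLock.tryLock()) { // zero-wait attempt, analogous to shardLock(shardId, "starting shard", 0L)
            try {
                System.out.println("creating IndexShard for " + shardId);
            } finally {
                shardLock.unlock();
            }
        } else {
            // The holder (e.g. the previous incarnation of the shard) will release the lock
            // eventually, so schedule another attempt rather than failing the allocation.
            applierThread.schedule(() -> createShardWhenLockAvailable(shardId), retryIntervalMillis, TimeUnit.MILLISECONDS);
        }
    }
}
```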
It's worth considering whether to backport this to 8.7 too.
Did an initial read and left a few comments. I think we should still bound the wait time to ensure that for extraordinary long waits, we still progress on other recoveries.
```diff
@@ -434,7 +433,7 @@ public synchronized IndexShard createShard(
         ShardLock lock = null;
         eventListener.beforeIndexShardCreated(routing, indexSettings);
         try {
-            lock = nodeEnv.shardLock(shardId, "starting shard", TimeUnit.SECONDS.toMillis(5));
+            lock = nodeEnv.shardLock(shardId, "starting shard", 0L);
```
Can we ensure that `shardLock` does not throw an exception due to an interrupted exception somehow? Mostly it makes it a lot easier to reason about; I do not think there is a problem in it as is.
I don't see an easy way to do this, at least not without reworking the whole shard-lock mechanism. Which I'd love to do at some point, but that's a task for another day. I added a little extra protection in b722143 anyway.
```diff
@@ -86,6 +89,12 @@
 public class IndicesClusterStateService extends AbstractLifecycleComponent implements ClusterStateApplier {
     private static final Logger logger = LogManager.getLogger(IndicesClusterStateService.class);
 
+    public static final Setting<TimeValue> SHARD_LOCK_RETRY_INTERVAL_SETTING = Setting.timeSetting(
+        "indices.store.shard_lock_retry_interval",
+        TimeValue.timeValueSeconds(5),
```
This seems a bit high to me, considering that this was the expected max wait time in the past. I'd suggest 1s instead.
Ok, I did that in 1379d8c and made it so that we don't emit a `WARN` log on every retry.
```java
                    failAndRemoveShard(shardRouting, true, "failed to create shard", e, state);
                    listener.onResponse(true);
                } catch (ShardLockObtainFailedException e) {
                    logger.warn("shard lock currently unavailable for [{}], retrying in [{}]", shardRouting, shardLockRetryInterval);
```
Can we include information on the wait-time so far in the message?
Not easily as things stand, because we stop retrying (and start a new retry loop) on each cluster state update. But possibly; I think that to make the wait bounded we'll need to know this information.
```java
                } catch (ShardLockObtainFailedException e) {
                    logger.warn("shard lock currently unavailable for [{}], retrying in [{}]", shardRouting, shardLockRetryInterval);
                    // TODO could we instead subscribe to the shard lock and trigger the retry exactly when it is released rather than polling?
                    threadPool.scheduleUnlessShuttingDown(
```
Can we make the time we wait still bounded and configurable? Something like a minute by default seems appropriate to me: it greatly reduces the risk of seeing shard lock obtain exceptions on nodes that are simply under load, without waiting so long that we delay the allocation of other shards on the node.
I'm still thinking about how to do this. It's easy enough if there are no cluster state updates for a minute, but if we're constantly updating the cluster state then this implementation will restart the retry loop each time.
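To make the concern concrete, here is a hedged sketch (not the PR's code) of one way to bound the total wait: remember when the first lock failure happened for each shard, so the deadline survives the retry loop being restarted by new cluster states. All names here are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BoundedShardLockRetry {
    private final Map<String, Long> firstFailureMillis = new ConcurrentHashMap<>();
    private final long maxWaitMillis = 60_000; // e.g. the one-minute bound suggested above

    // Returns true while we should keep scheduling retries for this shard.
    boolean shouldKeepRetrying(String shardId) {
        long start = firstFailureMillis.computeIfAbsent(shardId, id -> System.currentTimeMillis());
        return System.currentTimeMillis() - start < maxWaitMillis;
    }

    // Clear the bookkeeping once the shard is created (or permanently failed).
    void onShardCreationFinished(String shardId) {
        firstFailureMillis.remove(shardId);
    }
}
```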
```java
        final var indexService = indicesService.indexService(shardRouting.index());
        if (indexService == null) {
            final var message = "index service unexpectedly found for " + shardRouting;
```
```diff
-            final var message = "index service unexpectedly found for " + shardRouting;
+            final var message = "index service unexpectedly not found for " + shardRouting;
```
Done in 52c7d8f.
```java
        updateIndexSettings(Settings.builder().putNull(IndexMetadata.INDEX_ROUTING_EXCLUDE_GROUP_PREFIX + "._name"), indexName);
        ensureYellow(indexName);
        assertBusy(mockLogAppender::assertAllExpectationsMatched);
    }
```
Can we assert that the index is not green? I know timing may not allow it to fail, but still seems good to assert.
Yep, see 92f31b0
Hi @DaveCTurner, I've created a changelog YAML for you.
Pinging @elastic/es-distributed (Team:Distributed)
```java
    protected Settings nodeSettings(int nodeOrdinal, Settings otherSettings) {
        return Settings.builder()
            .put(super.nodeSettings(nodeOrdinal, otherSettings))
            .put(IndicesClusterStateService.SHARD_LOCK_RETRY_INTERVAL_SETTING.getKey(), TimeValue.timeValueMillis(10))
```
Is this low timeout what prevents acquiring the shard lock?
No, the test thread grabs the lock before creating the last replica here:
ah right! thanks for the clarification
```java
            }
            logger.log(
                (iteration + 25) % 30 == 0 ? Level.WARN : Level.DEBUG,
                "shard lock currently unavailable for [{}], retrying in [{}]: [{}]",
```
I think it would help to have the state UUID/version of the cluster state here for debugging.
++ see e619a47.
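For what it's worth, the `(iteration + 25) % 30 == 0` check in the hunk above is what rate-limits the `WARN` line: assuming `iteration` counts retries from zero and the default 5s retry interval, the first `WARN` fires on iteration 5 (roughly 25-30s of waiting) and then once every 30 iterations (about every 2.5 minutes), with all other attempts logged at `DEBUG`. A quick check of that arithmetic:

```java
// Prints the iterations on which a WARN would be emitted: 5, 35, 65, 95, ...
public class WarnRateLimitCheck {
    public static void main(String[] args) {
        for (int iteration = 0; iteration < 100; iteration++) {
            if ((iteration + 25) % 30 == 0) {
                System.out.println("WARN at iteration " + iteration);
            }
        }
    }
}
```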
```diff
@@ -434,7 +433,7 @@ public synchronized IndexShard createShard(
         ShardLock lock = null;
         eventListener.beforeIndexShardCreated(routing, indexSettings);
         try {
-            lock = nodeEnv.shardLock(shardId, "starting shard", TimeUnit.SECONDS.toMillis(5));
+            lock = nodeEnv.shardLock(shardId, "starting shard", 0L);
```
We can use `shardLock(ShardId id, final String details)` here. I think a comment describing that the shard creation is retried would be useful too.
++ see e4e23f0
server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java
Removes some unnecessary parameters from internal methods. Extracted from #94545 to reduce noise.
This looks good to me. Can we provoke a few runs of the full test suite to ensure at least all the disruptive tests get a chance to run a few times?
```diff
@@ -662,6 +832,8 @@ private static DiscoveryNode findSourceNodeForPeerRecovery(RoutingTable routingT
         return sourceNode;
     }
 
+    private record PendingShardCreation(String clusterStateUuid, long startTimeMillis) {}
```
nit: I think we (and Java) always use UUID, not Uuid.
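A hypothetical sketch of how a record like `PendingShardCreation` above lets a later retry report the total wait so far (the question raised earlier in this review), regardless of how many cluster state updates have restarted the retry loop. The surrounding code is illustrative only.

```java
class PendingShardCreationExample {
    // Mirrors the record added in the hunk above.
    record PendingShardCreation(String clusterStateUuid, long startTimeMillis) {}

    public static void main(String[] args) {
        // Pretend the first lock failure happened 30s ago under some earlier cluster state.
        var pending = new PendingShardCreation("some-cluster-state-uuid", System.currentTimeMillis() - 30_000);
        long waitedMillis = System.currentTimeMillis() - pending.startTimeMillis();
        System.out.println("shard lock still unavailable after ~" + waitedMillis + "ms (since cluster state " + pending.clusterStateUuid() + ")");
    }
}
```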
```java
                        pendingShardCreations.remove(shardId, pendingShardCreation);
                    })
                );
            } catch (Exception e) {
```
Do we need to remove from `pendingShardCreations` here? I wonder if we should catch Exception in `createShardWhenLockAvailable` and let it call `listener.onFailure`? I prefer to have methods that either invoke a listener or throw, not both.
Ah yes good catch. Changed to catching all exceptions within the method in d6188af.
LGTM
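As a side note, here is a minimal illustration of the "invoke the listener or throw, never both" convention agreed on above. `Listener` is a hypothetical stand-in for `ActionListener`, and the method body is made up for the example.

```java
class ListenerDiscipline {
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    // All outcomes, including unexpected exceptions, are reported through the listener,
    // so callers never need their own catch block (or cleanup such as removing a
    // pendingShardCreations entry) around this call.
    void createShardWhenLockAvailable(String shardId, Listener<Boolean> listener) {
        try {
            if (shardId.isEmpty()) {
                throw new IllegalArgumentException("no shard id");
            }
            listener.onResponse(true); // shard created (stubbed out here)
        } catch (Exception e) {
            listener.onFailure(e);
        }
    }
}
```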
```java
        );
        ensureGreen(indexName);

        final var shardId = client().admin()
```
We can save a few lines if you want (here and in the other test too):

```diff
-        final var shardId = client().admin()
+        final var shardId = new ShardId(resolveIndex(index), 0)
```
TIL, thanks 😄
https://gradle-enterprise.elastic.co/s/4t6cp2obsipzy was a test failure that looked to be related to this change (at least it pertains to shard locks and node locks) but I can reproduce it in

@elasticmachine please run elasticsearch-ci/part-1
LGTM.
> I've had `./gradlew :server:test :server:internalclustertest` running on my CI machine in a loop for a few hours without any problems. Do you want a few full CI runs too or is that enough?

Thanks, that seems adequate.
> It's worth considering whether to backport this to 8.7 too.

I'd prefer to let it burn in for a couple of days first, but otherwise it does seem like it could be worth backporting.
Thanks all!
Today when applying a new cluster state we block the cluster applier thread for up to 5s while waiting to acquire each shard lock. Failure to acquire the shard lock is treated as an allocation failure, so after 5 retries (by default) we give up on the allocation.

The shard lock may be held by some other actor, typically the previous incarnation of the shard which is still shutting down, but it will eventually be released. Yet, 5 retries of 5s each is sometimes not enough time to wait. Knowing that the shard lock will eventually be released, we can retry much more tenaciously.

Moreover there's no reason why we have to create the `IndexShard` while applying the cluster state, because the shard remains in state `INITIALIZING`, and therefore unused, while it coordinates its own recovery.

With this commit we try and acquire the shard lock during cluster state application, but do not wait if the lock is unavailable. Instead, we schedule a retry (also executed on the cluster state applier thread) and proceed with the rest of the cluster state application process.

Relates #24530

Backport of #94545 and #94623 (and a little bit of #94417) to 8.7
Backported to 8.7 in #95121
Today when applying a new cluster state we block the cluster applier thread for up to 5s while waiting to acquire each shard lock. Failure to acquire the shard lock is treated as an allocation failure, so after 5 retries (by default) we give up on the allocation.
The shard lock may be held by some other actor, typically the previous incarnation of the shard which is still shutting down, but it will eventually be released. Yet, 5 retries of 5s each is sometimes not enough time to wait. Knowing that the shard lock will eventually be released, we can retry much more tenaciously.
Moreover there's no reason why we have to create the `IndexShard` while applying the cluster state, because the shard remains in state `INITIALIZING`, and therefore unused, while it coordinates its own recovery.

With this commit we try and acquire the shard lock during cluster state application, but do not wait if the lock is unavailable. Instead, we schedule a retry (also executed on the cluster state applier thread) and proceed with the rest of the cluster state application process.
Relates #24530