FollowerFailOverIT.testFailOverOnFollower fails on CI #39467
Comments
Pinging @elastic/es-distributed
cc @dnhatn
Thanks @tlrx. I am on it today.
Finally, I have an explanation for this failure. It happened as follows:
I think this is a blocker for 6.7 and 7.0.
I see two solutions here: one is to reset lastRequestedSeqNo to followerGlobalCheckpoint when the primary term on the follower increases; the other is to consider a bulk request with successful_shards < total_shards a failure and retry it.
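For illustration, here is a minimal sketch of the two options; the class, field, and method names are hypothetical stand-ins, not the actual follower task implementation:

```java
// A minimal sketch of the two proposed options, assuming hypothetical field
// and method names (this is not the real ShardFollowNodeTask code).
final class FollowerCheckpointSketch {

    private long primaryTerm;
    private long lastRequestedSeqNo;
    private long followerGlobalCheckpoint;

    // Option one: when the follower's primary term increases (a replica was
    // promoted to primary), rewind lastRequestedSeqNo to the follower global
    // checkpoint so operations that may not have survived the promotion are
    // requested and replayed again.
    void onFollowerPrimaryTermBumped(long newPrimaryTerm) {
        if (newPrimaryTerm > primaryTerm) {
            primaryTerm = newPrimaryTerm;
            lastRequestedSeqNo = followerGlobalCheckpoint;
        }
    }

    // Option two: only treat a follower bulk write as done when every shard
    // copy applied it; otherwise retry the same range of operations.
    boolean shouldRetryBulk(int totalShards, int successfulShards) {
        return successfulShards < totalShards;
    }
}
```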
@dnhatn Thanks for the detailed explanation!
The problem you described above only happens when a replica shard is promoted to primary.
I think it makes sense to reset the lastRequestedSeqNo in that case.
@dnhatn can you explain why a replicated write operation finishes successfully after not being able to fail a replica / mark it as stale?
Discussed with @bleskes on another channel. This happens because we ignore NodeClosedException which is triggered when the ClusterService is being closed. I opened #39584 to propose the fix. |
Today when a replicated write operation fails to execute on a replica, the primary will reach out to the master to fail that replica (and mark it stale). We then won't ack that request until the master removes the failing replica; otherwise, we will lose the acked operation if the failed replica is still in the in-sync set. However, if a node with the primary is shutting down, we might ack such request even though we are unable to send a shard-failure request to the master. This happens because we ignore NodeClosedException which is triggered when the ClusterService is being closed. Closes #39467
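As a rough sketch of the idea behind that fix, assuming illustrative types and names rather than the real ReplicationOperation code: when the primary cannot report the failed replica to the master because the node is shutting down, surface that as a failure instead of acking the write.

```java
import java.util.function.Consumer;

// Hedged sketch only: AckListener, NodeClosedException, and the method below
// are stand-ins, not Elasticsearch's actual API.
final class FailReplicaSketch {

    interface AckListener {
        void onAcked();
        void onFailed(Exception cause);
    }

    static final class NodeClosedException extends RuntimeException {
        NodeClosedException(String message) { super(message); }
    }

    // Ask the master to fail / mark-stale the replica; ack only once that
    // succeeds, so an acked write can never exist solely on an in-sync copy
    // that actually missed it.
    static void failReplicaThenAck(Consumer<Runnable> sendShardFailureToMaster,
                                   AckListener listener) {
        try {
            sendShardFailureToMaster.accept(listener::onAcked);
        } catch (NodeClosedException e) {
            // Previously this exception was swallowed and the request was
            // still acked; the fix is to propagate it as a failure instead.
            listener.onFailed(e);
        }
    }
}
```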
The test FollowerFailOverIT.testFailOverOnFollower failed today on 7.0: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+internalClusterTest/1446
It does not reproduce locally with:
It may be related to #35403 or #38633, but I haven't found the exact same errors, so I'm opening a new issue.
The log shows some timeout issues with the GlobalCheckpointListeners, as well as some mismatched documents.
I tried to isolate the relevant test log:
consoleText.txt