
FollowerFailOverIT.testFailOverOnFollower fails on CI #39467

Closed
tlrx opened this issue Feb 27, 2019 · 8 comments · Fixed by #39584
Assignees
Labels
blocker :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI v6.7.0 v7.0.0-rc1

Comments

@tlrx
Member

tlrx commented Feb 27, 2019

The test FollowerFailOverIT.testFailOverOnFollower failed today on 7.0:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+internalClusterTest/1446

It does not reproduce locally with:

 ./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=D87CDC85B26F55B8 \
  -Dtests.class=org.elasticsearch.xpack.ccr.FollowerFailOverIT \
  -Dtests.method="testFailOverOnFollower" \
  -Dtests.security.manager=true \
  -Dtests.locale=id-ID \
  -Dtests.timezone=Africa/Ndjamena \
  -Dcompiler.java=11 \
  -Druntime.java=8

It may be related to #35403 or #38633, but I haven't found the exact same errors, so I'm opening a new issue.

The log shows some timeout issues with the GlobalCheckpointListeners:

14:23:04   1> [2019-02-27T15:20:06,932][TRACE][o.e.i.s.IndexShard       ] [leader1] [leader_test_failover][0] global checkpoint listener timed out
14:23:04   1> java.util.concurrent.TimeoutException: 10ms
14:23:04   1> 	at org.elasticsearch.index.shard.GlobalCheckpointListeners.lambda$add$3(GlobalCheckpointListeners.java:141) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
14:23:04   1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 

And some mismatched documents:

14:23:54    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
14:23:54    > 		at org.elasticsearch.xpack.CcrIntegTestCase.lambda$assertIndexFullyReplicatedToFollower$4(CcrIntegTestCase.java:505)
14:23:54    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:856)
14:23:54    > 		... 39 more
14:23:54    > 	Suppressed: java.lang.AssertionError: mismatched documents [{0=[DocIdSeqNoAndTerm{id='133 seqNo=135 primaryTerm=1}, DocIdSeqNoAndTerm{id='132 seqNo=134 primaryTerm=1}, DocIdSeqNoAndTerm{id='131 seqNo=133 primaryTerm=1}]}]

I tried to isolate the relevant test log:
consoleText.txt

@tlrx tlrx added >test-failure Triaged test failures from CI v7.0.0 :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Feb 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@tlrx
Member Author

tlrx commented Feb 27, 2019

cc @dnhatn

@dnhatn dnhatn self-assigned this Feb 27, 2019
@dnhatn
Member

dnhatn commented Feb 27, 2019

Thanks @tlrx. I am on it today.

@dnhatn
Member

dnhatn commented Feb 28, 2019

Finally, I have an explanation for this failure. It happened as follows:

  1. We are restarting a node that has the following primary running.
[2019-02-27T15:18:39,567][INFO ][o.e.t.InternalTestCluster] [testFailOverOnFollower] Restarting node [followerd4] 
[2019-02-27T15:18:39,568][INFO ][o.e.n.Node               ] [testFailOverOnFollower] stopping ...
  2. The following primary has just completed a bulk request containing seq#133, seq#134, and seq#135, but it cannot send that bulk request to the following replica.
[2019-02-27T15:18:39,596][TRACE][o.e.x.c.a.b.TransportBulkShardOperationsAction] [followerd4] [[follower_test_failover][0]] op [indices:data/write/bulk_shard_operations[s]] completed on primary for request [BulkShardOperationsRequest{historyUUID=LAJy3qUiSLi5WipZHEQmsw, operations=3, maxSeqNoUpdates=99, shardId=[follower_test_failover][0], timeout=1m, index='follower_test_failover', waitForActiveShards=1}]
[2019-02-27T15:18:39,610][TRACE][o.e.x.c.a.b.TransportBulkShardOperationsAction] [followerd4] [[follower_test_failover][0]] failure while performing [indices:data/write/bulk_shard_operations[s]] on replica [follower_test_failover][0], node[F0IIukrhRKy9xJP_ZO6YiA], [R], s[STARTED], a[id=tnsEYc72RWKeUe8141Q7EQ], request [BulkShardOperationsRequest{historyUUID=LAJy3qUiSLi5WipZHEQmsw, operations=10, maxSeqNoUpdates=99, shardId=[follower_test_failover][0], timeout=1m, index='follower_test_failover', waitForActiveShards=DEFAULT}]
org.elasticsearch.transport.NodeDisconnectedException: [followerd1][127.0.0.1:43829][indices:data/write/bulk_shard_operations[s][r]] disconnected
[2019-02-27T15:18:39,612][WARN ][o.e.t.TransportService   ] [followerd4] Transport response handler not found of id [68]
[2019-02-27T15:18:39,615][WARN ][o.e.x.c.a.b.TransportBulkShardOperationsAction] [followerd4] [[follower_test_failover][0]] failed to perform indices:data/write/bulk_shard_operations[s] on replica [follower_test_failover][0], node[F0IIukrhRKy9xJP_ZO6YiA], [R], s[STARTED], a[id=tnsEYc72RWKeUe8141Q7EQ]
org.elasticsearch.transport.NodeDisconnectedException: [followerd1][127.0.0.1:43829][indices:data/write/bulk_shard_operations[s][r]] disconnected
  3. The following primary tries to reach out to the master node to fail the following replica, but it is unable to send the request.
[2019-02-27T15:18:39,621][WARN ][o.e.c.a.s.ShardStateAction] [followerd4] node closed while execution action [internal:cluster/shard/failure] for shard entry [shard id [[follower_test_failover][0]], allocation id [tnsEYc72RWKeUe8141Q7EQ], primary term [1], message [failed to perform indices:data/write/bulk_shard_operations[s] on replica [follower_test_failover][0], node[F0IIukrhRKy9xJP_ZO6YiA], [R], s[STARTED], a[id=tnsEYc72RWKeUe8141Q7EQ]], failure [NodeDisconnectedException[[followerd1][127.0.0.1:43829][indices:data/write/bulk_shard_operations[s][r]] disconnected]], markAsStale [true]]
  4. The following primary then responds SUCCESS to the ShardFollowTask.
[2019-02-27T15:18:39,663][TRACE][o.e.x.c.a.b.TransportBulkShardOperationsAction] [followerd4] operation succeeded. action [indices:data/write/bulk_shard_operations[s]],request [BulkShardOperationsRequest{historyUUID=LAJy3qUiSLi5WipZHEQmsw, operations=3, maxSeqNoUpdates=99, shardId=[follower_test_failover][0], timeout=1m, index='follower_test_failover', waitForActiveShards=1}]
  5. The following replica is promoted, but it will never receive seq#133, seq#134, and seq#135, because the ShardFollowTask won't resend these operations unless it is restarted (a simplified sketch of this gap follows below).
Throwable #1: java.lang.AssertionError: mismatched documents [{0=[DocIdSeqNoAndTerm{id='133 seqNo=135 primaryTerm=1}, DocIdSeqNoAndTerm{id='132 seqNo=134 primaryTerm=1}, DocIdSeqNoAndTerm{id='131 seqNo=133 primaryTerm=1}]}]
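
For reference, here is a minimal, hypothetical Java sketch of the gap described in step 5. The class, field, and method names are illustrative only and do not reflect the real ShardFollowNodeTask code; the sketch just shows why operations that were acked to the old primary are never re-requested after the promotion.

// Hypothetical model of the gap; NOT the real ShardFollowNodeTask.
public class FollowTaskGapSketch {

    // Highest sequence number the follow task has already requested from the leader.
    private long lastRequestedSeqNo = -1;
    // Global checkpoint of the follower shard (what the follower durably has).
    private long followerGlobalCheckpoint = -1;

    // The task only asks the leader for operations it has not requested yet.
    long nextFromSeqNo() {
        return lastRequestedSeqNo + 1;
    }

    void onBulkAcked(long maxSeqNoInBulk) {
        // The old following primary acked the bulk containing seq#133-135,
        // so the task considers them requested and applied ...
        lastRequestedSeqNo = Math.max(lastRequestedSeqNo, maxSeqNoInBulk);
    }

    void onReplicaPromoted(long promotedShardGlobalCheckpoint) {
        // ... but the promoted replica never received them. Its checkpoint is lower,
        // yet lastRequestedSeqNo is not reset, so seq#133-135 are never fetched again.
        followerGlobalCheckpoint = promotedShardGlobalCheckpoint;
    }

    public static void main(String[] args) {
        FollowTaskGapSketch task = new FollowTaskGapSketch();
        task.onBulkAcked(135);          // primary acks seq#133-135 while its node shuts down
        task.onReplicaPromoted(132);    // promoted replica only has up to seq#132
        System.out.println("next fetch starts at seq# " + task.nextFromSeqNo());            // 136
        System.out.println("follower global checkpoint: " + task.followerGlobalCheckpoint); // 132
        // Operations 133-135 fall into the gap and are never replicated again.
    }
}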

I think this is a blocker for 6.7 and 7.0.

/cc @bleskes @ywelsch @jasontedor @martijnvg

@dnhatn
Member

dnhatn commented Feb 28, 2019

I see two solutions here: one is to reset lastRequestedSeqNo to followerGlobalCheckpoint when the primary term on the follower increases. Another is to consider a bulk request with successful_shards < total_shards as a failure and retry.
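
A minimal, hypothetical sketch of the first option, assuming the task learns about the new primary term and global checkpoint from the follower's shard stats; the names mirror those used in this thread but not the actual ShardFollowNodeTask code or the eventual fix.

// Hypothetical sketch of "reset lastRequestedSeqNo on primary term increase".
public class PrimaryTermResetSketch {

    private long lastRequestedSeqNo = 135;       // already advanced past the lost operations
    private long followerGlobalCheckpoint = 132; // what the promoted follower primary actually has
    private long trackedPrimaryTerm = 1;

    void onFollowerShardStats(long primaryTerm, long globalCheckpoint) {
        followerGlobalCheckpoint = globalCheckpoint;
        if (primaryTerm > trackedPrimaryTerm) {
            // A replica was promoted: anything above the follower's global checkpoint
            // may have been lost on the old primary, so request from there again.
            lastRequestedSeqNo = followerGlobalCheckpoint;
            trackedPrimaryTerm = primaryTerm;
        }
    }

    public static void main(String[] args) {
        PrimaryTermResetSketch task = new PrimaryTermResetSketch();
        task.onFollowerShardStats(2, 132); // the promotion bumped the primary term to 2
        // The next fetch now starts at seq# 133, closing the gap from the failure above.
        System.out.println("next fetch starts at seq# " + (task.lastRequestedSeqNo + 1));
    }
}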

@martijnvg
Member

@dnhatn Thanks for the detailed explanation!

Another is to consider a bulk request with successful_shards < total_shards as a failure and retry.

The problem you described above only happens when a replica shard is promoted to primary. successful_shards < total_shards happens when we were unable to index into a replica shard, which I think occurs more often than a primary promotion. So retrying based on that condition would result in retrying too often, and most of the time it wouldn't be needed?

one is to reset lastRequestedSeqNo to followerGlobalCheckpoint when the primary term on the follower increases.

I think it makes sense to reset lastRequestedSeqNo to the followerGlobalCheckpoint in case of a primary promotion to avoid the scenario that you have described.

@bleskes
Contributor

bleskes commented Feb 28, 2019

The following primary then responds SUCCESS to the ShardFollowTask.

@dnhatn can you explain why a replicated write operation finishes successfully after not being able to fail a replica / mark it as stale?

@dnhatn
Member

dnhatn commented Mar 4, 2019

@dnhatn can you explain why a replicated write operation finishes successfully after not being able to fail a replica / mark it as stale?

Discussed with @bleskes on another channel. This happens because we ignore NodeClosedException, which is triggered when the ClusterService is being closed. I opened #39584 to propose the fix.

dnhatn added a commit that referenced this issue Mar 5, 2019
Today when a replicated write operation fails to execute on a replica,
the primary will reach out to the master to fail that replica (and mark
it stale). We then won't ack that request until the master removes the
failing replica; otherwise, we will lose the acked operation if the
failed replica is still in the in-sync set. However, if a node with the
primary is shutting down, we might ack such a request even though we are
unable to send a shard-failure request to the master. This happens
because we ignore NodeClosedException which is triggered when the
ClusterService is being closed.

Closes #39467
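
Below is a minimal, hypothetical Java sketch of the behavior change described in the commit message. The class, method, and parameter names are illustrative stand-ins, not the real ReplicationOperation or ShardStateAction code; the sketch only contrasts swallowing NodeClosedException with propagating it.

import java.util.function.Consumer;

// Illustrative sketch of the ack/fail decision; NOT Elasticsearch source code.
public class FailShardCallbackSketch {

    // Stand-in for the exception thrown while the node (and its ClusterService) shuts down.
    static class NodeClosedException extends RuntimeException {
        NodeClosedException(String message) { super(message); }
    }

    // Called when the primary could not ask the master to fail/mark-stale a replica.
    // Before the fix: NodeClosedException was swallowed and the write was still acked.
    // After the fix: it propagates like any other failure, so the write is not acked.
    static void onFailShardRequestFailed(Exception e,
                                         boolean treatNodeClosedAsFailure,
                                         Runnable ackWrite,
                                         Consumer<Exception> failWrite) {
        if (e instanceof NodeClosedException && treatNodeClosedAsFailure == false) {
            ackWrite.run();      // buggy behavior: ack although the replica was not marked stale
        } else {
            failWrite.accept(e); // fixed behavior: surface the failure to the caller
        }
    }

    public static void main(String[] args) {
        Exception e = new NodeClosedException("node is shutting down");
        // Old behavior: the exception is ignored and the write is acked anyway.
        onFailShardRequestFailed(e, false,
                () -> System.out.println("acked (bug)"),
                ex -> System.out.println("not acked (fix): " + ex.getMessage()));
        // Fixed behavior: the failure propagates and the write is not acked.
        onFailShardRequestFailed(e, true,
                () -> System.out.println("acked (bug)"),
                ex -> System.out.println("not acked (fix): " + ex.getMessage()));
    }
}

In this simplified model, propagating the failure means the ShardFollowTask never sees a spurious SUCCESS for operations the promoted replica is missing.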
dnhatn added a commit that referenced this issue Mar 5, 2019
dnhatn added a commit that referenced this issue Mar 6, 2019
dnhatn added a commit that referenced this issue Mar 6, 2019
dnhatn added a commit that referenced this issue Mar 6, 2019
dnhatn added a commit that referenced this issue Mar 7, 2019