
FollowerFailOverIT.testFailOverOnFollower fails on CI #39467

Closed
tlrx opened this issue Feb 27, 2019 · 8 comments · Fixed by #39584
Assignees
Labels
blocker :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI v6.7.0 v7.0.0-rc1

Comments

@tlrx
Member

tlrx commented Feb 27, 2019

The test FollowerFailOverIT.testFailOverOnFollower failed today on 7.0:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+internalClusterTest/1446

It does not reproduce locally with:

 ./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=D87CDC85B26F55B8 \
  -Dtests.class=org.elasticsearch.xpack.ccr.FollowerFailOverIT \
  -Dtests.method="testFailOverOnFollower" \
  -Dtests.security.manager=true \
  -Dtests.locale=id-ID \
  -Dtests.timezone=Africa/Ndjamena \
  -Dcompiler.java=11 \
  -Druntime.java=8

It may be related to #35403 or #38633, but I haven't found the exact same errors, so I'm opening a new issue.

The log shows some timeout issues with the GlobalCheckpointListeners:

14:23:04   1> [2019-02-27T15:20:06,932][TRACE][o.e.i.s.IndexShard       ] [leader1] [leader_test_failover][0] global checkpoint listener timed out
14:23:04   1> java.util.concurrent.TimeoutException: 10ms
14:23:04   1> 	at org.elasticsearch.index.shard.GlobalCheckpointListeners.lambda$add$3(GlobalCheckpointListeners.java:141) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
14:23:04   1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
14:23:04   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 

And some mismatched documents:

14:23:54    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
14:23:54    > 		at org.elasticsearch.xpack.CcrIntegTestCase.lambda$assertIndexFullyReplicatedToFollower$4(CcrIntegTestCase.java:505)
14:23:54    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:856)
14:23:54    > 		... 39 more
14:23:54    > 	Suppressed: java.lang.AssertionError: mismatched documents [{0=[DocIdSeqNoAndTerm{id='133 seqNo=135 primaryTerm=1}, DocIdSeqNoAndTerm{id='132 seqNo=134 primaryTerm=1}, DocIdSeqNoAndTerm{id='131 seqNo=133 primaryTerm=1}]}]

I tried to isolate the relevant test log:
consoleText.txt

@tlrx tlrx added >test-failure Triaged test failures from CI v7.0.0 :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Feb 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@tlrx
Member Author

tlrx commented Feb 27, 2019

cc @dnhatn

@dnhatn dnhatn self-assigned this Feb 27, 2019
@dnhatn
Member

dnhatn commented Feb 27, 2019

Thanks @tlrx. I am on it today.

@dnhatn
Member

dnhatn commented Feb 28, 2019

Finally, I have an explanation for this failure. It happened as follows:

  1. We are restarting a node that has the following primary running.
[2019-02-27T15:18:39,567][INFO ][o.e.t.InternalTestCluster] [testFailOverOnFollower] Restarting node [followerd4] 
[2019-02-27T15:18:39,568][INFO ][o.e.n.Node               ] [testFailOverOnFollower] stopping ...
  2. The following primary has just completed a bulk request containing seq#133, seq#134, and seq#135, but it cannot send that bulk request to the following replica.
[2019-02-27T15:18:39,596][TRACE][o.e.x.c.a.b.TransportBulkShardOperationsAction] [followerd4] [[follower_test_failover][0]] op [indices:data/write/bulk_shard_operations[s]] completed on primary for request [BulkShardOperationsRequest{historyUUID=LAJy3qUiSLi5WipZHEQmsw, operations=3, maxSeqNoUpdates=99, shardId=[follower_test_failover][0], timeout=1m, index='follower_test_failover', waitForActiveShards=1}]
[2019-02-27T15:18:39,610][TRACE][o.e.x.c.a.b.TransportBulkShardOperationsAction] [followerd4] [[follower_test_failover][0]] failure while performing [indices:data/write/bulk_shard_operations[s]] on replica [follower_test_failover][0], node[F0IIukrhRKy9xJP_ZO6YiA], [R], s[STARTED], a[id=tnsEYc72RWKeUe8141Q7EQ], request [BulkShardOperationsRequest{historyUUID=LAJy3qUiSLi5WipZHEQmsw, operations=10, maxSeqNoUpdates=99, shardId=[follower_test_failover][0], timeout=1m, index='follower_test_failover', waitForActiveShards=DEFAULT}]
org.elasticsearch.transport.NodeDisconnectedException: [followerd1][127.0.0.1:43829][indices:data/write/bulk_shard_operations[s][r]] disconnected
[2019-02-27T15:18:39,612][WARN ][o.e.t.TransportService   ] [followerd4] Transport response handler not found of id [68]
[2019-02-27T15:18:39,615][WARN ][o.e.x.c.a.b.TransportBulkShardOperationsAction] [followerd4] [[follower_test_failover][0]] failed to perform indices:data/write/bulk_shard_operations[s] on replica [follower_test_failover][0], node[F0IIukrhRKy9xJP_ZO6YiA], [R], s[STARTED], a[id=tnsEYc72RWKeUe8141Q7EQ]
org.elasticsearch.transport.NodeDisconnectedException: [followerd1][127.0.0.1:43829][indices:data/write/bulk_shard_operations[s][r]] disconnected
  3. The following primary tries to reach out to the master node to fail the following replica, but it is unable to send the request.
[2019-02-27T15:18:39,621][WARN ][o.e.c.a.s.ShardStateAction] [followerd4] node closed while execution action [internal:cluster/shard/failure] for shard entry [shard id [[follower_test_failover][0]], allocation id [tnsEYc72RWKeUe8141Q7EQ], primary term [1], message [failed to perform indices:data/write/bulk_shard_operations[s] on replica [follower_test_failover][0], node[F0IIukrhRKy9xJP_ZO6YiA], [R], s[STARTED], a[id=tnsEYc72RWKeUe8141Q7EQ]], failure [NodeDisconnectedException[[followerd1][127.0.0.1:43829][indices:data/write/bulk_shard_operations[s][r]] disconnected]], markAsStale [true]]
  4. The following primary then responds SUCCESS to the ShardFollowTask.
[2019-02-27T15:18:39,663][TRACE][o.e.x.c.a.b.TransportBulkShardOperationsAction] [followerd4] operation succeeded. action [indices:data/write/bulk_shard_operations[s]],request [BulkShardOperationsRequest{historyUUID=LAJy3qUiSLi5WipZHEQmsw, operations=3, maxSeqNoUpdates=99, shardId=[follower_test_failover][0], timeout=1m, index='follower_test_failover', waitForActiveShards=1}]
  5. The following replica is promoted, but it will never receive seq#133, seq#134, and seq#135, because the ShardFollowTask won't resend these operations unless it is restarted (a simplified sketch of this gap follows below).
Throwable #1: java.lang.AssertionError: mismatched documents [{0=[DocIdSeqNoAndTerm{id='133 seqNo=135 primaryTerm=1}, DocIdSeqNoAndTerm{id='132 seqNo=134 primaryTerm=1}, DocIdSeqNoAndTerm{id='131 seqNo=133 primaryTerm=1}]}]
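
For reference, here is a minimal, hypothetical Java sketch of the gap described in step 5. The class, field, and method names are illustrative only and do not reflect the real ShardFollowNodeTask code; the sketch just shows why operations that were acked to the old primary are never re-requested after the promotion.

// Hypothetical model of the gap; NOT the real ShardFollowNodeTask.
public class FollowTaskGapSketch {

    // Highest sequence number the follow task has already requested from the leader.
    private long lastRequestedSeqNo = -1;
    // Global checkpoint of the follower shard (what the follower durably has).
    private long followerGlobalCheckpoint = -1;

    // The task only asks the leader for operations it has not requested yet.
    long nextFromSeqNo() {
        return lastRequestedSeqNo + 1;
    }

    void onBulkAcked(long maxSeqNoInBulk) {
        // The old following primary acked the bulk containing seq#133-135,
        // so the task considers them requested and applied ...
        lastRequestedSeqNo = Math.max(lastRequestedSeqNo, maxSeqNoInBulk);
    }

    void onReplicaPromoted(long promotedShardGlobalCheckpoint) {
        // ... but the promoted replica never received them. Its checkpoint is lower,
        // yet lastRequestedSeqNo is not reset, so seq#133-135 are never fetched again.
        followerGlobalCheckpoint = promotedShardGlobalCheckpoint;
    }

    public static void main(String[] args) {
        FollowTaskGapSketch task = new FollowTaskGapSketch();
        task.onBulkAcked(135);          // primary acks seq#133-135 while its node shuts down
        task.onReplicaPromoted(132);    // promoted replica only has up to seq#132
        System.out.println("next fetch starts at seq# " + task.nextFromSeqNo());            // 136
        System.out.println("follower global checkpoint: " + task.followerGlobalCheckpoint); // 132
        // Operations 133-135 fall into the gap and are never replicated again.
    }
}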

I think this is a blocker for 6.7 and 7.0.

/cc @bleskes @ywelsch @jasontedor @martijnvg

@dnhatn
Member

dnhatn commented Feb 28, 2019

I see two solutions here: one is to reset lastRequestedSeqNo to followerGlobalCheckpoint when the primary term on the follower increases. Another is to consider a bulk request with successful_shards < total_shards as a failure and retry.
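
A minimal, hypothetical sketch of the first option, assuming the task learns about the new primary term and global checkpoint from the follower's shard stats; the names mirror those used in this thread but not the actual ShardFollowNodeTask code or the eventual fix.

// Hypothetical sketch of "reset lastRequestedSeqNo on primary term increase".
public class PrimaryTermResetSketch {

    private long lastRequestedSeqNo = 135;       // already advanced past the lost operations
    private long followerGlobalCheckpoint = 132; // what the promoted follower primary actually has
    private long trackedPrimaryTerm = 1;

    void onFollowerShardStats(long primaryTerm, long globalCheckpoint) {
        followerGlobalCheckpoint = globalCheckpoint;
        if (primaryTerm > trackedPrimaryTerm) {
            // A replica was promoted: anything above the follower's global checkpoint
            // may have been lost on the old primary, so request from there again.
            lastRequestedSeqNo = followerGlobalCheckpoint;
            trackedPrimaryTerm = primaryTerm;
        }
    }

    public static void main(String[] args) {
        PrimaryTermResetSketch task = new PrimaryTermResetSketch();
        task.onFollowerShardStats(2, 132); // the promotion bumped the primary term to 2
        // The next fetch now starts at seq# 133, closing the gap from the failure above.
        System.out.println("next fetch starts at seq# " + (task.lastRequestedSeqNo + 1));
    }
}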

@martijnvg
Member

@dnhatn Thanks for the detailed explanation!

Another is to consider a bulk request with successful_shards < total_shards as a failure and retry.

The problem you described above only happens when a replica shard is promoted to primary. successful_shards < total_shards happens when we were unable to index into a replica shard, which I think occurs more often than a primary promotion. So retrying based on that condition would result in retrying too often, and most of the time it wouldn't be needed?

one is to reset lastRequestedSeqNo to followerGlobalCheckpoint when the primary term on the follower increases.

I think it makes sense to reset lastRequestedSeqNo to the followerGlobalCheckpoint in case of a primary promotion to avoid the scenario that you have described.

@bleskes
Contributor

bleskes commented Feb 28, 2019

The following primary then responds SUCCESS to the ShardFollowTask.

@dnhatn can you explain why a replicated write operation finishes successfully after not being able to fail a replica / mark it as stale?

@dnhatn
Member

dnhatn commented Mar 4, 2019

@dnhatn can you explain why a replicated write operation finishes successfully after not being able to fail a replica / mark it as stale?

Discussed with @bleskes on another channel. This happens because we ignore NodeClosedException, which is triggered when the ClusterService is being closed. I opened #39584 to propose the fix.

dnhatn added a commit that referenced this issue Mar 5, 2019
Today when a replicated write operation fails to execute on a replica,
the primary will reach out to the master to fail that replica (and mark
it stale). We then won't ack that request until the master removes the
failing replica; otherwise, we will lose the acked operation if the
failed replica is still in the in-sync set. However, if a node with the
primary is shutting down, we might ack such a request even though we are
unable to send a shard-failure request to the master. This happens
because we ignore NodeClosedException which is triggered when the
ClusterService is being closed.

Closes #39467
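
Below is a minimal, hypothetical Java sketch of the behavior change described in the commit message. The class, method, and parameter names are illustrative stand-ins, not the real ReplicationOperation or ShardStateAction code; the sketch only contrasts swallowing NodeClosedException with propagating it.

import java.util.function.Consumer;

// Illustrative sketch of the ack/fail decision; NOT Elasticsearch source code.
public class FailShardCallbackSketch {

    // Stand-in for the exception thrown while the node (and its ClusterService) shuts down.
    static class NodeClosedException extends RuntimeException {
        NodeClosedException(String message) { super(message); }
    }

    // Called when the primary could not ask the master to fail/mark-stale a replica.
    // Before the fix: NodeClosedException was swallowed and the write was still acked.
    // After the fix: it propagates like any other failure, so the write is not acked.
    static void onFailShardRequestFailed(Exception e,
                                         boolean treatNodeClosedAsFailure,
                                         Runnable ackWrite,
                                         Consumer<Exception> failWrite) {
        if (e instanceof NodeClosedException && treatNodeClosedAsFailure == false) {
            ackWrite.run();      // buggy behavior: ack although the replica was not marked stale
        } else {
            failWrite.accept(e); // fixed behavior: surface the failure to the caller
        }
    }

    public static void main(String[] args) {
        Exception e = new NodeClosedException("node is shutting down");
        // Old behavior: the exception is ignored and the write is acked anyway.
        onFailShardRequestFailed(e, false,
                () -> System.out.println("acked (bug)"),
                ex -> System.out.println("not acked (fix): " + ex.getMessage()));
        // Fixed behavior: the failure propagates and the write is not acked.
        onFailShardRequestFailed(e, true,
                () -> System.out.println("acked (bug)"),
                ex -> System.out.println("not acked (fix): " + ex.getMessage()));
    }
}

In this simplified model, propagating the failure means the ShardFollowTask never sees a spurious SUCCESS for operations the promoted replica is missing.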
dnhatn added a commit that referenced this issue Mar 5, 2019
dnhatn added a commit that referenced this issue Mar 6, 2019
dnhatn added a commit that referenced this issue Mar 6, 2019
dnhatn added a commit that referenced this issue Mar 6, 2019
dnhatn added a commit that referenced this issue Mar 7, 2019