Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: DiscoveryDisruptionIT.testClusterJoinDespiteOfPublishingIssues fails #37685

Closed
alpar-t opened this issue Jan 22, 2019 · 3 comments
Closed
Assignees
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI

Comments

@alpar-t
Copy link
Contributor

alpar-t commented Jan 22, 2019

Example build failure

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+release-tests/362/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/5306/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/5331/console

Reproduction line

does not reproduce locally

./gradlew :server:integTest -Dtests.seed=82B5BC192ABF596 -Dtests.class=org.elasticsearch.discovery.DiscoveryDisruptionIT -Dtests.method="testElectMasterWithLatestVersion" -Dtests.security.manager=true -Dbuild.snapshot=false -Dtests.jvm.argline="-Dbuild.snapshot=false" -Dtests.locale=id -Dtests.timezone=America/Coral_Harbour -Dcompiler.java=11 -Druntime.java=8

Example relevant log:

ERROR   47.5s J7 | DiscoveryDisruptionIT.testClusterJoinDespiteOfPublishingIssues <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 	at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:847)
   > 	at org.elasticsearch.discovery.AbstractDisruptionTestCase.assertNoMaster(AbstractDisruptionTestCase.java:169)
   > 	at org.elasticsearch.discovery.AbstractDisruptionTestCase.assertNoMaster(AbstractDisruptionTestCase.java:161)
   > 	at org.elasticsearch.discovery.DiscoveryDisruptionIT.testClusterJoinDespiteOfPublishingIssues(DiscoveryDisruptionIT.java:148)
   > 	at java.lang.Thread.run(Thread.java:748)
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: node [node_td1] still has [{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}] as master expected null, but was:<{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}>
   > 		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:172)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 		... 40 moreThrowable #2: MasterNotDiscoveredException[NodeNotConnectedException[[node_tm0][127.0.0.1:33362] Node not connected]]; nested: NodeNotConnectedException[[node_tm0][127.0.0.1:33362] Node not connected];
   > 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:259)
   > 	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322)
   > 	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249)
   > 	at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:565)
   > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:662)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   > 	at java.lang.Thread.run(Thread.java:748)
   > Caused by: NodeNotConnectedException[[node_tm0][127.0.0.1:33362] Node not connected]
   > 	at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:156)
   > 	at org.elasticsearch.test.transport.StubbableConnectionManager.getConnection(StubbableConnectionManager.java:90)
   > 	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:557)
   > 	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:516)
   > 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:219)
   > 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:161)
   > 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:138)
   > 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:58)
   > 	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:141)
   > 	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:117)
   > 	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:60)
   > 	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83)
   > 	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72)
   > 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393)
   > 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:382)
   > 	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:676)
   > 	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:45)
   > 	at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:52)
   > 	at org.elasticsearch.test.InternalTestCluster.lambda$assertSameDocIdsOnShards$7(InternalTestCluster.java:1354)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:821)
   > 	at org.elasticsearch.test.InternalTestCluster.assertSameDocIdsOnShards(InternalTestCluster.java:1353)
   > 	at org.elasticsearch.discovery.AbstractDisruptionTestCase.beforeIndexDeletion(AbstractDisruptionTestCase.java:110)
   > 	at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:589)
   > 	at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2231)
   > 	... 1 more

Frequency

111 times in the last 7 days

@alpar-t alpar-t added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Jan 22, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

alpar-t added a commit that referenced this issue Jan 22, 2019
alpar-t added a commit that referenced this issue Jan 22, 2019
@DaveCTurner DaveCTurner self-assigned this Feb 1, 2019
@andrershov andrershov assigned andrershov and unassigned DaveCTurner Feb 6, 2019
@andrershov
Copy link
Contributor

 1> [2019-01-21T06:28:10,583][INFO ][o.e.d.DiscoveryDisruptionIT] [testClusterJoinDespiteOfPublishingIssues] blocking requests from non master [node_td1] to master [node_tm0]
  1> [2019-01-21T06:28:10,861][DEBUG][o.e.c.c.ElectionSchedulerFactory] [node_tm0] scheduleNextElection{gracePeriod=500ms, thisAttempt=1, maxDelayMillis=200, delayMillis=543, ElectionScheduler{attempt=2, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}} not starting election
  1> [2019-01-21T06:28:11,556][DEBUG][o.e.c.c.LeaderChecker    ] [node_td1] closed check scheduler received a response, doing nothing
  1> [2019-01-21T06:28:11,556][DEBUG][o.e.c.c.Coordinator      ] [node_td1] onLeaderFailure: coordinator becoming CANDIDATE in term 1 (was FOLLOWER, lastKnownLeader was [Optional[{node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}]])
  1> [2019-01-21T06:28:11,557][DEBUG][o.e.c.s.ClusterApplierService] [node_td1] processing [becoming candidate: onLeaderFailure]: execute
  1> [2019-01-21T06:28:11,556][DEBUG][o.e.t.TransportService   ] [node_td1] Exception while sending request, handler likely already notified due to timeout
  1> [2019-01-21T06:28:11,557][TRACE][o.e.c.s.ClusterApplierService] [node_td1] cluster state updated, source [becoming candidate: onLeaderFailure]
  1> cluster uuid: i_pigrp8QLC0MfshvlT2Zw
  1> version: 3
  1> state uuid: zHVMsMDgTfe9OBvpIAkWPQ
  1> from_diff: false
  1> meta data version: 1
  1>    coordination_metadata:
  1>       term: 1
  1>       last_committed_config: VotingConfiguration{udr7-Lv4R2iFJUccsZ5O-w}
  1>       last_accepted_config: VotingConfiguration{udr7-Lv4R2iFJUccsZ5O-w}
  1>       voting tombstones: []
  1> metadata customs:
  1>    index-graveyard: IndexGraveyard[[]]
  1> blocks: 
  1>    _global_:
  1>       2,no master, blocks WRITE,METADATA_WRITE
  1> nodes: 
  1>    {node_td1}{6BVujxeOTfSjASRNIJkAMg}{B51tSRN0QMuIil3rVFpgtQ}{127.0.0.1}{127.0.0.1:39664}, local
  1>    {node_tm0}{udr7-Lv4R2iFJUccsZ5O-w}{3ZEQc9CqRDWpnGqv-il6mg}{127.0.0.1}{127.0.0.1:33362}
  1> routing_table (version 1):
  1> routing_nodes:
  1> -----node_id[6BVujxeOTfSjASRNIJkAMg][V]
  1> ---- unassigned

After that constantly repeating exceptions from CoordinatorPeerFinder:

 1> [2019-01-21T06:28:11,557][DEBUG][o.e.c.c.C.CoordinatorPeerFinder] [node_td1] Peer{transportAddress=127.0.0.1:33362, discoveryNode=null, peersRequestInFlight=false} connection failed
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:33362] DISCONNECT: simulated
  1> 	at org.elasticsearch.test.transport.MockTransportService.lambda$addFailToSendNoConnectRule$3(MockTransportService.java:222) ~[framework-7.0.0.jar:7.0.0]

And just in 16 seconds, ClusterApplierService is actually applying the state:

1> [2019-01-21T06:28:27,634][DEBUG][o.e.c.s.ClusterApplierService] [node_td1] applying cluster state version 3
  1> [2019-01-21T06:28:27,634][DEBUG][o.e.c.s.ClusterApplierService] [node_td1] apply cluster state with version 3
  1> [2019-01-21T06:28:27,634][TRACE][o.e.c.s.ClusterApplierService] [node_td1] calling [org.elasticsearch.indices.cluster.IndicesClusterStateService@74982d6b] with change to version [3]

If we look at the code, the only thing happening between cluster state updated and applying cluster state is checking node availability from the cluster state.
In this case, the check should return immediately with an error, because connection error is simulated.
However, it takes 17 seconds and it could not be GC, because we see logs from CoordinatorPeerFinder.
It could be the management thread pool overflow and that's why task for checking node connections got queed.
However, this test was failing during infrastructure outage period, so I want to enable the test and closely monitor for failure occurences now when infrastructure is restored.

@andrershov
Copy link
Contributor

There are no tests failures after unmute, I'm closing the issue.

andrershov pushed a commit that referenced this issue Jul 31, 2019
…thLatestVersion (#38555)

See my comments for #37539 and #37685

(cherry picked from commit 038d4ab)
andrershov pushed a commit that referenced this issue Jul 31, 2019
…thLatestVersion (#38555)

See my comments for #37539 and #37685

(cherry picked from commit 038d4ab)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

4 participants