
[CI] ActivateWatchTests fails occasionally on master not able to delete index #66495

Closed · breskeby opened this issue Dec 17, 2020 · 13 comments · Fixed by #67554 or #82157
Labels: :Data Management/Watcher · Team:Data Management (meta label for data/management team) · >test-failure (Triaged test failures from CI)
@breskeby (Contributor)

Seeing this fail occasionally (4 out of ~1200 runs in 7 days). I wonder if this points to a specific issue with the test or with the infrastructure, as the error seems to indicate some sort of permission issue?

Build scan:

https://gradle-enterprise.elastic.co/s/usis5fhbuqx6y/tests/:x-pack:plugin:watcher:internalClusterTest/org.elasticsearch.xpack.watcher.transport.action.activate.ActivateWatchTests/testDeactivateAndActivate#1

Repro line:

./gradlew ':x-pack:plugin:watcher:internalClusterTest' --tests "org.elasticsearch.xpack.watcher.transport.action.activate.ActivateWatchTests.testDeactivateAndActivate" -Dtests.seed=FE8DAE73AB0A616A -Dtests.security.manager=true -Dtests.locale=ru-RU -Dtests.timezone=US/Central -Druntime.java=11

Reproduces locally?:

no.

Applicable branches:

master

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?search.relativeStartTime=P7D&search.timeZoneId=Europe/Berlin&tests.container=org.elasticsearch.xpack.watcher.transport.action.activate.ActivateWatchTests&tests.sortField=FAILED&tests.test=testDeactivateAndActivate&tests.unstableOnly=true
Failure excerpt:

java.lang.IllegalArgumentException: index [.ds-.watcher-history-14-2020.12.17-000001] is the write index for data stream [.watcher-history-14] and cannot be deleted
@breskeby added the >test-failure (Triaged test failures from CI) label on Dec 17, 2020
@elasticmachine added the Team:Data Management (meta label for data/management team) label on Dec 18, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-core-features (Team:Core/Features)

@martijnvg self-assigned this on Jan 6, 2021
@martijnvg (Member)

During teardown, the test attempts to delete the write index of a data stream. This is prohibited: a data stream's write index must always exist (other backing indices can be deleted). The data stream should be removed in its entirety, which is what is supposed to happen via AbstractWatcherIntegrationTestCase#_cleanup(...), but somehow the test infra tries to delete just the write index. Not sure why this happens yet.
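For reference, a minimal sketch of the distinction, from inside an internal-cluster test (assuming the 7.x x-pack DeleteDataStreamAction class; a sketch, not the actual _cleanup code):

    import org.elasticsearch.xpack.core.action.DeleteDataStreamAction;

    // Deleting a backing write index directly, e.g.
    //   client().admin().indices().prepareDelete(".ds-.watcher-history-14-2020.12.17-000001").get();
    // is rejected by MetadataDeleteIndexService with the IllegalArgumentException
    // shown in the failure excerpt. Deleting the data stream itself is allowed
    // and removes every backing index, write index included:
    DeleteDataStreamAction.Request request =
        new DeleteDataStreamAction.Request(new String[] { ".watcher-history-*" });
    client().execute(DeleteDataStreamAction.INSTANCE, request).actionGet();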

@martijnvg (Member)

It looks like the watcher history data stream gets created after watcher has been stopped and the previous data stream has been removed:

1> [2020-12-17T03:30:00,522][INFO ][o.e.x.w.WatcherLifeCycleService] [node_s0] watcher has stopped
1> [2020-12-17T03:30:00,525][INFO ][o.e.x.w.t.a.a.ActivateWatchTests] [testDeactivateAndActivate] waiting to stop watcher, current states [Tuple [v1=node_s3 (0), v2=STOPPED], Tuple [v1=node_s2 (0), v2=STOPPED], Tuple [v1=node_s0 (0), v2=STOPPED], Tuple [v1=node_s1 (0), v2=STOPPED]]
1> [2020-12-17T03:30:00,527][INFO ][o.e.x.d.a.DeleteDataStreamTransportAction] [node_s0] removing data stream [.watcher-history-14]
1> [2020-12-17T03:30:00,527][INFO ][o.e.c.m.MetadataDeleteIndexService] [node_s0] [.ds-.watcher-history-14-2020.12.17-000001/S9MRwpX-SmWEaSLkl4DrNw] deleting index
1> [2020-12-17T03:30:00,608][INFO ][o.e.x.w.t.a.a.ActivateWatchTests] [testDeactivateAndActivate] [ActivateWatchTests#testDeactivateAndActivate]: cleaning up after test
1> [2020-12-17T03:30:00,622][INFO ][o.e.c.m.MetadataCreateIndexService] [node_s0] [.ds-.watcher-history-14-2020.12.17-000001] creating index, cause [initialize_data_stream], templates [.watch-history-14], shards [1]/[0]
1> [2020-12-17T03:30:00,624][INFO ][o.e.c.m.MetadataCreateDataStreamService] [node_s0] adding data stream [.watcher-history-14] with write index [.ds-.watcher-history-14-2020.12.17-000001] and backing indices []
1> [2020-12-17T03:30:00,720][INFO ][o.e.x.i.IndexLifecycleTransition] [node_s0] moving index [.ds-.watcher-history-14-2020.12.17-000001] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [watch-history-ilm-policy]
1> [2020-12-17T03:30:00,757][INFO ][o.e.c.r.a.AllocationService] [node_s0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ds-.watcher-history-14-2020.12.17-000001][0]]])." previous.health="YELLOW" reason="shards started [[.ds-.watcher-history-14-2020.12.17-000001][0]]"
1> [2020-12-17T03:30:00,837][INFO ][o.e.c.m.MetadataMappingService] [node_s0] [.ds-.watcher-history-14-2020.12.17-000001/1Ysc1vvwSZ2mHgNjxAGjgg] update_mapping [_doc]

I suspect this happens because write requests for watch history are still in flight after watcher has been stopped.
The exception is most likely thrown when the test cluster is wiped (which deletes all indices), which happens after AbstractWatcherIntegrationTestCase has removed the watcher history data stream and stopped watcher.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jan 15, 2021
by waiting for all tasks to complete before deleting watcher history data stream.

Closes elastic#66495
martijnvg added a commit that referenced this issue Jan 15, 2021
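A hedged sketch of what that "wait for all tasks" guard can look like in an internal-cluster test; the indices:data/write/bulk* action filter and the placement are assumptions, not a verbatim copy of #67554:

    import org.elasticsearch.action.admin.cluster.node.tasks.list.ListTasksResponse;

    import static org.hamcrest.Matchers.empty;

    // Before deleting the watcher history data stream, poll the tasks API until
    // no bulk writes (e.g. straggling watch-history documents) are in flight,
    // so a late history write cannot recreate the data stream mid-cleanup.
    assertBusy(() -> {
        ListTasksResponse tasks = client().admin().cluster().prepareListTasks()
            .setActions("indices:data/write/bulk*") // assumed filter for history writes
            .get();
        assertThat(tasks.getTasks(), empty());
    });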
@hendrikmuhs

I see the same stack trace in a failure today:

https://gradle-enterprise.elastic.co/s/5hq6kmhcgi53i/console-log?anchor=7304

@hendrikmuhs reopened this on Feb 11, 2021
@henningandersen (Contributor)

henningandersen commented Mar 30, 2021

This happened in a PR build here:
https://gradle-enterprise.elastic.co/s/gwj4needgfl7s

I'm pretty sure it is not related to my changes, since the only change I made was to merge in master.

@pgomulka (Contributor)

Another failure, on 6.8:
https://gradle-enterprise.elastic.co/s/3sfkpghtsfryc

./gradlew ':x-pack:plugin:watcher:unitTest' -Dtests.seed=12EADBE525B1E30A -Dtests.class=org.elasticsearch.xpack.watcher.transport.action.activate.ActivateWatchTests -Dtests.method="testDeactivateAndActivate" -Dtests.security.manager=true -Dtests.locale=nl -Dtests.timezone=Europe/Saratov -Dcompiler.java=11 -Druntime.java=8

Does not reproduce locally.

@dimitris-athanasiou (Contributor)

And another on master: https://gradle-enterprise.elastic.co/s/haa4hv337tiaq

Also, this seems to be the same issue as #61538. I'll close that one, as this one has had more discussion.

@ywangd (Member)

ywangd commented Jul 19, 2021

Another one on master https://gradle-enterprise.elastic.co/s/hy5u6idtonkea

@astefan (Contributor)

astefan commented Sep 3, 2021

Happened again today on master: https://gradle-enterprise.elastic.co/s/t3me7fp2rks7w

REPRODUCE WITH: ./gradlew ':x-pack:plugin:watcher:internalClusterTest' --tests "org.elasticsearch.xpack.watcher.transport.action.activate.ActivateWatchTests.testDeactivateAndActivate" -Dtests.seed=C7DEBD15F3F219A9 -Dtests.locale=bg-BG -Dtests.timezone=Asia/Omsk -Druntime.java=11

org.elasticsearch.xpack.watcher.transport.action.activate.ActivateWatchTests > testLoadWatchWithoutAState FAILED
    java.lang.IllegalArgumentException: index [.ds-.watcher-history-14-2021.09.03-000001] is the write index for data stream [.watcher-history-14] and cannot be deleted
        at __randomizedtesting.SeedInfo.seed([C7DEBD15F3F219A9:300CDB785EC3DD55]:0)
        at org.elasticsearch.cluster.metadata.MetadataDeleteIndexService.deleteIndices(MetadataDeleteIndexService.java:85)
        at org.elasticsearch.cluster.metadata.MetadataDeleteIndexService$1.execute(MetadataDeleteIndexService.java:67)
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:48)
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:745)
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:361)
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:224)
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:161)
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139)
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:259)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:222)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

@benwtrent (Member)

Happened again, on the main branch: https://gradle-enterprise.elastic.co/s/y6r4m2tn4ogo2

Trace:

org.elasticsearch.xpack.watcher.transport.action.activate.ActivateWatchTests > testDeactivateAndActivate FAILED
    java.lang.IllegalArgumentException: index [.ds-.watcher-history-14-2021.11.29-000001] is the write index for data stream [.watcher-history-14] and cannot be deleted
        at __randomizedtesting.SeedInfo.seed([97D850CBD028BB52:378028361F716D2E]:0)
        at org.elasticsearch.cluster.metadata.MetadataDeleteIndexService.deleteIndices(MetadataDeleteIndexService.java:90)
        at org.elasticsearch.cluster.metadata.MetadataDeleteIndexService$1.execute(MetadataDeleteIndexService.java:68)
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:51)
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:828)
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:395)
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:235)
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:167)
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:146)
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:202)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:712)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:262)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:225)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

@stu-elastic (Contributor)

stu-elastic commented Dec 2, 2021

Another instance: https://gradle-enterprise.elastic.co/s/zmyadhq4xjppw, same trace.

@jkakavas (Member)

Happened again: https://gradle-enterprise.elastic.co/s/e5dxrbejgnibk. I'll mute this on master.

jakelandis added a commit that referenced this issue Jan 4, 2022
This commit attempts to fix a set of Watcher tests that can
fail due to unexpected execution of Watches after Watcher has been stopped.

The theory here is that a Watch can be queued but not fully executed
when Watcher is shut down; the test does some cleanup, then the
queued Watch finishes execution and causes some additional cleanup
to fail.

The change here ensures that when Watcher is stopped from AbstractWatcherIntegrationTestCase,
it will also wait until there are no more current Watches executing.

closes #66495
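A hedged sketch of that wait, polling the watcher stats API with current watches included (the real change lives in AbstractWatcherIntegrationTestCase; exact class and accessor names here are assumptions, not a verbatim copy of the commit):

    import org.elasticsearch.xpack.core.watcher.transport.actions.stats.WatcherStatsAction;
    import org.elasticsearch.xpack.core.watcher.transport.actions.stats.WatcherStatsRequest;
    import org.elasticsearch.xpack.core.watcher.transport.actions.stats.WatcherStatsResponse;

    import static org.hamcrest.Matchers.empty;

    // After asking Watcher to stop: wait until no watch execution snapshot
    // remains on any node, i.e. no Watch is still mid-execution, before any
    // cleanup that deletes the history data stream.
    assertBusy(() -> {
        WatcherStatsRequest request = new WatcherStatsRequest();
        request.includeCurrentWatches(true);
        WatcherStatsResponse stats =
            client().execute(WatcherStatsAction.INSTANCE, request).actionGet();
        for (WatcherStatsResponse.Node node : stats.getNodes()) {
            assertThat(node.getSnapshots(), empty());
        }
    });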
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Jan 4, 2022
elasticsearchmachine pushed a commit that referenced this issue Jan 4, 2022
astefan pushed a commit to astefan/elasticsearch that referenced this issue Jan 7, 2022
astefan pushed a commit to astefan/elasticsearch that referenced this issue Jan 7, 2022