
[CI] ShrinkIndexIT testShrinkIndexPrimaryTerm failing #122974

Closed
elasticsearchmachine opened this issue Feb 19, 2025 · 9 comments · Fixed by #123010
Assignees
martijnvg

Labels
  • :Distributed Indexing/Engine: Anything around managing Lucene and the Translog in an open shard.
  • medium-risk: An open issue or test failure that is a medium risk to future releases
  • :StorageEngine/TSDB: You know, for Metrics
  • Team:Distributed Indexing: Meta label for Distributed Indexing team
  • Team:StorageEngine
  • >test-failure: Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

Build Scans:

Reproduction Line:

./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm" -Dtests.seed=F2C5A2F37104E280 -Dtests.locale=fy-Latn-NL -Dtests.timezone=America/Costa_Rica -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.Exception: Test abandoned because suite timeout was reached.

Issue Reasons:

  • [main] 2 failures in test testShrinkIndexPrimaryTerm (0.4% fail rate in 451 executions)
  • [main] 2 failures in step part1 (3.5% fail rate in 57 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (3.5% fail rate in 57 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >test-failure Triaged test failures from CI labels Feb 19, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test testShrinkIndexPrimaryTerm (0.4% fail rate in 451 executions)
  • [main] 2 failures in step part1 (3.5% fail rate in 57 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (3.5% fail rate in 57 executions)

Build Scans:

@elasticsearchmachine elasticsearchmachine added needs:risk Requires assignment of a risk label (low, medium, blocker) Team:Distributed Coordination Meta label for Distributed Coordination team labels Feb 19, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Member

ywangd commented Feb 20, 2025

The test tripped a new assertion added in #122374

Feb 19, 2025 3:16:10 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#165,elasticsearch[node_s1][write][T#2],5,TGRP-ShrinkIndexIT]
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([F2C5A2F37104E280]:0)
	at org.elasticsearch.index.engine.InternalEngine.performActionWithDirectoryReader(InternalEngine.java:3473)
	at org.elasticsearch.index.engine.InternalEngine.resolveDocVersion(InternalEngine.java:1024)
	at org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary(InternalEngine.java:1339)
	at org.elasticsearch.index.engine.InternalEngine.indexingStrategyForOperation(InternalEngine.java:1316)
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:1178)
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:1085)
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1011)
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:929)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:391)
	at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:248)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:316)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:164)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:82)
	at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1044)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1575)

It seems to be a race condition between failing a shard and indexing a document. testShrinkIndexPrimaryTerm does this in a loop: it fails the shard directly (shard.failShard(...)) and attempts to index a document into the shard immediately afterwards. Since shard closing is now asynchronous, I guess it might be possible that the store is closed while the indexing request is still in flight.
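
For illustration only, here is a stand-alone sketch of the suspected interleaving. This is hypothetical code, not the actual test or the shard/store internals: `AsyncCloseRace`, `failShard` and `index` are stand-ins that only model "shard failure schedules an asynchronous close" racing with an in-flight indexing call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Stand-alone illustration of the suspected race (not Elasticsearch code): shard failure
// only schedules the close, so the store can be released while an indexing call is in flight.
public class AsyncCloseRace {
    private volatile boolean storeOpen = true;
    private final ExecutorService closer = Executors.newSingleThreadExecutor();

    void failShard() {
        // With asynchronous shard closing, the store is released later, on another thread.
        closer.execute(() -> storeOpen = false);
    }

    void index(String doc) {
        // Mirrors the spirit of the assertion added in #122374: indexing assumes the store is open.
        assert storeOpen : "store closed while an indexing request was in flight";
        // ... version resolution and the Lucene write would happen here ...
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncCloseRace shard = new AsyncCloseRace();
        shard.failShard();            // close is now in flight on the closer thread
        shard.index("{\"foo\":1}");   // may or may not trip the assertion, depending on timing
        shard.closer.shutdown();
        shard.closer.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```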

Tagging this with both :Distributed Indexing/Engine and :StorageEngine/TSDB based on the issue (#122374) that introduced the assertion. Also marking this with medium-risk just to be on the safe side. Please feel free to re-evaluate.

@ywangd ywangd added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. :StorageEngine/TSDB You know, for Metrics medium-risk An open issue or test failure that is a medium risk to future releases and removed :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 20, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine elasticsearchmachine added Team:Distributed Indexing Meta label for Distributed Indexing team and removed Team:Distributed Coordination Meta label for Distributed Coordination team labels Feb 20, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@martijnvg martijnvg self-assigned this Feb 20, 2025
@martijnvg
Member

@tlrx I think we may need to invoke store.tryIncRef() / store.decRef() as part of the performActionWithDirectoryReader(...) call. Given this test failure, I don't think we can always rely on having a reference to the Store. This is different from the conclusion in this conversation. Maybe it is rare, otherwise we would have seen more failures, given that #122374 was merged roughly one week ago?

@martijnvg
Member

Maybe it is rare, otherwise we would have seen more failures, given that #122374 was merged roughly one week ago?

I think that as part of InternalEngine#index(...), the InternalEngine only ensures that it holds a ref to the engine via Engine#acquireEnsureOpenRef(), but this doesn't guarantee that it holds a reference to the store.
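
A minimal sketch of what such a guard could look like, assuming a tryIncRef()/decRef() pair with the usual ref-counting semantics. `RefCountedStore`, `EngineSketch` and `withDirectoryReader` are illustrative stand-ins, not the actual Store or InternalEngine code, and this is not the change that ended up in #123010.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative stand-in for a ref-counted store; the real Store exposes the same
// tryIncRef()/decRef() idea, but this is not its implementation.
class RefCountedStore {
    private final AtomicInteger refCount = new AtomicInteger(1); // 1 = the initial "open" reference

    boolean tryIncRef() {
        while (true) {
            int current = refCount.get();
            if (current <= 0) {
                return false; // already closed, no new references allowed
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            // last reference dropped: release Lucene directory resources here
        }
    }

    void close() {
        decRef(); // release the initial "open" reference
    }
}

class EngineSketch {
    private final RefCountedStore store = new RefCountedStore();

    // Roughly the shape of the proposed guard around performActionWithDirectoryReader(...):
    // hold a store reference for the duration of the read, and fail cleanly (instead of
    // tripping an assertion) if the store was already closed by a concurrent shard failure.
    <T> T withDirectoryReader(Supplier<T> action) {
        if (store.tryIncRef() == false) {
            throw new IllegalStateException("store is closed");
        }
        try {
            return action.get();
        } finally {
            store.decRef();
        }
    }
}
```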

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 20, 2025
…er(...)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes elastic#122974
@tlrx
Member

tlrx commented Feb 21, 2025

@martijnvg yes, I think that if the engine is failed concurrently, the store might already be closed by the time performActionWithDirectoryReader is invoked. I hadn't thought of a concurrent engine failure when we discussed it last week, sorry.

@martijnvg
Member

@tlrx No worries, and thanks for explaining the scenario where the engine doesn't hold a reference to the store.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 24, 2025
…er(...) (elastic#123010)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes elastic#122974

* Update docs/changelog/123010.yaml
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 24, 2025
…er(...) (elastic#123010)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes elastic#122974

* Update docs/changelog/123010.yaml
elasticsearchmachine pushed a commit that referenced this issue Feb 24, 2025
…er(...) (#123010) (#123241)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes #122974

* Update docs/changelog/123010.yaml
elasticsearchmachine pushed a commit that referenced this issue Feb 24, 2025
…er(...) (#123010) (#123242)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes #122974

* Update docs/changelog/123010.yaml