
[CI] ShrinkIndexIT testShrinkIndexPrimaryTerm failing #122974

Closed
elasticsearchmachine opened this issue Feb 19, 2025 · 9 comments · Fixed by #123010
Assignees
martijnvg

Labels
  • :Distributed Indexing/Engine: Anything around managing Lucene and the Translog in an open shard.
  • medium-risk: An open issue or test failure that is a medium risk to future releases
  • :StorageEngine/TSDB: You know, for Metrics
  • Team:Distributed Indexing: Meta label for Distributed Indexing team
  • Team:StorageEngine
  • >test-failure: Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

Build Scans:

Reproduction Line:

./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm" -Dtests.seed=F2C5A2F37104E280 -Dtests.locale=fy-Latn-NL -Dtests.timezone=America/Costa_Rica -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.Exception: Test abandoned because suite timeout was reached.

Issue Reasons:

  • [main] 2 failures in test testShrinkIndexPrimaryTerm (0.4% fail rate in 451 executions)
  • [main] 2 failures in step part1 (3.5% fail rate in 57 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (3.5% fail rate in 57 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >test-failure Triaged test failures from CI labels Feb 19, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test testShrinkIndexPrimaryTerm (0.4% fail rate in 451 executions)
  • [main] 2 failures in step part1 (3.5% fail rate in 57 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (3.5% fail rate in 57 executions)

Build Scans:

@elasticsearchmachine elasticsearchmachine added needs:risk Requires assignment of a risk label (low, medium, blocker) Team:Distributed Coordination Meta label for Distributed Coordination team labels Feb 19, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Member

ywangd commented Feb 20, 2025

The test tripped a new assertion added in #122374

Feb 19, 2025 3:16:10 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#165,elasticsearch[node_s1][write][T#2],5,TGRP-ShrinkIndexIT]
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([F2C5A2F37104E280]:0)
	at org.elasticsearch.index.engine.InternalEngine.performActionWithDirectoryReader(InternalEngine.java:3473)
	at org.elasticsearch.index.engine.InternalEngine.resolveDocVersion(InternalEngine.java:1024)
	at org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary(InternalEngine.java:1339)
	at org.elasticsearch.index.engine.InternalEngine.indexingStrategyForOperation(InternalEngine.java:1316)
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:1178)
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:1085)
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1011)
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:929)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:391)
	at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:248)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:316)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:164)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:82)
	at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1044)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1575)

It seems to be a race condition between failing a shard and indexing a document. testShrinkIndexPrimaryTerm does this in a loop: it fails the shard directly (shard.failShard(...)) and attempts to index a document into the shard immediately afterwards. Since shard closing is now asynchronous, I guess it might be possible that the store is closed while the indexing request is still in flight.
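
For illustration only, here is a stand-alone sketch of the suspected interleaving. This is hypothetical code, not the actual test or the shard/store internals: `AsyncCloseRace`, `failShard` and `index` are stand-ins that only model "shard failure schedules an asynchronous close" racing with an in-flight indexing call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Stand-alone illustration of the suspected race (not Elasticsearch code): shard failure
// only schedules the close, so the store can be released while an indexing call is in flight.
public class AsyncCloseRace {
    private volatile boolean storeOpen = true;
    private final ExecutorService closer = Executors.newSingleThreadExecutor();

    void failShard() {
        // With asynchronous shard closing, the store is released later, on another thread.
        closer.execute(() -> storeOpen = false);
    }

    void index(String doc) {
        // Mirrors the spirit of the assertion added in #122374: indexing assumes the store is open.
        assert storeOpen : "store closed while an indexing request was in flight";
        // ... version resolution and the Lucene write would happen here ...
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncCloseRace shard = new AsyncCloseRace();
        shard.failShard();            // close is now in flight on the closer thread
        shard.index("{\"foo\":1}");   // may or may not trip the assertion, depending on timing
        shard.closer.shutdown();
        shard.closer.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```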

Tagging this with both :Distributed Indexing/Engine and :StorageEngine/TSDB based on the issue (#122374) that introduced the assertion. Also marking this with medium-risk just to be on the safe side. Please feel free to re-evaluate.

@ywangd ywangd added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. :StorageEngine/TSDB You know, for Metrics medium-risk An open issue or test failure that is a medium risk to future releases and removed :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 20, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine elasticsearchmachine added Team:Distributed Indexing Meta label for Distributed Indexing team and removed Team:Distributed Coordination Meta label for Distributed Coordination team labels Feb 20, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@martijnvg martijnvg self-assigned this Feb 20, 2025
@martijnvg
Member

@tlrx I think we may need to invoke store.tryIncRef() / store.decRef() as part of the performActionWithDirectoryReader(...) call. Given this test failure, I don't think we can always rely on having a reference to the Store. This is different from the conclusion in this conversation. Maybe it is rare, otherwise we would have seen more failures, given that #122374 was merged roughly one week ago?

@martijnvg
Member

Maybe it is rare, otherwise we would have seen more failures, given that #122374 was merged roughly one week ago?

I think that as part of InternalEngine#index(...), the InternalEngine only ensures that it holds a ref to the engine via Engine#acquireEnsureOpenRef(), but this doesn't guarantee that it holds a reference to the store.
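
A minimal sketch of what such a guard could look like, assuming a tryIncRef()/decRef() pair with the usual ref-counting semantics. `RefCountedStore`, `EngineSketch` and `withDirectoryReader` are illustrative stand-ins, not the actual Store or InternalEngine code, and this is not the change that ended up in #123010.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative stand-in for a ref-counted store; the real Store exposes the same
// tryIncRef()/decRef() idea, but this is not its implementation.
class RefCountedStore {
    private final AtomicInteger refCount = new AtomicInteger(1); // 1 = the initial "open" reference

    boolean tryIncRef() {
        while (true) {
            int current = refCount.get();
            if (current <= 0) {
                return false; // already closed, no new references allowed
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            // last reference dropped: release Lucene directory resources here
        }
    }

    void close() {
        decRef(); // release the initial "open" reference
    }
}

class EngineSketch {
    private final RefCountedStore store = new RefCountedStore();

    // Roughly the shape of the proposed guard around performActionWithDirectoryReader(...):
    // hold a store reference for the duration of the read, and fail cleanly (instead of
    // tripping an assertion) if the store was already closed by a concurrent shard failure.
    <T> T withDirectoryReader(Supplier<T> action) {
        if (store.tryIncRef() == false) {
            throw new IllegalStateException("store is closed");
        }
        try {
            return action.get();
        } finally {
            store.decRef();
        }
    }
}
```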

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 20, 2025
…er(...)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes elastic#122974
@tlrx
Member

tlrx commented Feb 21, 2025

@martijnvg yes, I think that if the engine is failed concurrently, the store might already be closed by the time performActionWithDirectoryReader is invoked. I hadn't thought of a concurrent engine failure when we discussed it last week, sorry.

@martijnvg
Member

@tlrx No worries, and thanks for explaining the scenario where the engine doesn't hold a reference to the store.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 24, 2025
…er(...) (elastic#123010)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes elastic#122974

* Update docs/changelog/123010.yaml
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 24, 2025
…er(...) (elastic#123010)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes elastic#122974

* Update docs/changelog/123010.yaml
elasticsearchmachine pushed a commit that referenced this issue Feb 24, 2025
…er(...) (#123010) (#123241)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes #122974

* Update docs/changelog/123010.yaml
elasticsearchmachine pushed a commit that referenced this issue Feb 24, 2025
…er(...) (#123010) (#123242)

This method gets called from `InternalEngine#resolveDocVersion(...)`, which gets called during indexing (via `InternalEngine.index(...)`).

When `InternalEngine.index(...)` gets invoked, the InternalEngine only ensures that it holds a ref to the engine via `Engine#acquireEnsureOpenRef()`, but this doesn't guarantee that it holds a reference to the store.

Closes #122974

* Update docs/changelog/123010.yaml