[CI] DataNodeRequestSenderTests testDoNotRetryOnRequestLevelFailure failing #121966

elasticsearchmachine · 2025-02-06T21:40:15Z

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:esql:test" --tests "org.elasticsearch.xpack.esql.plugin.DataNodeRequestSenderTests.testDoNotRetryOnRequestLevelFailure" -Dtests.seed=5F94CB1AA392B574 -Dtests.locale=ga-Latn-IE -Dtests.timezone=Asia/Kathmandu -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: null

Issue Reasons:

[main] 3 failures in test testDoNotRetryOnRequestLevelFailure (2.2% fail rate in 137 executions)
[main] 2 failures in step part-3 (2.7% fail rate in 75 executions)
[main] 2 failures in pipeline elasticsearch-pull-request (2.7% fail rate in 75 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.


elasticsearchmachine · 2025-02-06T21:40:24Z

This has been muted on branch main

Mute Reasons:

[main] 3 failures in test testDoNotRetryOnRequestLevelFailure (2.2% fail rate in 137 executions)
[main] 2 failures in step part-3 (2.7% fail rate in 75 executions)
[main] 2 failures in pipeline elasticsearch-pull-request (2.7% fail rate in 75 executions)

Build Scans:

elasticsearchmachine · 2025-02-06T21:40:39Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 · 2025-02-06T21:47:57Z

@dnhatn I think this might be one of yours?

There are two issues in the current implementation:

1. We should use the list of shardIds from the request, rather than all targets, when removing failures for shards that have been successfully executed.
2. We should remove shardIds from the pending list once a failure is reported and abort execution at that point, as the results will be discarded.

Closes elastic#121966
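The two points above can be illustrated with a minimal sketch. This is not the code from the fix: `ShardTracker`, its methods, and the use of plain strings as shard IDs are all hypothetical names chosen for illustration, and the real `DataNodeRequestSender` may be structured quite differently.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for illustration; not the real DataNodeRequestSender.
final class ShardTracker {
    private final Set<String> pending = new HashSet<>();
    private final Map<String, Exception> failures = new HashMap<>();
    private boolean aborted = false;

    ShardTracker(Collection<String> allTargets) {
        pending.addAll(allTargets);
    }

    // Remember a shard-level failure from an earlier attempt.
    void recordShardFailure(String shardId, Exception e) {
        failures.put(shardId, e);
    }

    // Point 1: when a node request succeeds, clear failures only for the
    // shards listed in that request, not for every target of the query.
    void onRequestSucceeded(List<String> requestShardIds) {
        if (aborted) {
            return; // results are discarded once execution is aborted
        }
        for (String shardId : requestShardIds) {
            failures.remove(shardId);
            pending.remove(shardId);
        }
    }

    // Point 2: when a failure is reported, drop the request's shards from
    // the pending set and abort, since any results would be discarded.
    void onRequestFailed(List<String> requestShardIds, Exception e) {
        for (String shardId : requestShardIds) {
            pending.remove(shardId);
            failures.put(shardId, e);
        }
        aborted = true;
    }

    boolean isAborted() {
        return aborted;
    }

    Set<String> pending() {
        return Set.copyOf(pending);
    }

    Map<String, Exception> failures() {
        return Map.copyOf(failures);
    }
}
```

In this sketch, a success for shards `s1` and `s2` leaves an earlier failure recorded for `s3` untouched, which is the behavior point 1 asks for.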

* Retry ES|QL node requests on shard level failures (#120774)

Today, ES|QL fails fast on any failure. This PR introduces support for retrying within a cluster when data-node requests fail.

There are two types of failures that occur with data-node requests: entire request failures and individual shard failures. For individual shard failures, we can retry the next copies of the failing shards. For entire request failures, we can retry every shard in the node request if no pages have been received.

On the handling side, ES|QL executes against a batch of shards concurrently. Here, we need to track whether any pages have been produced. If pages have been produced, the entire request must fail. Otherwise, we can track the failed shards and send them back to the sender for retries.

There are two decisions around how quickly we should retry:

1. Should we notify the sender of failing shards immediately (via a different channel) to enable quick retries, or should we accumulate failures and return them in the final response?
2. What is the maximum number of inflight requests we should allow on the sending side?

This PR considers failures often occurring when the cluster is under load or during a rolling upgrade. To prevent retries from adding more load and to allow the cluster to stabilize, this PR chooses to send shard failures in the final response and limits the number of inflight requests to one per data node.

Includes #121999
Closes #121966
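The retry rules described in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: `RetryPolicySketch`, `Action`, `decide`, and `ShardCopies` are hypothetical names, node names are plain strings, and none of this is taken from the Elasticsearch source.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of the retry rules; not Elasticsearch code.
final class RetryPolicySketch {
    enum Action { RETRY_FAILED_SHARDS, RETRY_ALL_SHARDS, FAIL_REQUEST }

    // Once any page has been produced, the whole request must fail.
    // Otherwise a request-level failure retries every shard in the node
    // request, and a shard-level failure retries only the failing shards.
    static Action decide(boolean wholeRequestFailed, boolean pagesReceived) {
        if (pagesReceived) {
            return Action.FAIL_REQUEST;
        }
        return wholeRequestFailed ? Action.RETRY_ALL_SHARDS : Action.RETRY_FAILED_SHARDS;
    }

    // Remaining copies of one shard, tried in order on successive retries.
    static final class ShardCopies {
        private final Deque<String> remainingNodes;

        ShardCopies(List<String> nodes) {
            this.remainingNodes = new ArrayDeque<>(nodes);
        }

        // Returns the next node holding a copy, or empty when none is left.
        Optional<String> nextCopy() {
            return Optional.ofNullable(remainingNodes.pollFirst());
        }
    }
}
```

A sender built on this sketch would call `nextCopy()` for each failed shard reported in the final response and give up on a shard once it returns empty.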

elasticsearchmachine added :Analytics/ES|QL AKA ESQL >test-failure Triaged test failures from CI labels Feb 6, 2025

elasticsearchmachine added a commit that referenced this issue Feb 6, 2025

Mute org.elasticsearch.xpack.esql.plugin.DataNodeRequestSenderTests t…

e24489f

…estDoNotRetryOnRequestLevelFailure #121966

elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 6, 2025

nik9000 added medium-risk An open issue or test failure that is a medium risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 6, 2025

dnhatn self-assigned this Feb 6, 2025

dnhatn mentioned this issue Feb 8, 2025

Fix DataNodeRequestSender #121999

Merged

dnhatn closed this as completed in #121999 Feb 11, 2025

dnhatn closed this as completed in bda99c9 Feb 11, 2025
