[CI] DataNodeRequestSenderTests testDoNotRetryOnRequestLevelFailure failing #121966

Closed
elasticsearchmachine opened this issue Feb 6, 2025 · 3 comments · Fixed by #121999
Labels
:Analytics/ES|QL AKA ESQL medium-risk An open issue or test failure that is a medium risk to future releases Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:esql:test" --tests "org.elasticsearch.xpack.esql.plugin.DataNodeRequestSenderTests.testDoNotRetryOnRequestLevelFailure" -Dtests.seed=5F94CB1AA392B574 -Dtests.locale=ga-Latn-IE -Dtests.timezone=Asia/Kathmandu -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: null

Issue Reasons:

  • [main] 3 failures in test testDoNotRetryOnRequestLevelFailure (2.2% fail rate in 137 executions)
  • [main] 2 failures in step part-3 (2.7% fail rate in 75 executions)
  • [main] 2 failures in pipeline elasticsearch-pull-request (2.7% fail rate in 75 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Analytics/ES|QL AKA ESQL >test-failure Triaged test failures from CI labels Feb 6, 2025
elasticsearchmachine added a commit that referenced this issue Feb 6, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 3 failures in test testDoNotRetryOnRequestLevelFailure (2.2% fail rate in 137 executions)
  • [main] 2 failures in step part-3 (2.7% fail rate in 75 executions)
  • [main] 2 failures in pipeline elasticsearch-pull-request (2.7% fail rate in 75 executions)

Build Scans:

@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 6, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000 nik9000 added medium-risk An open issue or test failure that is a medium risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 6, 2025
@nik9000
Member

nik9000 commented Feb 6, 2025

@dnhatn I think this might be one of yours?

@dnhatn dnhatn self-assigned this Feb 6, 2025
@dnhatn dnhatn closed this as completed in bda99c9 Feb 11, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 11, 2025
There are two issues in the current implementation:

1. We should use the list of shardIds from the request, rather than all
targets, when removing failures for shards that have been successfully
executed.

2. We should remove shardIds from the pending list once a failure is reported
and abort execution at that point, as the results will be discarded.

Closes elastic#121966
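
The two fixes described in this commit message can be illustrated with a minimal sketch. This is not the actual `DataNodeRequestSender` code: the class `ShardTracker`, its method names, and the use of plain strings as shard IDs are all hypothetical, chosen only to show the bookkeeping. It also assumes that the "failure" in point 2 is one that ends the query, which is an interpretation of the message rather than something the thread states.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical bookkeeping for pending shards and recorded failures. */
final class ShardTracker {
    private final Set<String> pending = new HashSet<>();
    private final Map<String, Exception> failures = new HashMap<>();
    private boolean aborted = false;

    ShardTracker(List<String> allTargets) {
        pending.addAll(allTargets);
    }

    /** Records a shard-level failure; the shard stays pending so it can be retried. */
    synchronized void onShardFailure(String shardId, Exception e) {
        failures.put(shardId, e);
    }

    /** Point 1: clear failures only for the shards in this request, not for all targets. */
    synchronized void onSuccess(List<String> requestShardIds) {
        for (String shardId : requestShardIds) {
            pending.remove(shardId);
            failures.remove(shardId);
        }
    }

    /** Point 2 (as interpreted here): drop the shards from pending and abort. */
    synchronized void onFatalFailure(List<String> requestShardIds, Exception e) {
        for (String shardId : requestShardIds) {
            pending.remove(shardId);
            failures.put(shardId, e);
        }
        aborted = true;
    }

    synchronized boolean isAborted() {
        return aborted;
    }

    synchronized Set<String> pending() {
        return Set.copyOf(pending);
    }

    synchronized Map<String, Exception> failures() {
        return Map.copyOf(failures);
    }
}
```

In this sketch, a success for one request leaves the recorded failures of unrelated shards untouched, and a fatal failure leaves no shard of that request in the pending set, so nothing further is scheduled for it.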
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 15, 2025
dnhatn added a commit that referenced this issue Feb 15, 2025
* Retry ES|QL node requests on shard level failures (#120774)

Today, ES|QL fails fast on any failure. This PR introduces support for
retrying within a cluster when data-node requests fail.

There are two types of failures that occur with data-node requests:
entire request failures and individual shard failures. For individual
shard failures, we can retry the next copies of the failing shards. For
entire request failures, we can retry every shard in the node request if
no pages have been received.

On the handling side, ES|QL executes against a batch of shards
concurrently. Here, we need to track whether any pages have been
produced. If pages have been produced, the entire request must fail.
Otherwise, we can track the failed shards and send them back to the
sender for retries.
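
The handling-side rule in the paragraph above can be sketched as follows. The names `BatchTracker` and `Decision` are hypothetical and do not come from the Elasticsearch code base; the sketch only assumes what the paragraph states, namely that a failure after any page has been produced fails the whole request, while a failure before any page lets the failed shards go back to the sender.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical outcome of running one batch of shards on a data node. */
enum Decision { SUCCESS, RETRY_SHARDS, FAIL_REQUEST }

/** Hypothetical tracker for one batch; methods are synchronized because shards run concurrently. */
final class BatchTracker {
    private boolean pagesProduced = false;
    private final List<String> failedShards = new ArrayList<>();

    /** Called whenever the batch emits a page of results. */
    synchronized void onPage() {
        pagesProduced = true;
    }

    /** Called when a single shard in the batch fails. */
    synchronized void onShardFailure(String shardId) {
        failedShards.add(shardId);
    }

    /** Decides what to report once the batch has finished. */
    synchronized Decision decide() {
        if (failedShards.isEmpty()) {
            return Decision.SUCCESS;
        }
        // Once pages have been sent, a retry could duplicate results, so the request fails.
        return pagesProduced ? Decision.FAIL_REQUEST : Decision.RETRY_SHARDS;
    }

    /** Shards to send back to the sender for retries. */
    synchronized List<String> failedShards() {
        return List.copyOf(failedShards);
    }
}
```
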

There are two decisions around how quickly we should retry:

1. Should we notify the sender of failing shards immediately (via a
different channel) to enable quick retries, or should we accumulate
failures and return them in the final response?

2. What is the maximum number of inflight requests we should allow on
the sending side?

This PR assumes that failures often occur when the cluster is under
load or during a rolling upgrade. To prevent retries from adding more
load and to allow the cluster to stabilize, this PR sends shard
failures in the final response and limits the number of inflight
requests to one per data node.

Includes #121999

Closes #121966
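
The sender-side choices described in this commit message (retry the next copy of a failing shard, and allow at most one inflight request per data node) can be sketched as below. `InFlightLimiter` and `ShardCopies` are hypothetical names, node IDs are plain strings, and the real sender is certainly more involved; this is only an illustration under those assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical limiter: at most one inflight request per data node. */
final class InFlightLimiter {
    private final Set<String> busyNodes = new HashSet<>();

    /** Returns true and marks the node busy if it has no request in flight. */
    synchronized boolean tryStart(String nodeId) {
        return busyNodes.add(nodeId);
    }

    /** Called when the node's final response, including any shard failures, arrives. */
    synchronized void finish(String nodeId) {
        busyNodes.remove(nodeId);
    }
}

/** Hypothetical iterator over the nodes that hold a copy of one shard. */
final class ShardCopies {
    private final Deque<String> remaining;

    ShardCopies(List<String> nodeIds) {
        this.remaining = new ArrayDeque<>(nodeIds);
    }

    /** Returns the next copy to try, or empty when every copy has been used. */
    synchronized Optional<String> nextCopy() {
        return Optional.ofNullable(remaining.poll());
    }
}
```

Because failures only arrive with the final response, a retry for a shard is naturally deferred until the node that failed it has finished, which matches the stated goal of not adding load while the cluster stabilizes.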