[CI] DataNodeRequestSenderTests testDoNotRetryOnRequestLevelFailure failing #121966
Labels
:Analytics/ES|QL
AKA ESQL
medium-risk
An open issue or test failure that is a medium risk to future releases
Team:Analytics
Meta label for analytical engine team (ESQL/Aggs/Geo)
>test-failure
Triaged test failures from CI
Comments
elasticsearchmachine added a commit that referenced this issue on Feb 6, 2025
…estDoNotRetryOnRequestLevelFailure #121966
This has been muted on branch main
Mute Reasons:
Build Scans:
Pinging @elastic/es-analytical-engine (Team:Analytics)
@dnhatn I think this might be one of yours?
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue on Feb 11, 2025
There are two issues in the current implementation:
1. We should use the list of shardIds from the request, rather than all targets, when removing failures for shards that have been successfully executed.
2. We should remove shardIds from the pending list once a failure is reported and abort execution at that point, as the results will be discarded.
Closes elastic#121966
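The two fixes in the commit message above can be sketched roughly as follows. This is a minimal illustration in Python, not the Elasticsearch Java code: `SenderState`, `on_node_success`, and `on_request_failure` are hypothetical names, and the exact bookkeeping is an assumption based only on the wording of the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class SenderState:
    """Hypothetical per-query bookkeeping; not the real Elasticsearch class."""
    pending: set                                  # shard ids still awaiting a result
    failures: dict = field(default_factory=dict)  # shard id -> last recorded error
    aborted: bool = False


def on_node_success(state, request_shard_ids, failed_in_response):
    """Handle a node response that succeeded at the request level."""
    # Fix 1 (as read from the commit message): walk only the shard ids that
    # were part of this request, not every target shard of the query.
    for shard_id in request_shard_ids:
        if shard_id in failed_in_response:
            state.failures[shard_id] = failed_in_response[shard_id]
        else:
            state.failures.pop(shard_id, None)
            state.pending.discard(shard_id)


def on_request_failure(state, request_shard_ids, error):
    """Handle a failure of the whole node request."""
    # Fix 2 (as read from the commit message): the shards leave the pending
    # list and execution is aborted, because the results will be discarded.
    for shard_id in request_shard_ids:
        state.failures[shard_id] = error
        state.pending.discard(shard_id)
    state.aborted = True
```

In this sketch, a successful response for shards `s1` and `s2` leaves an earlier failure recorded for an unrelated shard `s3` untouched, which is the behavior the first fix asks for.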
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue on Feb 15, 2025
There are two issues in the current implementation:
1. We should use the list of shardIds from the request, rather than all targets, when removing failures for shards that have been successfully executed.
2. We should remove shardIds from the pending list once a failure is reported and abort execution at that point, as the results will be discarded.
Closes elastic#121966
dnhatn added a commit that referenced this issue on Feb 15, 2025
* Retry ES|QL node requests on shard level failures (#120774)

Today, ES|QL fails fast on any failure. This PR introduces support for retrying within a cluster when data-node requests fail.

There are two types of failures that occur with data-node requests: entire request failures and individual shard failures. For individual shard failures, we can retry the next copies of the failing shards. For entire request failures, we can retry every shard in the node request if no pages have been received.

On the handling side, ES|QL executes against a batch of shards concurrently. Here, we need to track whether any pages have been produced. If pages have been produced, the entire request must fail. Otherwise, we can track the failed shards and send them back to the sender for retries.

There are two decisions around how quickly we should retry:
1. Should we notify the sender of failing shards immediately (via a different channel) to enable quick retries, or should we accumulate failures and return them in the final response?
2. What is the maximum number of inflight requests we should allow on the sending side?

This PR considers failures often occurring when the cluster is under load or during a rolling upgrade. To prevent retries from adding more load and to allow the cluster to stabilize, this PR chooses to send shard failures in the final response and limits the number of inflight requests to one per data node.

Includes #121999
Closes #121966
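The retry policy described in this commit message can be illustrated with a small simulation. It is a sketch under stated assumptions, written in Python rather than the project's Java: `run_with_retries`, `shard_copies`, and `send` are hypothetical names. Requests are issued sequentially, so the limit of one inflight request per data node holds trivially. Shard failures are only seen in the value returned by `send`, standing in for the final response. Request-level failures after pages have been received, which the message says must fail the whole query, are not modeled.

```python
def run_with_retries(shard_copies, send):
    """Simulate shard-level retries on the sending side.

    shard_copies: dict of shard id -> list of node names holding a copy,
                  in the order they should be tried (hypothetical input).
    send:         callable(node, shard_ids) -> set of shard ids that failed
                  on that node, as reported in the node's final response.
    Returns (completed shard ids, shard ids with no copies left to try).
    """
    next_copy = {shard: 0 for shard in shard_copies}
    pending = set(shard_copies)
    completed, exhausted = set(), set()

    while pending:
        # Group pending shards by the node that holds their next untried copy.
        batches = {}
        for shard in sorted(pending):
            copies = shard_copies[shard]
            if next_copy[shard] >= len(copies):
                exhausted.add(shard)  # every copy has been tried
                continue
            batches.setdefault(copies[next_copy[shard]], []).append(shard)
        pending -= exhausted

        # One request per node per round.
        for node, shards in batches.items():
            failed = send(node, shards)
            for shard in shards:
                if shard in failed:
                    next_copy[shard] += 1  # retry on the next copy later
                else:
                    completed.add(shard)
                    pending.discard(shard)

    return completed, exhausted
```

A caller would supply `send` as whatever actually dispatches the node request. In a test, a stub that fails every shard on one node is enough to watch shards move to their next copy or run out of copies.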
Build Scans:
Reproduction Line:
Applicable branches:
main
Reproduces locally?:
N/A
Failure History:
See dashboard
Failure Message:
Issue Reasons:
Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.