Fix DataNodeRequestSender #121999

dnhatn · 2025-02-07T08:55:35Z

There are two issues in the current implementation:

1. We should use the list of shardIds from the request, rather than all targets, when removing failures for shards that have been successfully executed.
2. We should remove shardIds from the pending list once a failure is reported and abort execution at that point, as the results will be discarded.
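
The two points above can be illustrated with a minimal sketch. This is not the actual `DataNodeRequestSender` code: the class `ShardTrackerSketch`, its methods, and the use of plain strings for shard ids and failure reasons are all hypothetical, and it assumes "a failure is reported" means the failure is handed back to the caller so the query can no longer succeed.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the sender's bookkeeping (illustration only).
final class ShardTrackerSketch {
    private final Set<String> pending = new LinkedHashSet<>();
    private final Map<String, String> failures = new HashMap<>();
    private boolean aborted = false;

    ShardTrackerSketch(Collection<String> targetShardIds) {
        pending.addAll(targetShardIds);
    }

    // Remember a shard-level failure; the shard stays pending so another copy can be tried.
    void recordFailure(String shardId, String reason) {
        failures.put(shardId, reason);
    }

    // Point 1: iterate only over the shard ids of THIS node request, not over all targets,
    // so that failures of shards outside the request are left untouched.
    void onNodeResponse(List<String> requestShardIds, Set<String> failedInResponse) {
        for (String shardId : requestShardIds) {
            if (!failedInResponse.contains(shardId)) {
                failures.remove(shardId);
                pending.remove(shardId);
            }
        }
    }

    // Point 2: once the failure is reported, nothing should remain pending and no
    // further requests should be sent, because any results would be discarded.
    void reportFailureAndAbort() {
        aborted = true;
        pending.clear();
    }

    // Shards that would be sent in the next round; empty after an abort.
    List<String> nextBatch() {
        return aborted ? List.of() : List.copyOf(pending);
    }

    Set<String> pending() {
        return pending;
    }

    Map<String, String> failures() {
        return failures;
    }
}
```

In this sketch, a response for a request covering only `s0` and `s1` cannot clear a failure previously recorded for `s2`, which is the behavior the first point asks for.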

Closes #121966

elasticsearchmachine · 2025-02-08T01:49:09Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

* Retry ES|QL node requests on shard level failures (#120774)

Today, ES|QL fails fast on any failure. This PR introduces support for retrying within a cluster when data-node requests fail.

There are two types of failures that occur with data-node requests: entire request failures and individual shard failures. For individual shard failures, we can retry the next copies of the failing shards. For entire request failures, we can retry every shard in the node request if no pages have been received.

On the handling side, ES|QL executes against a batch of shards concurrently. Here, we need to track whether any pages have been produced. If pages have been produced, the entire request must fail. Otherwise, we can track the failed shards and send them back to the sender for retries.

There are two decisions around how quickly we should retry:

1. Should we notify the sender of failing shards immediately (via a different channel) to enable quick retries, or should we accumulate failures and return them in the final response?
2. What is the maximum number of inflight requests we should allow on the sending side?

This PR considers failures often occurring when the cluster is under load or during a rolling upgrade. To prevent retries from adding more load and to allow the cluster to stabilize, this PR chooses to send shard failures in the final response and limits the number of inflight requests to one per data node.

Includes #121999

Closes #121966
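
The handling-side rule and the inflight limit described in this commit message can be sketched roughly as follows. Every name here (`NodeRetrySketch`, `Action`, `decide`, `tryStartRequest`, `finishRequest`) is hypothetical and chosen only for illustration; the real ES|QL code is structured differently.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the retry decision and the per-node inflight limit.
final class NodeRetrySketch {
    enum Action { COMPLETE, RETRY_FAILED_SHARDS, FAIL_REQUEST }

    // If pages were already produced, partial results may have been consumed,
    // so the whole request must fail; otherwise the failed shards can be
    // returned to the sender in the final response and retried on other copies.
    static Action decide(boolean pagesProduced, List<String> failedShardIds) {
        if (failedShardIds.isEmpty()) {
            return Action.COMPLETE;
        }
        return pagesProduced ? Action.FAIL_REQUEST : Action.RETRY_FAILED_SHARDS;
    }

    // One permit per data node: at most one inflight request per node.
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    // Returns true if a request to this node may be sent now.
    boolean tryStartRequest(String nodeId) {
        return permits.computeIfAbsent(nodeId, n -> new Semaphore(1)).tryAcquire();
    }

    // Must be called once the node's final response (or failure) arrives.
    void finishRequest(String nodeId) {
        Semaphore permit = permits.get(nodeId);
        if (permit != null) {
            permit.release();
        }
    }
}
```

Limiting each node to a single permit means a retry for that node can only start after its previous request has finished, which matches the stated goal of not adding load while the cluster stabilizes.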

elasticsearchmachine added the v9.1.0 label Feb 7, 2025

dnhatn force-pushed the fix-sender branch from 2997189 to b79439d Compare February 8, 2025 01:40

dnhatn added v8.19.0 :Analytics/ES|QL AKA ESQL >non-issue labels Feb 8, 2025

dnhatn requested review from nik9000, quux00 and smalyshev February 8, 2025 01:48

dnhatn marked this pull request as ready for review February 8, 2025 01:48

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 8, 2025

Fix DataNodeRequestSender

38e6a6a

dnhatn force-pushed the fix-sender branch from b79439d to 38e6a6a Compare February 10, 2025 22:52

dnhatn enabled auto-merge (squash) February 11, 2025 00:43

dnhatn disabled auto-merge February 11, 2025 00:43

dnhatn merged commit bda99c9 into elastic:main Feb 11, 2025
16 of 17 checks passed

dnhatn deleted the fix-sender branch February 11, 2025 00:44
