
Retry ES|QL node requests on shard level failures #120774

Merged
merged 13 commits into elastic:main from dnhatn:retry-shard-failures on Feb 6, 2025

Conversation

dnhatn
Member

@dnhatn dnhatn commented Jan 24, 2025

Today, ES|QL fails fast on any failure. This PR introduces support for retrying within a cluster when data-node requests fail.

There are two types of failures that can occur with data-node requests: entire-request failures and individual shard failures. For individual shard failures, we can retry against other copies of the failing shards. For entire-request failures, we can retry every shard in the node request, provided no pages have been received.

On the handling side, ES|QL executes against a batch of shards concurrently. Here, we need to track whether any pages have been produced. If pages have been produced, the entire request must fail. Otherwise, we can track the failed shards and send them back to the sender for retries.
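To make that bookkeeping concrete, here is a minimal sketch in plain Java. The class and method names (NodeRequestTracker, onPageProduced, onShardFailure) are illustrative stand-ins rather than the actual ES|QL classes; the sketch only captures the rule described above: once any page has been sent back, a shard failure must fail the whole node request, otherwise failures are collected and returned for retry.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: simplified stand-in for the handler-side tracking, not the real ES|QL code.
class NodeRequestTracker {
    private final AtomicBoolean pagesProduced = new AtomicBoolean(false);
    private final Map<String, Exception> shardFailures = new ConcurrentHashMap<>();

    // Called when the first page for any shard of this node request is emitted.
    void onPageProduced() {
        pagesProduced.set(true);
    }

    // Records a shard-level failure; returns true if the whole node request must fail
    // because pages have already been sent back to the coordinator.
    boolean onShardFailure(String shardId, Exception e) {
        shardFailures.put(shardId, e);
        return pagesProduced.get();
    }

    // Failures reported in the final response so the sender can retry them on other copies.
    Map<String, Exception> failuresForRetry() {
        return Map.copyOf(shardFailures);
    }
}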

There are two decisions around how quickly we should retry:

  1. Should we notify the sender of failing shards immediately (via a different channel) to enable quick retries, or should we accumulate failures and return them in the final response?
  2. What is the maximum number of inflight requests we should allow on the sending side?

This PR assumes that failures often occur when the cluster is under load or during a rolling upgrade. To prevent retries from adding more load and to allow the cluster to stabilize, this PR sends shard failures in the final response and limits the number of in-flight requests to one per data node.
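The sending-side policy can be pictured with a small self-contained sketch. The names (NodeRequestSender, nodePermits, trySend, onNodeResponse) and the use of a Semaphore are assumptions for illustration, not the actual implementation, but they capture the "one in-flight request per data node, retry from the final response" behaviour described above.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Illustrative only: at most one in-flight node request per data node, with retries driven
// by the shard failures returned in the final response. Not the actual ES|QL classes.
class NodeRequestSender {
    private final Map<String, Semaphore> nodePermits = new ConcurrentHashMap<>();

    void trySend(String node, List<String> pendingShards, Runnable doSend) {
        Semaphore permit = nodePermits.computeIfAbsent(node, n -> new Semaphore(1));
        if (pendingShards.isEmpty() == false && permit.tryAcquire()) {
            doSend.run(); // the response/failure handler must release the permit and call trySend again
        }
    }

    void onNodeResponse(String node, List<String> failedShards, Runnable resendPendingShards) {
        nodePermits.get(node).release();
        if (failedShards.isEmpty() == false) {
            // re-resolve the failed shards to their next copies and schedule another round
            resendPendingShards.run();
        }
    }
}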

@dnhatn dnhatn force-pushed the retry-shard-failures branch 7 times, most recently from a3d93b1 to 08c2df1 on January 25, 2025 22:55
@dnhatn dnhatn changed the title WIP Retry ES|QL node requests on shard level failures Jan 26, 2025
@dnhatn dnhatn added the v8.18.0, auto-backport, :Analytics/ES|QL AKA ESQL, and >non-issue labels and removed the >non-issue, auto-backport, :Analytics/ES|QL AKA ESQL, and v8.18.0 labels on Jan 26, 2025
@dnhatn dnhatn force-pushed the retry-shard-failures branch from 08c2df1 to f94bcfb on January 28, 2025 05:19
@dnhatn dnhatn added the :Analytics/ES|QL AKA ESQL, >enhancement, v8.18.0, and auto-backport labels on Jan 28, 2025
@elasticsearchmachine
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

Member

@nik9000 nik9000 left a comment


LGTM. Worth another set of eyes to be sure, but yeah. LGTM.

@nik9000
Member

nik9000 commented Feb 3, 2025

If one of the other reviewers could approve too, I'm in. If not, I can read more.

Contributor

@quux00 quux00 left a comment


Left some questions to help my understanding.

// remove failures of successful shards
for (ShardId shardId : targetShards.shardIds()) {
    if (shardFailures.containsKey(shardId) == false) {
        shardFailures.remove(shardId);
Contributor


I don't understand what is happening here. If the shardFailures map does not contain a key, you try to remove it? Isn't that backwards? Sorry if I'm missing something.

And also, is there a race condition here between the call to containsKey and remove, or does the code guarantee that only one shardId is active at a time?

Member Author


Good catch! It should be checking the shardFailure from the response instead. Fixed in 5565434, thanks!
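For illustration, the corrected cleanup presumably looks something like the following hedged sketch, where successfulShardIds and the map type are simplified stand-ins for whatever the response actually carries, not the committed code:

import java.util.Collection;
import java.util.Map;

// Hedged sketch only: drop tracked failures for shards that the data-node response reports
// as successful, so that only still-failing shards are retried. The names below are stand-ins.
class ShardFailureCleanup {
    static void clearResolvedFailures(Map<String, Exception> shardFailures, Collection<String> successfulShardIds) {
        for (String shardId : successfulShardIds) {
            shardFailures.remove(shardId);
        }
    }
}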

    return new ShardFailure(fatal, e);
}
if (e instanceof NoShardAvailableActionException || ExceptionsHelper.unwrap(e, TaskCancelledException.class) != null) {
    return new ShardFailure(current.fatal || fatal, current.failure);
Contributor


Why is a TaskCancelledException not automatically fatal, rather than accepting the fatal setting passed into the method?

Member Author


I pushed b552c0d
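For context, here is a hedged sketch of a merge rule along the lines of the snippet above: a low-information error (shard unavailable, cancellation) keeps the previously recorded, more descriptive failure but still combines the fatal flags. The types and the isLowInformation check are simplified stand-ins, not the change in b552c0d.

import java.util.concurrent.CancellationException;

// Hedged sketch only: not the actual change pushed in b552c0d.
class ShardFailureMerge {
    record ShardFailure(boolean fatal, Exception failure) {}

    static ShardFailure merge(ShardFailure current, Exception e, boolean fatal) {
        if (current == null) {
            return new ShardFailure(fatal, e);
        }
        if (isLowInformation(e)) {
            // keep the earlier, more descriptive failure, but do not lose fatality
            return new ShardFailure(current.fatal() || fatal, current.failure());
        }
        return new ShardFailure(current.fatal() || fatal, e);
    }

    // stand-in for the NoShardAvailableActionException / TaskCancelledException checks
    static boolean isLowInformation(Exception e) {
        return e instanceof CancellationException;
    }
}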

sendRequest(request.node, request.shardIds, request.aliasFilters, new NodeListener() {
    void onAfter(List<DriverProfile> profiles) {
        nodePermits.get(request.node).release();
        trySendingRequestsForPendingShards(targetShards, computeListener);
Contributor


Since we are about to send out another request after a previous one, would this be a good place to check whether the rootTask has been cancelled before doing that?

Member Author


We don't need to do it here since we should already have this check inside TransportService before sending a child request.

@dnhatn dnhatn requested a review from quux00 February 5, 2025 22:01
@dnhatn
Member Author

dnhatn commented Feb 6, 2025

@nik9000 @quux00 Thanks for reviewing!

@dnhatn dnhatn merged commit 2d99a66 into elastic:main Feb 6, 2025
17 checks passed
@dnhatn dnhatn deleted the retry-shard-failures branch February 6, 2025 03:01
@elasticsearchmachine
Collaborator

💔 Backport failed

8.x: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 120774

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 6, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 6, 2025
@ldematte
Contributor

ldematte commented Feb 6, 2025

@dnhatn should this be backported to 9.0 too, or is this a new feature?

@dnhatn
Member Author

dnhatn commented Feb 6, 2025

> @dnhatn should this be backported to 9.0 too, or is this a new feature?

This is a new feature. I am not sure if we should backport it to 9.0.0.

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 6, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 11, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 11, 2025
dnhatn added a commit that referenced this pull request Feb 13, 2025
Currently, the ES|QL failure collectors categorize errors into 
non-cancellation and cancellation errors, preferring to return 
non-cancellation errors to users. With the retry on shard-level failure,
the failure collector can now collect more categories of errors: client 
errors, server errors, shard-unavailable errors, and cancellation
errors. For easier diagnostics and operations (especially on
serverless), the failure collectors prefer returning client (4xx) errors
over server (5xx) errors, shard-unavailable errors, and cancellation
errors.

Relates #120774
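The preference order described in this commit message could be sketched roughly as follows; the Category enum and the ordinal-based ranking are illustrative assumptions, not the actual failure-collector code:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hedged sketch only: prefer client (4xx) errors over server (5xx), shard-unavailable,
// and cancellation errors when deciding which collected failure to report to the user.
class FailurePreference {
    enum Category { CLIENT, SERVER, SHARD_UNAVAILABLE, CANCELLATION }

    record CollectedFailure(Category category, Exception error) {}

    static Optional<CollectedFailure> pickReported(List<CollectedFailure> failures) {
        // lower ordinal means higher priority, matching the declaration order above
        return failures.stream().min(Comparator.comparingInt((CollectedFailure f) -> f.category().ordinal()));
    }
}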
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 13, 2025
elasticsearchmachine pushed a commit that referenced this pull request Feb 13, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 15, 2025
dnhatn added a commit that referenced this pull request Feb 15, 2025
* Retry ES|QL node requests on shard level failures (#120774)

Includes #121999

Closes #121966
dnhatn added a commit that referenced this pull request Feb 15, 2025
Labels
:Analytics/ES|QL AKA ESQL, auto-backport (Automatically create backport pull requests when merged), >enhancement, Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), v8.19.0, v9.1.0