
Retry ES|QL node requests on shard level failures #120774

Merged
merged 13 commits into elastic:main from dnhatn:retry-shard-failures on Feb 6, 2025

Conversation

dnhatn
Member

@dnhatn dnhatn commented Jan 24, 2025

Today, ES|QL fails fast on any failure. This PR introduces support for retrying within a cluster when data-node requests fail.

There are two types of failures that can occur with data-node requests: entire-request failures and individual shard failures. For individual shard failures, we can retry against other copies of the failing shards. For entire-request failures, we can retry every shard in the node request, provided no pages have been received.

On the handling side, ES|QL executes against a batch of shards concurrently. Here, we need to track whether any pages have been produced. If pages have been produced, the entire request must fail. Otherwise, we can track the failed shards and send them back to the sender for retries.
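To make that bookkeeping concrete, here is a minimal sketch in plain Java. The class and method names (NodeRequestTracker, onPageProduced, onShardFailure) are illustrative stand-ins rather than the actual ES|QL classes; the sketch only captures the rule described above: once any page has been sent back, a shard failure must fail the whole node request, otherwise failures are collected and returned for retry.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: simplified stand-in for the handler-side tracking, not the real ES|QL code.
class NodeRequestTracker {
    private final AtomicBoolean pagesProduced = new AtomicBoolean(false);
    private final Map<String, Exception> shardFailures = new ConcurrentHashMap<>();

    // Called when the first page for any shard of this node request is emitted.
    void onPageProduced() {
        pagesProduced.set(true);
    }

    // Records a shard-level failure; returns true if the whole node request must fail
    // because pages have already been sent back to the coordinator.
    boolean onShardFailure(String shardId, Exception e) {
        shardFailures.put(shardId, e);
        return pagesProduced.get();
    }

    // Failures reported in the final response so the sender can retry them on other copies.
    Map<String, Exception> failuresForRetry() {
        return Map.copyOf(shardFailures);
    }
}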

There are two decisions around how quickly we should retry:

  1. Should we notify the sender of failing shards immediately (via a different channel) to enable quick retries, or should we accumulate failures and return them in the final response?
  2. What is the maximum number of inflight requests we should allow on the sending side?

This PR assumes that failures often occur when the cluster is under load or during a rolling upgrade. To prevent retries from adding more load and to allow the cluster to stabilize, this PR sends shard failures in the final response and limits the number of in-flight requests to one per data node.
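The sending-side policy can be pictured with a small self-contained sketch. The names (NodeRequestSender, nodePermits, trySend, onNodeResponse) and the use of a Semaphore are assumptions for illustration, not the actual implementation, but they capture the "one in-flight request per data node, retry from the final response" behaviour described above.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Illustrative only: at most one in-flight node request per data node, with retries driven
// by the shard failures returned in the final response. Not the actual ES|QL classes.
class NodeRequestSender {
    private final Map<String, Semaphore> nodePermits = new ConcurrentHashMap<>();

    void trySend(String node, List<String> pendingShards, Runnable doSend) {
        Semaphore permit = nodePermits.computeIfAbsent(node, n -> new Semaphore(1));
        if (pendingShards.isEmpty() == false && permit.tryAcquire()) {
            doSend.run(); // the response/failure handler must release the permit and call trySend again
        }
    }

    void onNodeResponse(String node, List<String> failedShards, Runnable resendPendingShards) {
        nodePermits.get(node).release();
        if (failedShards.isEmpty() == false) {
            // re-resolve the failed shards to their next copies and schedule another round
            resendPendingShards.run();
        }
    }
}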

@dnhatn dnhatn force-pushed the retry-shard-failures branch 7 times, most recently from a3d93b1 to 08c2df1 on January 25, 2025 22:55
@dnhatn dnhatn changed the title WIP Retry ES|QL node requests on shard level failures Jan 26, 2025
@dnhatn dnhatn added the v8.18.0, auto-backport, :Analytics/ES|QL AKA ESQL, and >non-issue labels and removed the >non-issue, auto-backport, :Analytics/ES|QL AKA ESQL, and v8.18.0 labels on Jan 26, 2025
@dnhatn dnhatn force-pushed the retry-shard-failures branch from 08c2df1 to f94bcfb on January 28, 2025 05:19
@dnhatn dnhatn added the :Analytics/ES|QL AKA ESQL, >enhancement, v8.18.0, and auto-backport labels on Jan 28, 2025
@elasticsearchmachine
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

Member

@nik9000 nik9000 left a comment


LGTM. Worth another set of eyes to be sure, but yeah. LGTM.

@nik9000
Member

nik9000 commented Feb 3, 2025

If one of the other reviewers could approve too, I'm in. If not, I can read more.

Contributor

@quux00 quux00 left a comment


Left some questions to help my understanding.

// remove failures of successful shards
for (ShardId shardId : targetShards.shardIds()) {
    if (shardFailures.containsKey(shardId) == false) {
        shardFailures.remove(shardId);
Contributor


I don't understand what is happening here. If the shardFailures map does not contain a key, you try to remove it? Isn't that backwards? Sorry if I'm missing something.

And also, is there a race condition here between the call to containsKey and remove, or does the code guarantee that only one shardId is active at a time?

Member Author


Good catch! It should be checking the shardFailure from the response instead. Fixed in 5565434, thanks!
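For illustration, the corrected cleanup presumably looks something like the following hedged sketch, where successfulShardIds and the map type are simplified stand-ins for whatever the response actually carries, not the committed code:

import java.util.Collection;
import java.util.Map;

// Hedged sketch only: drop tracked failures for shards that the data-node response reports
// as successful, so that only still-failing shards are retried. The names below are stand-ins.
class ShardFailureCleanup {
    static void clearResolvedFailures(Map<String, Exception> shardFailures, Collection<String> successfulShardIds) {
        for (String shardId : successfulShardIds) {
            shardFailures.remove(shardId);
        }
    }
}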

    return new ShardFailure(fatal, e);
}
if (e instanceof NoShardAvailableActionException || ExceptionsHelper.unwrap(e, TaskCancelledException.class) != null) {
    return new ShardFailure(current.fatal || fatal, current.failure);
Contributor


Why is a TaskCancelledException not automatically fatal, rather than accepting the fatal setting passed into the method?

Member Author


I pushed b552c0d
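For context, here is a hedged sketch of a merge rule along the lines of the snippet above: a low-information error (shard unavailable, cancellation) keeps the previously recorded, more descriptive failure but still combines the fatal flags. The types and the isLowInformation check are simplified stand-ins, not the change in b552c0d.

import java.util.concurrent.CancellationException;

// Hedged sketch only: not the actual change pushed in b552c0d.
class ShardFailureMerge {
    record ShardFailure(boolean fatal, Exception failure) {}

    static ShardFailure merge(ShardFailure current, Exception e, boolean fatal) {
        if (current == null) {
            return new ShardFailure(fatal, e);
        }
        if (isLowInformation(e)) {
            // keep the earlier, more descriptive failure, but do not lose fatality
            return new ShardFailure(current.fatal() || fatal, current.failure());
        }
        return new ShardFailure(current.fatal() || fatal, e);
    }

    // stand-in for the NoShardAvailableActionException / TaskCancelledException checks
    static boolean isLowInformation(Exception e) {
        return e instanceof CancellationException;
    }
}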

sendRequest(request.node, request.shardIds, request.aliasFilters, new NodeListener() {
    void onAfter(List<DriverProfile> profiles) {
        nodePermits.get(request.node).release();
        trySendingRequestsForPendingShards(targetShards, computeListener);
Contributor


Since we are about to send out another request after a previous one, would this be a good place to check whether the rootTask has been cancelled before doing that?

Member Author


We don't need to do it here since we should already have this check inside TransportService before sending a child request.

@dnhatn dnhatn requested a review from quux00 February 5, 2025 22:01
@dnhatn
Member Author

dnhatn commented Feb 6, 2025

@nik9000 @quux00 Thanks for reviewing!

@dnhatn dnhatn merged commit 2d99a66 into elastic:main Feb 6, 2025
17 checks passed
@dnhatn dnhatn deleted the retry-shard-failures branch February 6, 2025 03:01
@elasticsearchmachine
Collaborator

💔 Backport failed

8.x: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 120774

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 6, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 6, 2025
@ldematte
Contributor

ldematte commented Feb 6, 2025

@dnhatn should this be backported to 9.0 too, or is this a new feature?

@dnhatn
Member Author

dnhatn commented Feb 6, 2025

> @dnhatn should this be backported to 9.0 too, or is this a new feature?

This is a new feature. I am not sure if we should backport it to 9.0.0.

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 6, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 11, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 11, 2025
dnhatn added a commit that referenced this pull request Feb 13, 2025
Currently, the ES|QL failure collectors categorize errors into 
non-cancellation and cancellation errors, preferring to return 
non-cancellation errors to users. With the retry on shard-level failure,
the failure collector can now collect more categories of errors: client 
errors, server errors, shard-unavailable errors, and cancellation
errors. For easier diagnostics and operations (especially on
serverless), the failure collectors prefer returning client (4xx) errors
over server (5xx) errors, shard-unavailable errors, and cancellation
errors.

Relates #120774
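The preference order described in this commit message could be sketched roughly as follows; the Category enum and the ordinal-based ranking are illustrative assumptions, not the actual failure-collector code:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hedged sketch only: prefer client (4xx) errors over server (5xx), shard-unavailable,
// and cancellation errors when deciding which collected failure to report to the user.
class FailurePreference {
    enum Category { CLIENT, SERVER, SHARD_UNAVAILABLE, CANCELLATION }

    record CollectedFailure(Category category, Exception error) {}

    static Optional<CollectedFailure> pickReported(List<CollectedFailure> failures) {
        // lower ordinal means higher priority, matching the declaration order above
        return failures.stream().min(Comparator.comparingInt((CollectedFailure f) -> f.category().ordinal()));
    }
}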
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 13, 2025
elasticsearchmachine pushed a commit that referenced this pull request Feb 13, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 15, 2025
dnhatn added a commit that referenced this pull request Feb 15, 2025
* Retry ES|QL node requests on shard level failures (#120774)

Includes #121999

Closes #121966
dnhatn added a commit that referenced this pull request Feb 15, 2025
Labels
:Analytics/ES|QL AKA ESQL, auto-backport (Automatically create backport pull requests when merged), >enhancement, Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), v8.19.0, v9.1.0