sql: PartitionSpans for a large number of spans can result in a node hotspot on the gateway #114079

Closed
adityamaru opened this issue Nov 8, 2023 · 3 comments · Fixed by #114537
Assignees
Labels
A-jobs, branch-release-23.2 (GA/release blockers for 23.2), C-bug (code not up to spec/doc; solution expected to change code/behavior), GA-blocker, O-support (would prevent or help troubleshoot a customer escalation), T-jobs

Comments

@adityamaru
Contributor

adityamaru commented Nov 8, 2023

Bulk jobs and SQL queries share the same underlying infrastructure to partition a set of spans and find appropriate nodes based on the passed-in configuration. In https://github.com/cockroachlabs/support/issues/2659 we noticed a case where a BACKUP WITH EXECUTION LOCALITY could end up with all of its spans handed to the gateway node, causing a severe degradation in performance.

In the escalation above we saw that the oracle that determines which node to send a span to returned instances that did not match our locality filter. This is not in itself a bug in the oracle; there are several possible reasons for it, such as there being no replicas in the region matching the locality filter, or the txn timestamp not being old enough to be served by follower reads. Since none of our eligible instances could be picked, we ended up choosing the gateway for every span:

`return dsp.gatewaySQLInstanceID, SpanPartitionReason_GATEWAY_NO_LOCALITY_MATCH`
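For illustration, a minimal, self-contained sketch of that fallback shape; the type and function names here are hypothetical and the real resolver code differs:

```go
// instance is a hypothetical, pared-down stand-in for a SQL instance
// descriptor; the real resolver uses different types.
type instance struct {
	id       int
	locality map[string]string
}

// pickInstanceSketch mirrors the fallback described above: if no eligible
// instance matches the locality filter, the span is handed to the gateway.
// Before the fix this branch fires for every span when the filter cannot
// be satisfied, so the gateway ends up owning the entire partition set.
func pickInstanceSketch(eligible []instance, filter map[string]string, gatewayID int) int {
	for _, inst := range eligible {
		if localityMatches(inst.locality, filter) {
			return inst.id
		}
	}
	return gatewayID // SpanPartitionReason_GATEWAY_NO_LOCALITY_MATCH
}

func localityMatches(loc, filter map[string]string) bool {
	for k, v := range filter {
		if loc[k] != v {
			return false
		}
	}
	return true
}
```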

This bias towards the gateway makes sense for most OLTP SQL queries, as explained by @rharding6373 and @yuzefovich over here - https://cockroachlabs.slack.com/archives/C01RX2G8LT1/p1699395502272229?thread_ts=1699393510.164829&cid=C01RX2G8LT1. Bulk jobs (backup, restore, CDC, PCR), however, rely entirely on this partitioning to divide their work and do not have an external load balancer to fall back on. Presumably, OLAP queries with thousands of spans would run into a similar hotspot problem.

A proposed solution would be to add a heuristic that defaults to the gateway up to some threshold, after which we randomly distribute the remaining spans to the other healthy instances in our set of eligible instances. In an ideal world each instance would get (number of spans / n) of the work, where n is the number of healthy, eligible instances, so we could say that once the gateway has been assigned 3 * (number of spans / n) spans for lack of a better choice, we start round-robin assigning the rest to the other eligible instances (see the sketch below).
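A rough, purely illustrative sketch of that heuristic; the function name and the shape of the loop are assumptions, with only the 3x factor taken from the paragraph above:

```go
// assignSpansSketch illustrates the proposed heuristic (not the implemented
// algorithm): the gateway is used as the default target until it holds
// roughly 3x its fair share of spans, after which the remaining spans are
// spread round-robin over the other eligible instances.
func assignSpansSketch(numSpans int, eligible []int, gatewayID int) map[int]int {
	counts := make(map[int]int) // instance ID -> number of spans assigned
	var others []int
	for _, id := range eligible {
		if id != gatewayID {
			others = append(others, id)
		}
	}
	fairShare := numSpans / len(eligible) // assumes len(eligible) > 0
	gatewayCap := 3 * fairShare           // the 3x factor suggested above
	next := 0
	for i := 0; i < numSpans; i++ {
		if counts[gatewayID] < gatewayCap || len(others) == 0 {
			counts[gatewayID]++
			continue
		}
		counts[others[next%len(others)]]++
		next++
	}
	return counts
}
```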

Jira issue: CRDB-33325

@adityamaru adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team T-jobs labels Nov 8, 2023
@blathers-crl blathers-crl bot added the A-jobs label Nov 8, 2023
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Nov 8, 2023
@adityamaru adityamaru changed the title sql: PartitionSpans for bulk jobs can lead to a hotspot under some circumstances sql: PartitionSpans for a large number of spans can result in a node hotspot Nov 8, 2023
@adityamaru adityamaru changed the title sql: PartitionSpans for a large number of spans can result in a node hotspot sql: PartitionSpans for a large number of spans can result in a node hotspot on the gateway Nov 8, 2023
@miretskiy
Contributor

I want to chime in here -- it's not only DR jobs that are impacted, but also CDC (and we also have escalations around this issue)

@adityamaru adityamaru self-assigned this Nov 10, 2023
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Nov 14, 2023
@nicktrav nicktrav added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Nov 15, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Nov 16, 2023
Previously, the instance resolver would always
assign the partition span to the gateway if the
gateway was in the set of eligible instances and
we did not find an eligible instance with a better
locality match. In large clusters running backup/CDC
with execution locality, this could cause the
gateway to get the lion's share of the work, thereby
causing it to OOM or severely degrading performance.

This change makes span partitioning a little more
stateful. Concretely, we now track how many partition
spans have been assigned to each node in the `planCtx`
that is used throughout the planning of a single statement.
This distribution is then used to limit the number of
partition spans we default to the gateway. Currently, by
default we allow the gateway to have:

`2 * average number of partition spans across the other instances`

If the gateway does not satisfy this heuristic we randomly
pick one of the other eligible instances. Note, if there
are no eligible instances except for the gateway, or the
gateway has received no spans yet, we will pick the gateway.

Fixes: cockroachdb#114079
Release note (bug fix): fixes a bug where large jobs running
with execution locality could result in the gateway being assigned
most of the work, causing performance degradation and cluster
instability.
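For readers following along, here is a simplified, self-contained sketch of the check described in the commit message above; the names are hypothetical and the real change threads these counts through the DistSQL planning context:

```go
// planCtxSketch stands in for the planning context described above; it
// just tracks how many partition spans each instance has been assigned
// so far within a single statement's planning.
type planCtxSketch struct {
	spanCounts map[int]int // instance ID -> partition spans assigned
}

// shouldDefaultToGateway applies the heuristic from the commit message:
// default to the gateway only while its span count is at most 2x the
// average count across the other eligible instances. If there are no
// other instances, or the gateway has received nothing yet, pick it.
func (p *planCtxSketch) shouldDefaultToGateway(gatewayID int, others []int) bool {
	if len(others) == 0 || p.spanCounts[gatewayID] == 0 {
		return true
	}
	total := 0
	for _, id := range others {
		total += p.spanCounts[id]
	}
	avg := float64(total) / float64(len(others))
	return float64(p.spanCounts[gatewayID]) <= 2*avg
}
```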
@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Nov 21, 2023
craig bot pushed a commit that referenced this issue Nov 22, 2023
114537: sql: prevent gateway from always being picked as the default r=adityamaru a=adityamaru

Co-authored-by: adityamaru <[email protected]>
@craig craig bot closed this as completed in 51fd26f Nov 22, 2023
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Nov 22, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Dec 1, 2023
@adityamaru adityamaru added GA-blocker branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Dec 1, 2023
@adityamaru
Contributor Author

reopening until the backport lands

@adityamaru adityamaru reopened this Dec 1, 2023
@github-project-automation github-project-automation bot moved this from Done to Triage in SQL Queries Dec 1, 2023
@yuzefovich yuzefovich removed this from SQL Queries Dec 1, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Dec 7, 2023
This change also adds a new session variable `distsql_plan_gateway_bias`
to control how many times the gateway will be picked as the default
target for a partition relative to the distribution of
partition spans across other nodes.

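A small variant of the same check with the bias made configurable, mirroring what `distsql_plan_gateway_bias` is described as controlling; this is only a sketch and the actual plumbing through session data is not shown:

```go
// underGatewayBias generalizes the 2x check above: the gateway keeps
// receiving spans only while it holds at most gatewayBias times the
// average span count of the other eligible instances. gatewayBias is
// meant to mirror the distsql_plan_gateway_bias session variable.
func underGatewayBias(counts map[int]int, gatewayID int, others []int, gatewayBias int) bool {
	if len(others) == 0 || counts[gatewayID] == 0 {
		// No alternatives, or the gateway has not been assigned anything yet.
		return true
	}
	total := 0
	for _, id := range others {
		total += counts[id]
	}
	avg := float64(total) / float64(len(others))
	return float64(counts[gatewayID]) <= float64(gatewayBias)*avg
}
```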
@adityamaru
Contributor Author

behaviour has been backported to 23.2

blathers-crl bot pushed a commit that referenced this issue Dec 8, 2023