sql: PartitionSpans for a large number of spans can result in a node hotspot on the gateway #114079

Closed
adityamaru opened this issue Nov 8, 2023 · 3 comments · Fixed by #114537
Assignees
Labels
A-jobs, branch-release-23.2 (GA/release blockers for 23.2), C-bug (code not up to spec/doc; solution expected to change code/behavior), GA-blocker, O-support (would prevent or help troubleshoot a customer escalation), T-jobs

Comments

@adityamaru
Contributor

adityamaru commented Nov 8, 2023

Bulk jobs and SQL queries share the same underlying infrastructure to partition a set of spans and find appropriate nodes based on the passed-in configuration. In https://github.com/cockroachlabs/support/issues/2659 we noticed a case where a BACKUP WITH EXECUTION LOCALITY could end up with all of its spans handed to the gateway node, causing a severe degradation in performance.

In the escalation above we saw that the oracle that determines which node to send a span to returned instances that did not match our locality filter. This is not in itself a bug in the oracle; there are several possible reasons for it, such as there being no replicas in the region matching the locality filter, or the txn timestamp not being old enough to be served by follower reads. Since none of our eligible instances could be picked, we ended up choosing the gateway for every span:

`return dsp.gatewaySQLInstanceID, SpanPartitionReason_GATEWAY_NO_LOCALITY_MATCH`
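For illustration, a minimal, self-contained sketch of that fallback shape; the type and function names here are hypothetical and the real resolver code differs:

```go
// instance is a hypothetical, pared-down stand-in for a SQL instance
// descriptor; the real resolver uses different types.
type instance struct {
	id       int
	locality map[string]string
}

// pickInstanceSketch mirrors the fallback described above: if no eligible
// instance matches the locality filter, the span is handed to the gateway.
// Before the fix this branch fires for every span when the filter cannot
// be satisfied, so the gateway ends up owning the entire partition set.
func pickInstanceSketch(eligible []instance, filter map[string]string, gatewayID int) int {
	for _, inst := range eligible {
		if localityMatches(inst.locality, filter) {
			return inst.id
		}
	}
	return gatewayID // SpanPartitionReason_GATEWAY_NO_LOCALITY_MATCH
}

func localityMatches(loc, filter map[string]string) bool {
	for k, v := range filter {
		if loc[k] != v {
			return false
		}
	}
	return true
}
```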

This bias towards the gateway makes sense for most OLTP SQL queries, as explained by @rharding6373 and @yuzefovich over here - https://cockroachlabs.slack.com/archives/C01RX2G8LT1/p1699395502272229?thread_ts=1699393510.164829&cid=C01RX2G8LT1. Bulk jobs (backup, restore, CDC, PCR), however, rely entirely on this partitioning to divide their work and do not have an external load balancer to fall back on. Presumably, OLAP queries with thousands of spans would run into a similar hotspot problem.

A proposed solution would be to add a heuristic that defaults to the gateway up to some threshold, after which we randomly distribute the remaining spans to the other healthy instances in our set of eligible instances. In an ideal world each instance would get (number of spans / n) of the work, where n is the number of healthy, eligible instances, so we could say that once the gateway has been assigned 3 * (number of spans / n) spans for lack of a better choice, we start round-robin assigning the rest to the other eligible instances (see the sketch below).
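A rough, purely illustrative sketch of that heuristic; the function name and the shape of the loop are assumptions, with only the 3x factor taken from the paragraph above:

```go
// assignSpansSketch illustrates the proposed heuristic (not the implemented
// algorithm): the gateway is used as the default target until it holds
// roughly 3x its fair share of spans, after which the remaining spans are
// spread round-robin over the other eligible instances.
func assignSpansSketch(numSpans int, eligible []int, gatewayID int) map[int]int {
	counts := make(map[int]int) // instance ID -> number of spans assigned
	var others []int
	for _, id := range eligible {
		if id != gatewayID {
			others = append(others, id)
		}
	}
	fairShare := numSpans / len(eligible) // assumes len(eligible) > 0
	gatewayCap := 3 * fairShare           // the 3x factor suggested above
	next := 0
	for i := 0; i < numSpans; i++ {
		if counts[gatewayID] < gatewayCap || len(others) == 0 {
			counts[gatewayID]++
			continue
		}
		counts[others[next%len(others)]]++
		next++
	}
	return counts
}
```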

Jira issue: CRDB-33325

@adityamaru adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team T-jobs labels Nov 8, 2023
@blathers-crl blathers-crl bot added the A-jobs label Nov 8, 2023
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Nov 8, 2023
@adityamaru adityamaru changed the title sql: PartitionSpans for bulk jobs can lead to a hotspot under some circumstances sql: PartitionSpans for a large number of spans can result in a node hotspot Nov 8, 2023
@adityamaru adityamaru changed the title sql: PartitionSpans for a large number of spans can result in a node hotspot sql: PartitionSpans for a large number of spans can result in a node hotspot on the gateway Nov 8, 2023
@miretskiy
Contributor

I want to chime in here -- it's not only DR jobs that are impacted, but also CDC (and we also have escalations around this issue)

@adityamaru adityamaru self-assigned this Nov 10, 2023
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Nov 14, 2023
@nicktrav nicktrav added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Nov 15, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Nov 16, 2023
Previously, the instance resolver would always
assign the partition span to the gateway if the
gateway was in the set of eligible instances and
we did not find an eligible instance with a better
locality match. In large clusters running backup/CDC
with execution locality, this could cause the
gateway to get the lion's share of the work, thereby
causing it to OOM or severely degrading performance.

This change makes span partitioning a little more
stateful. Concretely, we now track how many partition
spans have been assigned to each node in the `planCtx`
that is used throughout the planning of a single statement.
This distribution is then used to limit the number of
partition spans we default to the gateway. Currently, by
default we allow the gateway to have:

`2 * average number of partition spans across the other instances`

If the gateway does not satisfy this heuristic we randomly
pick one of the other eligible instances. Note, if there
are no eligible instances except for the gateway, or the
gateway has received no spans yet, we will pick the gateway.

Fixes: cockroachdb#114079
Release note (bug fix): fixes a bug where large jobs running
with execution locality could result in the gateway being assigned
most of the work, causing performance degradation and cluster
instability.
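For readers following along, here is a simplified, self-contained sketch of the check described in the commit message above; the names are hypothetical and the real change threads these counts through the DistSQL planning context:

```go
// planCtxSketch stands in for the planning context described above; it
// just tracks how many partition spans each instance has been assigned
// so far within a single statement's planning.
type planCtxSketch struct {
	spanCounts map[int]int // instance ID -> partition spans assigned
}

// shouldDefaultToGateway applies the heuristic from the commit message:
// default to the gateway only while its span count is at most 2x the
// average count across the other eligible instances. If there are no
// other instances, or the gateway has received nothing yet, pick it.
func (p *planCtxSketch) shouldDefaultToGateway(gatewayID int, others []int) bool {
	if len(others) == 0 || p.spanCounts[gatewayID] == 0 {
		return true
	}
	total := 0
	for _, id := range others {
		total += p.spanCounts[id]
	}
	avg := float64(total) / float64(len(others))
	return float64(p.spanCounts[gatewayID]) <= 2*avg
}
```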
@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Nov 21, 2023
craig bot pushed a commit that referenced this issue Nov 22, 2023
114537: sql: prevent gateway from always being picked as the default r=adityamaru a=adityamaru

Co-authored-by: adityamaru <[email protected]>
@craig craig bot closed this as completed in 51fd26f Nov 22, 2023
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Nov 22, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Dec 1, 2023
@adityamaru adityamaru added GA-blocker branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Dec 1, 2023
@adityamaru
Contributor Author

reopening until the backport lands

@adityamaru adityamaru reopened this Dec 1, 2023
@github-project-automation github-project-automation bot moved this from Done to Triage in SQL Queries Dec 1, 2023
@yuzefovich yuzefovich removed this from SQL Queries Dec 1, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Dec 7, 2023
This change also adds a new session variable `distsql_plan_gateway_bias`
to control how many times the gateway will be picked as the default
target for a partition relative to the distribution of
partition spans across other nodes.

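A small variant of the same check with the bias made configurable, mirroring what `distsql_plan_gateway_bias` is described as controlling; this is only a sketch and the actual plumbing through session data is not shown:

```go
// underGatewayBias generalizes the 2x check above: the gateway keeps
// receiving spans only while it holds at most gatewayBias times the
// average span count of the other eligible instances. gatewayBias is
// meant to mirror the distsql_plan_gateway_bias session variable.
func underGatewayBias(counts map[int]int, gatewayID int, others []int, gatewayBias int) bool {
	if len(others) == 0 || counts[gatewayID] == 0 {
		// No alternatives, or the gateway has not been assigned anything yet.
		return true
	}
	total := 0
	for _, id := range others {
		total += counts[id]
	}
	avg := float64(total) / float64(len(others))
	return float64(counts[gatewayID]) <= float64(gatewayBias)*avg
}
```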
@adityamaru
Contributor Author

behaviour has been backported to 23.2

blathers-crl bot pushed a commit that referenced this issue Dec 8, 2023