sql: PartitionSpans for a large number of spans can result in a node hotspot on the gateway #114079
Labels
A-jobs
branch-release-23.2 (used to mark GA and release blockers, technical advisories, and bugs for 23.2)
C-bug (code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior)
GA-blocker
O-support (would prevent or help troubleshoot a customer escalation: bugs, missing observability/tooling, docs)
T-jobs
Comments
I want to chime in here -- it's not only DR jobs that are impacted, but also CDC (and we also have escalations around this issue).
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Nov 16, 2023
Previously, the instance resolver would always assign the partition span to the gateway if the gateway was in the set of eligible instances and we did not find an eligible instance with a better locality match. In large clusters running backup/CDC with execution locality, this could cause the gateway to get the lion's share of the work, causing it to OOM or severely throttling performance.

This change makes span partitioning a little more stateful. Concretely, we now track how many partition spans have been assigned to each node in the `planCtx` that is used throughout the planning of a single statement. This distribution is then used to limit the number of partition spans we default to the gateway. Currently, by default, we allow the gateway to have `2 * average number of partition spans across the other instances`. If the gateway does not satisfy this heuristic, we randomly pick one of the other eligible instances. Note that if there are no eligible instances other than the gateway, or the gateway has received no spans yet, we will pick the gateway.

Fixes: cockroachdb#114079

Release note (bug fix): Fixes a bug where large jobs running with execution locality could result in the gateway being assigned most of the work, causing performance degradation and cluster instability.
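For readers unfamiliar with the planner, here is a minimal Go sketch of the heuristic this commit describes. It is not the actual `distsql_physical_planner.go` code: the `partitionState` type, the `pickInstance` helper, and the instance IDs are made up for illustration, and only the "2 * average across the other instances" rule and the gateway fallback cases come from the commit message above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// partitionState is a stand-in for the per-statement counters the commit
// message says are kept on planCtx; it is not the real CockroachDB type.
type partitionState struct {
	spansPerInstance map[int]int // instance ID -> partition spans assigned so far
}

// pickInstance chooses the instance for a single partition span when no
// instance with a better locality match was found. The gateway stays the
// default only while it holds at most 2 * (average spans across the other
// eligible instances); otherwise a random other eligible instance is picked.
func (ps *partitionState) pickInstance(gatewayID int, eligible []int) int {
	var others, total int
	for _, id := range eligible {
		if id != gatewayID {
			others++
			total += ps.spansPerInstance[id]
		}
	}
	// With no other eligible instances, or while the gateway has no spans
	// yet, the gateway remains the default target.
	if others == 0 || ps.spansPerInstance[gatewayID] == 0 ||
		ps.spansPerInstance[gatewayID] <= 2*(total/others) {
		ps.spansPerInstance[gatewayID]++
		return gatewayID
	}
	// The gateway is over its budget: fall back to a random other instance.
	candidates := make([]int, 0, others)
	for _, id := range eligible {
		if id != gatewayID {
			candidates = append(candidates, id)
		}
	}
	chosen := candidates[rand.Intn(len(candidates))]
	ps.spansPerInstance[chosen]++
	return chosen
}

func main() {
	ps := &partitionState{spansPerInstance: map[int]int{}}
	gateway, eligible := 1, []int{1, 2, 3, 4}
	for i := 0; i < 1_000; i++ {
		ps.pickInstance(gateway, eligible)
	}
	// The gateway ends up near 2x the per-instance average instead of
	// taking every span.
	fmt.Println(ps.spansPerInstance)
}
```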
craig bot pushed a commit that referenced this issue on Nov 22, 2023
114537: sql: prevent gateway from always being picked as the default r=adityamaru a=adityamaru

Co-authored-by: adityamaru <[email protected]>
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Dec 1, 2023
Reopening until the backport lands.
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Dec 7, 2023
Previously, the instance resolver would always assign the partition span to the gateway if the gateway was in the set of eligible instances and we did not find an eligible instance with a better locality match. In large clusters running backup/CDC with execution locality, this could cause the gateway to get the lion's share of the work, causing it to OOM or severely throttling performance.

This change makes span partitioning a little more stateful. Concretely, we now track how many partition spans have been assigned to each node in the `planCtx` that is used throughout the planning of a single statement. This distribution is then used to limit the number of partition spans we default to the gateway. Currently, by default, we allow the gateway to have `2 * average number of partition spans across the other instances`. If the gateway does not satisfy this heuristic, we randomly pick one of the other eligible instances. Note that if there are no eligible instances other than the gateway, or the gateway has received no spans yet, we will pick the gateway.

This change also adds a new session variable `distsql_plan_gateway_bias` to control how many times the gateway may be picked as the default target for a partition relative to the distribution of partition spans across the other nodes.

Fixes: cockroachdb#114079

Release note (bug fix): Fixes a bug where large jobs running with execution locality could result in the gateway being assigned most of the work, causing performance degradation and cluster instability.
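Interpreting that knob loosely, the session variable simply replaces the hard-coded factor of 2 in the check above. The sketch below is an assumption about its semantics based only on this commit message; the variable name comes from the message itself, while the default of 2 and the exact comparison are guesses.

```go
package main

import "fmt"

// gatewayMayTakeSpan reports whether the gateway is still under its budget of
// bias * (average partition spans across the other eligible instances).
// bias stands in for the value of the distsql_plan_gateway_bias session
// variable; treating 2 as its default mirrors the "2 * average" heuristic
// above and is an assumption.
func gatewayMayTakeSpan(gatewaySpans, totalOtherSpans, numOtherInstances, bias int) bool {
	if numOtherInstances == 0 || gatewaySpans == 0 {
		return true
	}
	return gatewaySpans <= bias*(totalOtherSpans/numOtherInstances)
}

func main() {
	// With bias = 2 the gateway stops at twice the others' average
	// (600 > 2*250); raising the bias to 4 lets it keep taking spans.
	fmt.Println(gatewayMayTakeSpan(600, 1000, 4, 2)) // false
	fmt.Println(gatewayMayTakeSpan(600, 1000, 4, 4)) // true
}
```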
The behaviour has been backported to 23.2.
blathers-crl bot pushed a commit that referenced this issue on Dec 8, 2023
Bulk jobs and SQL queries share the same underlying infrastructure to partition a set of spans and find appropriate nodes based on the passed-in configurations. In https://github.com/cockroachlabs/support/issues/2659 we noticed a case where a `BACKUP WITH EXECUTION LOCALITY` could end up with all the spans handed to the gateway node, causing a severe degradation in performance.

In the escalation above we saw that the oracle that determines which node to send a span to returned instances that did not match our locality filter. This is not on its own a bug in the oracle, because there could be several reasons for it, such as no replicas in the region matching the locality filter, or txn timestamps not being old enough to be served by follower reads. Since none of our eligible instances could be picked, we ended up choosing the gateway for every span (cockroach/pkg/sql/distsql_physical_planner.go, line 1639 in 94f3605).

This bias towards the gateway makes sense for most OLTP SQL queries, as explained by @rharding6373 and @yuzefovich here: https://cockroachlabs.slack.com/archives/C01RX2G8LT1/p1699395502272229?thread_ts=1699393510.164829&cid=C01RX2G8LT1. Bulk jobs (backup, restore, CDC, PCR), however, rely entirely on this partitioning to divide their work and do not have an external load balancer to fall back on. Presumably, OLAP queries with thousands of spans would run into a similar hotspot problem.

A proposed solution would be to add a heuristic that defaults to the gateway up until some threshold, after which we randomly distribute the remaining spans to other healthy instances in our set of eligible instances. In an ideal world we'd have a `(1/n) * number of spans` division of work, where n = number of healthy, eligible instances, so we could say that if the gateway has been assigned `3 * (1/n) * number of spans` because of no better choice, we start round-robin assigning to other eligible instances (see the sketch below).

Jira issue: CRDB-33325
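To make the proposal concrete, here is a small Go sketch of the suggested heuristic, assuming the degenerate case from the escalation where every span falls back to the default choice. `assignSpans`, the instance IDs, and the exact cap arithmetic are illustrative, not CockroachDB code; only the "3 fair shares, then hand spans to the other eligible instances" rule comes from the paragraph above.

```go
package main

import "fmt"

// assignSpans sketches the proposed heuristic: the gateway keeps taking spans
// until it has been assigned 3 * (1/n) of the total, where n is the number of
// healthy, eligible instances; the remaining spans are then handed out
// round-robin to the other eligible instances.
func assignSpans(totalSpans, gatewayID int, eligible []int) map[int]int {
	assignments := make(map[int]int, len(eligible))
	n := len(eligible)
	if n == 0 {
		return assignments
	}
	gatewayCap := 3 * totalSpans / n

	// The other eligible instances, used for the round-robin fallback.
	var others []int
	for _, id := range eligible {
		if id != gatewayID {
			others = append(others, id)
		}
	}

	next := 0
	for i := 0; i < totalSpans; i++ {
		if assignments[gatewayID] < gatewayCap || len(others) == 0 {
			assignments[gatewayID]++
			continue
		}
		assignments[others[next%len(others)]]++
		next++
	}
	return assignments
}

func main() {
	// 10,000 spans across a gateway and 19 other eligible instances: the
	// gateway stops at 3 fair shares (1,500) and the rest is spread roughly
	// evenly across the others instead of piling onto one node.
	fmt.Println(assignSpans(10_000, 1, []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
		11, 12, 13, 14, 15, 16, 17, 18, 19, 20}))
}
```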