config,kvserver: don't unconditionally split on table boundaries for the host tenant #66063
Closed
Tracked by #73874
Labels
A-kv-distribution
Relating to rebalancing and leasing.
A-zone-configs
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Comments
#72389 (comment). Now that we can have a larger number of tables, doing something here seems worthwhile, even if gated behind a cluster setting.
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Apr 14, 2022
Fixes cockroachdb#72389. Fixes cockroachdb#66063 (gated behind a cluster setting). This should drastically reduce the total number of ranges in the system, especially when running with a large number of tables and/or tenants.

To understand what the new set of split points is, consider the following test snippet:

    exec-sql tenant=11
    CREATE DATABASE db;
    CREATE TABLE db.t0();
    CREATE TABLE db.t1();
    CREATE TABLE db.t2();
    CREATE TABLE db.t3();
    CREATE TABLE db.t4();
    CREATE TABLE db.t5();
    CREATE TABLE db.t6();
    CREATE TABLE db.t7();
    CREATE TABLE db.t8();
    CREATE TABLE db.t9();
    ALTER TABLE db.t5 CONFIGURE ZONE using num_replicas = 42;
    ----

    # If installing a unique zone config for a table in the middle, we
    # should observe three splits (before, the table itself, and after).
    diff offset=48
    ----
    --- gossiped system config span (legacy)
    +++ span config infrastructure (current)
    ...
     /Tenant/10                 database system (tenant)
     /Tenant/11                 database system (tenant)
    +/Tenant/11/Table/106       range default
    +/Tenant/11/Table/111       num_replicas=42
    +/Tenant/11/Table/112       range default

This PR introduces two cluster settings to selectively opt into this optimization: spanconfig.{tenant,host}_coalesce_adjacent.enabled (defaulting to true and false respectively). We also don't coalesce system table ranges on the host tenant.

We had a few implementation choices here:

(a) Down in KV, augment the spanconfig.Store to coalesce the in-memory state when updating entries. On every update, we'd check if the span we're writing to has the same config as the preceding and/or succeeding one, and if so, write out a larger span instead. We previously prototyped a form of this in cockroachdb#68491.

Pros:
- reduced memory footprint of spanconfig.Store
- read path (computing splits) stays fast -- just look up the right span from the backing interval tree

Cons:
- uses more persisted storage than necessary
- difficult to disable the optimization dynamically (though still possible -- we'd effectively have to restart the KVSubscriber and populate in-memory span config state using a store that does/doesn't coalesce configs)
- difficult to implement; our original prototype did not have cockroachdb#73150, which is important for reducing reconciliation round trips

(b) Same as (a), but coalesce configs up in the spanconfig.Store maintained in the reconciler itself.

Pros:
- reduced storage use (both persisted and in-memory)
- read path (computing splits) stays fast -- just look up the right span from the backing interval tree

Cons:
- very difficult to disable the optimization dynamically (each tenant process would need to be involved)
- most difficult to implement

(c) Same as (a), but through another API on the spanconfig.Store interface that accepts only a single update at a time and does not generate deltas (not something we need down in KV). Removes the implementation complexity.

(d) Keep the contents of `system.span_configurations` and the in-memory state of spanconfig.Stores as they are today, uncoalesced. When determining split points, iterate through adjacent configs within the provided key bounds and see whether certain split keys can be ignored.

Pros:
- easiest to implement
- easy to disable the optimization dynamically, for example through a cluster setting

Cons:
- uses more storage (persisted and in-memory) than necessary
- read path (computing splits) is more expensive, since it iterates through adjacent configs

This PR implements option (d).

For a benchmark of how slow (d) is in practice with varying numbers of entries to scan over (10k, 100k, 1M):

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10          1166842 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10        12273375 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10      140591766 ns/op
    PASS

It's feasible that we rework this in favor of (c) in the future.

Release note: None

Release justification: high-benefit change to existing functionality (affecting only multi-tenant clusters).
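To make option (d) concrete, here is a minimal Go sketch of the idea (the `entry` type and `computeSplitKey` helper are hypothetical, for illustration only -- not the actual spanconfigstore code): walk the uncoalesced entries overlapping the requested bounds in key order and only report a split key where two adjacent entries carry different configs.

```go
package main

import "fmt"

// entry is a simplified stand-in for an uncoalesced span config record: a
// contiguous [start, end) keyspan and an opaque config identifier.
type entry struct {
	start, end string // pretty-printed keys, ordered lexicographically here
	config     string // stand-in for a full span config
}

// computeSplitKey returns the first key in (start, end) that still needs to
// be a split point, or "" if every boundary in that range sits between
// entries with identical configs (i.e. they can share a range).
func computeSplitKey(entries []entry, start, end string) string {
	var prev *entry
	for i := range entries {
		e := &entries[i]
		if e.end <= start || e.start >= end {
			continue // entirely outside the requested bounds
		}
		// A boundary between two adjacent in-bounds entries only needs a
		// split if their configs diverge.
		if prev != nil && e.start > start && prev.config != e.config {
			return e.start
		}
		prev = e
	}
	return ""
}

func main() {
	// Ten adjacent tables with the default config, except table 111 which
	// carries the num_replicas=42 override (mirrors the test snippet above).
	var entries []entry
	for id := 106; id <= 115; id++ {
		cfg := "range default"
		if id == 111 {
			cfg = "num_replicas=42"
		}
		entries = append(entries, entry{
			start:  fmt.Sprintf("/Table/%d", id),
			end:    fmt.Sprintf("/Table/%d", id+1),
			config: cfg,
		})
	}
	fmt.Println(computeSplitKey(entries, "/Table/106", "/Table/116")) // /Table/111
	fmt.Println(computeSplitKey(entries, "/Table/111", "/Table/116")) // /Table/112
	fmt.Println(computeSplitKey(entries, "/Table/112", "/Table/116")) // "" (no more splits needed)
}
```

Against the test snippet above, this yields splits only before and after the table carrying the num_replicas=42 override, matching the three-splits behavior the commit message describes.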
irfansharif added a commit to irfansharif/cockroach that referenced this issue on Apr 27, 2022
Fixes cockroachdb#72389. Fixes cockroachdb#66063 (gated behind a cluster setting).
blathers-crl bot pushed a commit that referenced this issue on Apr 29, 2022
Fixes #72389. Fixes #66063 (gated behind a cluster setting).
irfansharif added a commit to irfansharif/cockroach that referenced this issue on May 7, 2023
Fixes cockroachdb#81008. We built the basic infrastructure to coalesce ranges across table boundaries back in 22.2 as part of cockroachdb#66063. We've enabled this optimization for secondary tenants since then, but hadn't for the system tenant because of two primary blockers:

- cockroachdb#93617: SHOW RANGES was broken by coalesced ranges.
- cockroachdb#84105: the APIs to compute sizes for schema objects (used in our UI) were broken by coalesced ranges.

In both cases we had baked in the assumption of a minimum of one {table,index,partition} per range. These blockers didn't apply to secondary tenants at the time since they had access to neither SHOW RANGES nor the UI pages where these schema statistics were displayed. We've addressed both blockers in the 23.1 cycle as part of bridging the compatibility between secondary tenants and yesteryear's system tenant:

- cockroachdb#93644 revised SHOW RANGES and crdb_internal.ranges{,_no_leases}, both internally and in their external UX, to accommodate ranges straddling table/database boundaries.
- cockroachdb#96223 re-worked our SpanStats API to work in the face of coalesced ranges, addressing cockroachdb#84105.

Release note (general change): CockroachDB previously used separate ranges for each table, index, or partition. This is no longer true: multiple tables, indexes, and partitions can now get packed into the same range. For users with many such "schema objects", this will reduce the total range count in their clusters. This is especially true if individual tables, indexes, or partitions are smaller than the configured maximum range size (controlled using zone configs, specifically the range_max_bytes parameter). We made this change to improve scalability with respect to the number of schema objects, since the underlying range count is no longer a bottleneck. Users upgrading from 22.2, once their upgrade is finalized, may observe a round of range merges and snapshot transfers (to power said range merges) as a result of this change. Users who want to opt out of this optimization can use the following cluster setting:

    SET CLUSTER SETTING spanconfig.storage_coalesce_adjacent.enabled = false;
irfansharif added a commit to irfansharif/cockroach that referenced this issue on May 8, 2023
Fixes cockroachdb#81008.
craig bot pushed a commit that referenced this issue on May 8, 2023
98820: spanconfig: enable range coalescing by default r=irfansharif a=irfansharif

Fixes #81008.

Co-authored-by: irfan sharif <[email protected]>
Is your feature request related to a problem? Please describe.
On the host tenant, we unconditionally split on all table/index/partition boundaries. There are two reasons this might have been desirable:
(a) Performance: splitting off each table/index/partition into its own range guarantees independent range processing of the data within that table/index/partition.
(b) Diverging zone configs: if two tables have different zone configs attached to them (remember: the "unit" of what a zone config can apply to is a db/table/index/partition), that necessarily implies a split. By unconditionally splitting on table boundaries, we didn't have to write code to reason about adjacent-keys-but-from-different-tables-with-diverging-zone-configs.
Describe the solution you'd like
For (a), we could support syntax that lets users request intentional splits on table boundaries (#65903). For (b), we're soon going to introduce zone configs for secondary tenants. By default we don't split on table boundaries for secondary tenants, given the order-of-magnitude increase in range count that would create, but with secondary tenant zone configs we're opening the door to tenant-initiated splits in the tenant keyspace. That means we're going to write code that reasons about adjacent-keys-but-from-different-tables-with-diverging-zone-configs to determine where the necessary split points are, so we may as well coalesce host tenant tables into single ranges when possible (see the sketch below). If secondary indexes or groups of tables are placed on the same range, txns that span them could also benefit from 1PC optimizations.
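As a rough illustration of the reasoning above (a sketch with simplified, assumed types -- not CockroachDB's actual zone-config resolution or split logic), a table boundary only needs to remain a split point when the effective configs of its neighbors diverge, or when a neighbor explicitly opted into its own range via the kind of syntax floated in #65903:

```go
package main

import "fmt"

// tbl pairs a table ID with the zone config it effectively resolves to
// (after inheritance from its database or the default zone) and whether a
// user explicitly asked to keep it on its own range. Both fields are
// simplifications for illustration.
type tbl struct {
	id            int
	config        string
	explicitSplit bool
}

// requiredSplits walks adjacent tables in keyspace order and keeps only the
// boundaries that still need a split: the neighboring effective configs
// diverge (reason b), or a neighbor opted into its own range (reason a).
// Every other table boundary can be coalesced away.
func requiredSplits(tables []tbl) []int {
	var splits []int
	for i := 1; i < len(tables); i++ {
		prev, cur := tables[i-1], tables[i]
		if prev.config != cur.config || prev.explicitSplit || cur.explicitSplit {
			splits = append(splits, cur.id) // split at the start of tables[i]
		}
	}
	return splits
}

func main() {
	// Ten adjacent tables inheriting the default config, with one (ID 111)
	// carrying an explicit num_replicas=42 override.
	var tables []tbl
	for id := 106; id <= 115; id++ {
		cfg := "range default"
		if id == 111 {
			cfg = "num_replicas=42"
		}
		tables = append(tables, tbl{id: id, config: cfg})
	}
	fmt.Println(requiredSplits(tables)) // [111 112]: only around the overridden table
}
```

With ten adjacent tables and a single override, only the two boundaries around the overridden table survive; every other boundary can be coalesced.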
Describe alternatives you've considered
Keeping the status quo. I think this issue only really matters for on-prem customers (without multi-tenancy), where each table today creates a new range. For users with many small tables, the overhead of that many tiny ranges might be undesirable. With tenant zone configs we'll soon be addressing the O(descriptors) bottleneck that today prevents having too many tables, so perhaps it's time to look ahead.
Jira issue: CRDB-7866