spanconfig: limit # of tenant span configs #77337
Conversation
Force-pushed from 8004991 to fe3b095
Force-pushed from e7a5752 to 0fa1e12
Force-pushed from 0fa1e12 to 9530af9
The earlier implementation relied on decoding keys in order to determine precise split points. When plugging this library into the component designed for tenant-side span config limiting (cockroachdb#77337, to address cockroachdb#70555), we realized it's not possible to grab a hold of type-hydrated table descriptors (needed for decoding). This is because today it's possible to GC type descriptors before GC-ing table descriptors that refer to them. Given the integration point for this library was around the GC job, we had to forego the decoding routines. Instead of computing the precise split keys, we can however compute how many there are without having to look at keys at all. Consider our table hierarchy:

    table -> index -> partition -> partition -> (...) -> partition

Each partition is either a PARTITION BY LIST kind (where it can then be further partitioned, or not), or a PARTITION BY RANGE kind (no further partitioning possible). We can classify each parent-child link into two types:

(a) Contiguous: {index, list partition} -> range partition
(b) Non-contiguous: table -> index, {index, list partition} -> list partition

- Contiguous links are the sort where each child span is contiguous with another, and the set of all child spans encompasses the parent's span. For an index that's partitioned by range:

```sql
CREATE TABLE db.range(i INT PRIMARY KEY, j INT) PARTITION BY RANGE (i) (
  PARTITION less_than_five VALUES FROM (minvalue) TO (5),
  PARTITION between_five_and_ten VALUES FROM (5) TO (10),
  PARTITION greater_than_ten VALUES FROM (10) TO (maxvalue)
);
```

  With table ID as 106, the parent index span is /Table/106/{1-2}. The child spans are /Table/106/1{-/5}, /Table/106/1/{5-10} and /Table/106/{1/10-2}. They're contiguous; put together they wholly encompass the parent span.

- Non-contiguous links, by contrast, are when child spans are neither contiguous with respect to one another, nor do they start and end at the parent span's boundaries. For a table with a secondary index:

```sql
CREATE TABLE db.t(i INT PRIMARY KEY, j INT);
CREATE INDEX idx ON db.t (j);
DROP INDEX db.t@idx;
CREATE INDEX idx ON db.t (j);
```

  With table ID as 106, the parent table span is /Table/10{6-7}. The child spans are /Table/106/{1-2} and /Table/106/{3-4}. Compared to the parent span, we're missing /Table/106{-/1}, /Table/106/{2-3}, /Table/10{6/4-7}.

For N children:
- For a contiguous link, the number of splits equals the number of child elements (i.e. N).
- For a non-contiguous link, the number of splits equals N + 1 + N. For N children there are N - 1 gaps between them, plus 2 gaps at the start and end of the parent span; summing those with the N child spans themselves, we get the formula above. This assumes that the N child elements aren't further subdivided; if they are, we can compute the count recursively, and the formula becomes N + 1 + Σ(grandchild spans).

Computing the split count this way does come with one downside: we might be overcounting. When comparing keys, we could for example recognize that partition-by-list values are adjacent, and therefore that there's no "gap" between them. We could also do this by comparing encoded keys with one another. We just didn't, because it's more annoying code to write, and overcounting here is more than fine for our purposes.

Release justification: non-production code
Release note: None
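To make the arithmetic above concrete, here's a minimal, self-contained Go sketch of the recursive counting; the element type and its fields are illustrative stand-ins, not the library's actual code:

```go
package main

import "fmt"

// element is a hypothetical stand-in for a table/index/partition node.
type element struct {
	contiguous bool      // children tile the parent span exactly (e.g. range partitions)
	children   []element // empty for leaves
}

// splitCount returns how many spans an element decomposes into, per the
// formulas described in the commit message above.
func splitCount(e element) int {
	if len(e.children) == 0 {
		return 1 // a leaf contributes a single span
	}
	sum := 0
	for _, c := range e.children {
		sum += splitCount(c) // Σ over (grand)child spans
	}
	if e.contiguous {
		return sum // contiguous children wholly cover the parent, no gaps
	}
	// Non-contiguous: N-1 interior gaps + 2 boundary gaps, plus the children.
	return len(e.children) + 1 + sum
}

func main() {
	// Mirrors the db.t example: a table with two child index spans
	// (primary index plus the recreated secondary index), neither of
	// which is further partitioned.
	table := element{children: []element{{}, {}}}
	fmt.Println(splitCount(table)) // 2 + 1 + 2 = 5
}
```

The printed value matches the gap-counting walkthrough above: two child spans, one gap between them, and two gaps at the parent span's boundaries.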
Force-pushed from 73583d2 to 65a4d86
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)
pkg/sql/catalog/tabledesc/structured.go, line 53 at r1 (raw file):
Previously, ajwerner wrote…
I'd rather you just had the one. If you're sad that ClusterVersion is a field as opposed to a method, feel free to change it to be a method that exposes a value stored in an unexported field. I don't like having both.
Done.
CI failure is an unrelated Google Cloud Storage flake; we've been playing around with stuff in TC, I hear.
Can you add a tenant logic test that shows the end-to-end behavior working as expected?
Reviewed 10 of 52 files at r1, 11 of 23 files at r2, 1 of 11 files at r4, 22 of 30 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @irfansharif)
pkg/spanconfig/spanconfig.go, line 494 at r5 (raw file):
```go
}

func isNil(table catalog.TableDescriptor) bool {
```
This is upsetting; when do you find a value being returned for which the nil-check doesn't work because it's a non-nil concrete value? I guess it's the new method to return the original version? Can we fix that? I'm not cool with this sort of defensiveness creeping in. We should nil-check when dealing in concrete values before returning interfaces. I feel like we should put a linter in play to check for this.
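For context, a minimal sketch of the kind of defensive helper being objected to (hypothetical; the PR's actual isNil may differ), guarding against an interface that holds a typed but nil concrete pointer:

```go
// A hypothetical sketch only; the actual helper in the PR may differ.
package spanconfig

import (
	"reflect"

	"github.com/cockroachdb/cockroach/pkg/sql/catalog"
)

// isNil reports whether the descriptor interface is nil, including the
// case where it holds a typed but nil concrete pointer, which a plain
// `table == nil` comparison misses.
func isNil(table catalog.TableDescriptor) bool {
	if table == nil {
		return true
	}
	v := reflect.ValueOf(table)
	return v.Kind() == reflect.Ptr && v.IsNil()
}
```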
pkg/sql/catalog/tabledesc/structured.go, line 1014 at r5 (raw file):
```go
// ClusterVersion returns the version of the table descriptor read from the
// store, if any.
```
Can you put a TODO with my name on it here to make this deal in catalog.TableDescriptor? IIRC the use of the raw struct was laziness on my part to not have to change as much code.
pkg/sql/catalog/tabledesc/structured.go, line 1025 at r5 (raw file):
```go
// the mutations.
func (desc *Mutable) OriginalDescriptor() catalog.Descriptor {
	return desc.original
```
Let's avoid the nil value semantics by doing the following:

```go
if desc.original != nil {
	return desc.original
}
return nil
```
Force-pushed from 54090d5 to d965162
tenant logic test
Just double checking we saw the tests under pkg/ccl/spanconfigccl/spanconfiglimiterccl/testdata/*; they're doing what I think we'd do under a tenant logic test, except with directives to re-configure the per-tenant limit dynamically. Are you asking to test something different? With the logic test framework running under a tenant, I think I'd need to introduce special commands/testing knobs to be able to re-configure the limit from the host tenant.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)
pkg/sql/catalog/tabledesc/structured.go, line 1025 at r5 (raw file):
Previously, ajwerner wrote…
Let's avoid the nil value semantics by doing the following:

```go
if desc.original != nil {
	return desc.original
}
return nil
```
Done. Huh, looking at the diff I wouldn't have thought there was any behavioural difference; seems like a Go footgun.
Force-pushed from d965162 to 433df2e
I filed #79426 to make it easier.
What I'm looking for is a test that shows errors when trying to create tables, add partitioning, or do various other activities, and then exercises recovery from these various scenarios. I see one test which exercises the error. It'd be good to have a test that shows that the GC job will recover some quota when it finishes. It'd be good to show that the user can, say, remove partitioning and find that doing so allows them to create tables again.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @irfansharif)
pkg/sql/catalog/descs/txn.go, line 262 at r6 (raw file):
```go
}
if shouldLimit {
	return errExceededSpanCountLimit
```
something, somewhere, should be putting a pgcode around this error.
pkg/sql/catalog/tabledesc/structured.go, line 1025 at r5 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Done. Huh, looking at the diff I wouldn't have thought there was any behavioural difference; seems like a Go footgun.
This is the #1 footgun in the language as far as I'm concerned. The mindset you need to have is that any time you see an interface, it should be thought of as a box with two values: [ptr to itable for type][ptr to value]. With the old code, you'd be returning [ptr to catalog.Descriptor itable for *immutable][nil], and now you're returning [nil][nil]. Only the latter compares equal to nil. https://research.swtch.com/interfaces is worth a read every few years.
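To make the footgun concrete, a small self-contained example (stand-in names, not the PR's actual types) showing why a typed nil pointer boxed in an interface doesn't compare equal to nil:

```go
package main

import "fmt"

// Descriptor and immutable are illustrative stand-ins for catalog.Descriptor
// and the concrete descriptor type; the names are not the PR's.
type Descriptor interface{ Name() string }

type immutable struct{}

func (*immutable) Name() string { return "immutable" }

// typedNil boxes a nil *immutable into the interface: the result is
// [itable for *immutable][nil], which does not compare equal to nil.
func typedNil() Descriptor {
	var d *immutable
	return d
}

// untypedNil returns the [nil][nil] interface value, which does.
func untypedNil() Descriptor {
	var d *immutable
	if d != nil {
		return d
	}
	return nil
}

func main() {
	fmt.Println(typedNil() == nil)   // false
	fmt.Println(untypedNil() == nil) // true
}
```

The fix discussed above follows the same shape as untypedNil: check the concrete pointer for nil before letting it escape into an interface.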
Force-pushed from 0a172e2 to 4d45de2
It'd be good to show that the user can, say, remove partitioning and find that that will allow it to create tables again.
Added a few tests that show recovery of quota from repartitioning and dropping indexes. In writing these tests I'm running into something that feels like a bug outside of span configs. Running the following test:
exec-sql tenant=10
CREATE TABLE db.list_partitions(i INT PRIMARY KEY, j INT);
CREATE INDEX idx_i ON db.list_partitions (i);
CREATE INDEX idx_j ON db.list_partitions (j);
ALTER INDEX db.list_partitions@idx_i PARTITION BY LIST (i) (
PARTITION one_and_five VALUES IN (1, 5),
PARTITION everything_else VALUES IN (DEFAULT)
);
This showed that the span count accounting only added a delta for the descriptor version between CREATE INDEX idx_j ON db.list_partitions (j); and ALTER INDEX db.list_partitions@idx_i PARTITION BY LIST (i). Because the block above was being committed in the same txn, the "original descriptor" (which I would've expected to be nil) was one with .GetVersion() == 1. I'm not sure what's going on. Handling .GetVersion() == 1 explicitly doesn't work either, since we need to distinguish between an original descriptor with version == 1 that was newly created in the current txn vs. an original descriptor with version == 1 created in a previously committed one. @ajwerner mentioned he'd play around with this repro to see if something stands out.
It'd be good to have a test that shows that the GC job will recover some quota when it finishes.
TestDropTableLowersSpanCount.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)
pkg/sql/catalog/descs/txn.go, line 262 at r6 (raw file):
Previously, ajwerner wrote…
something, somewhere, should be putting a pgcode around this error.
Done, just did it here (pgcode.ConfigurationLimitExceeded felt eerily appropriate).
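For reference, a minimal sketch of what that might look like, assuming the sentinel in pkg/sql/catalog/descs/txn.go is constructed roughly this way; the PR's actual code may differ:

```go
// A sketch only: the actual error construction/wrapping in the PR may differ.
package descs

import (
	"github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgcode"
	"github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgerror"
)

// errExceededSpanCountLimit carries pgcode.ConfigurationLimitExceeded so
// clients see a meaningful SQLSTATE rather than a generic internal error.
var errExceededSpanCountLimit = pgerror.Newf(
	pgcode.ConfigurationLimitExceeded, "exceeded limit for number of table spans")
```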
Force-pushed from 9a55d46 to 3e196df
In writing these tests I'm running into something that feels like a bug outside of span configs.
Tests here work just fine after rebasing on top of #79561. I'll merge after that goes in. Is there anything outstanding in this PR?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)
Force-pushed from 3e196df to a863132
Fixes cockroachdb#70555. In order to limit the number of span configs a tenant's able to install, we introduce a tenant-side spanconfig.Limiter. It presents the following interface:

```go
// Limiter is used to limit the number of span configs installed by
// secondary tenants. It takes in a delta (typically the difference
// in span configs between the committed and uncommitted state in
// the txn), uses it to maintain an aggregate counter, and informs
// the caller if exceeding the prescribed limit.
type Limiter interface {
	ShouldLimit(
		ctx context.Context, txn *kv.Txn, delta int,
	) (bool, error)
}
```

The delta is computed using a static helper, spanconfig.Delta:

```go
// Delta considers both the committed and uncommitted state of a
// table descriptor and computes the difference in the number of
// spans we can apply a configuration over.
func Delta(
	ctx context.Context, s Splitter,
	committed, uncommitted catalog.TableDescriptor,
) (int, error)
```

This limiter only applies to secondary tenants. The counter is maintained in a newly introduced (tenant-only) system table, using the following schema:

```sql
CREATE TABLE system.span_count (
	singleton  BOOL DEFAULT TRUE,
	span_count INT NOT NULL,
	CONSTRAINT "primary" PRIMARY KEY (singleton),
	CONSTRAINT single_row CHECK (singleton),
	FAMILY "primary" (singleton, span_count)
);
```

We need just two integration points for spanconfig.Limiter:
- Right above CheckTwoVersionInvariant, where we're able to hook into the committed and to-be-committed descriptor state before txn commit;
- In the GC job, when GC-ing table state. We decrement a table's split count when GC-ing the table for good.

The per-tenant span config limit used is controlled by a new tenant read-only cluster setting: spanconfig.tenant_limit. Multi-tenant cluster settings (cockroachdb#73857) provide the infrastructure for the host tenant to control this setting cluster-wide, or to target a specific tenant at a time.

We also need a migration here, to start tracking span counts for clusters with pre-existing tenants. We introduce a migration that scans over all table descriptors and seeds system.span_count with the right value. Given that cluster version gates disseminate asynchronously, we also need a preliminary version to start tracking incremental changes.

It's useful to introduce the notion of debt. This will be handy if/when we lower per-tenant limits, and also in the migration above, since it's possible for pre-existing tenants to have committed state in violation of the prescribed limit. When in debt, schema changes that add new splits will be rejected (dropping tables/indexes/partitions/etc. will work just fine).

When attempting a txn that goes over the configured limit, the UX is as follows:

```
> CREATE TABLE db.t42(i INT PRIMARY KEY);
pq: exceeded limit for number of table spans
```

Release justification: low risk, high benefit change
Release note: None
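As a rough illustration of how Delta and ShouldLimit compose at the commit-time integration point, here's a sketch under assumptions: checkSpanCountLimit and its exact placement are illustrative, not the PR's actual hook.

```go
// An illustrative sketch, not the PR's actual integration code.
package descs

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/spanconfig"
	"github.com/cockroachdb/cockroach/pkg/sql/catalog"
	"github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgcode"
	"github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgerror"
)

// checkSpanCountLimit computes the span delta between the committed and
// to-be-committed descriptor state and asks the limiter whether the txn
// should be rejected.
func checkSpanCountLimit(
	ctx context.Context,
	txn *kv.Txn,
	limiter spanconfig.Limiter,
	splitter spanconfig.Splitter,
	committed, uncommitted catalog.TableDescriptor,
) error {
	delta, err := spanconfig.Delta(ctx, splitter, committed, uncommitted)
	if err != nil {
		return err
	}
	shouldLimit, err := limiter.ShouldLimit(ctx, txn, delta)
	if err != nil {
		return err
	}
	if shouldLimit {
		// Decorated with a pgcode, per the review discussion above.
		return pgerror.Newf(
			pgcode.ConfigurationLimitExceeded, "exceeded limit for number of table spans")
	}
	return nil
}
```

The shouldLimit branch is what ultimately surfaces to the client as the "pq: exceeded limit for number of table spans" error shown above.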
Force-pushed from a863132 to 90054c6
Do we have any tools to re-compute the count or to reset it manually? It'd make me feel a lot better if we did. If there were a leak but we had an easy mitigation, that'd be nice. What is the writability of the column by root? What sort of access do serverless tenants have? I think ideally the table would only be writable by node, and then we'd have a builtin or something to recompute the count by scanning all of the descriptors, and another which does both a recompute and a set operation, perhaps also requiring that it's a single-statement transaction. Can you file a follow-up task to build something?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)
Filed #79616. I'll bors on green. Thanks for reviewing and the speedy bug fix!
bors r+
Build succeeded.