spanconfig: limit # of tenant span configs #77337

Merged 1 commit on Apr 8, 2022

Conversation

irfansharif
Contributor

@irfansharif irfansharif commented Mar 3, 2022

Fixes #70555. In order to limit the number of span configs a tenant's
able to install, we introduce a tenant-side spanconfig.Limiter. It
presents the following interface:

// Limiter is used to limit the number of span configs installed by
// secondary tenants. It considers the committed and uncommitted
// state of a table descriptor and computes the "span" delta: the
// change in the number of units we can apply a configuration over.
// It uses these deltas to maintain an aggregate counter, informing
// the caller when the configured limit would be exceeded.
type Limiter interface {
  ShouldLimit(
    ctx context.Context, txn *kv.Txn,
    committed, uncommitted catalog.TableDescriptor,
  ) (bool, error)
}

This limiter only applies to secondary tenants. The counter is
maintained in a newly introduced (tenant-only) system table, using the
following schema:

CREATE TABLE system.span_count (
  singleton  BOOL DEFAULT TRUE,
  span_count INT NOT NULL,
  CONSTRAINT "primary" PRIMARY KEY (singleton),
  CONSTRAINT single_row CHECK (singleton),
  FAMILY "primary" (singleton, span_count)
);

We need just two integration points for spanconfig.Limiter:

  • Right above CheckTwoVersionInvariant, where we're able to hook into
    the committed and to-be-committed descriptor state before txn commit.
  • In the GC job, when gc-ing table state. We decrement a table's split
    count when GC-ing the table for good.

The per-tenant span config limit is controlled by a new tenant
read-only cluster setting: spanconfig.tenant_limit. Multi-tenant cluster
settings (#73857) provide the infrastructure for the host tenant to
control this setting cluster-wide, or to target a specific tenant at a
time.

We also need a migration here, to start tracking span counts for
clusters with pre-existing tenants. We introduce a migration that scans
over all table descriptors and seeds system.span_count with the right
value. Given that cluster version gates disseminate asynchronously, we
also need a preliminary version to start tracking incremental changes.

It's useful to introduce the notion of debt. This will be handy if/when
we lower per-tenant limits, and also in the migration above since it's
possible for pre-existing tenants to have committed state in violation
of the prescribed limit. When in debt, schema changes that add new
splits will be rejected (dropping tables/indexes/partitions/etc. will
work just fine).
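
To make the counter-and-debt semantics concrete, here's a minimal sketch
in plain Go (no database; toyLimiter and shouldLimit are illustrative
stand-ins, not the real spanconfig.Limiter implementation):

package main

import "fmt"

type toyLimiter struct {
  count int // aggregate number of spans currently configured
  limit int // the prescribed per-tenant limit
}

// shouldLimit reports whether applying delta would exceed the limit.
// Non-positive deltas (drops, GC) are always admitted, so a tenant in
// debt can still shed spans; positive deltas are rejected until the
// counter is back under the limit.
func (l *toyLimiter) shouldLimit(delta int) bool {
  if delta <= 0 {
    l.count += delta
    return false
  }
  if l.count+delta > l.limit {
    return true // reject; the counter is left unchanged
  }
  l.count += delta
  return false
}

func main() {
  l := &toyLimiter{count: 7, limit: 5} // in debt, say after a lowered limit
  fmt.Println(l.shouldLimit(1))  // true: adding splits is rejected while in debt
  fmt.Println(l.shouldLimit(-3)) // false: drops recover quota (count is now 4)
  fmt.Println(l.shouldLimit(1))  // false: back under the limit (count is now 5)
}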

When attempting a txn that goes over the configured limit, the UX is as
follows:

> CREATE TABLE db.t2(i INT PRIMARY KEY);
pq: exceeded limit for number of table spans

Release note: None
Release justification: low risk, high benefit change

@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the 220225.schema-limiter branch 3 times, most recently from 8004991 to fe3b095 on March 8, 2022 18:31
@irfansharif irfansharif changed the title from "[wip,dnm] spanconfig: limit # of tenant span configs" to "spanconfig: limit # of tenant span configs" on Mar 8, 2022
@irfansharif irfansharif force-pushed the 220225.schema-limiter branch 2 times, most recently from e7a5752 to 0fa1e12 on March 9, 2022 05:25
@irfansharif irfansharif force-pushed the 220225.schema-limiter branch from 0fa1e12 to 9530af9 on March 9, 2022 23:31
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 10, 2022
The earlier implementation relied on decoding keys in order to determine
precise split points. When plugging this library into the component
designed for tenant-side span config limiting (cockroachdb#77337, to address
cockroachdb#70555), we realized it's not possible to grab hold of type-hydrated
table descriptors (needed for decoding). This is because today it's
possible to GC type descriptors before GC-ing table descriptors that
refer to them. Given the integration point for this library was around
the GC job, we had to forego decoding routines.

Instead of computing the precise split keys, however, we can compute how
many there are without having to look at keys at all.

Consider our table hierarchy:

    table -> index -> partition -> partition -> (...) -> partition

Each partition is either a PARTITION BY LIST kind (where it can then be
further partitioned, or not), or a PARTITION BY RANGE kind (no further
partitioning possible). We can classify each parent-child link into two
types:

(a) Contiguous      {index, list partition} -> range partition
(b) Non-contiguous  table -> index, {index, list partition} -> list partition

- Contiguous links are the sort where each child span is contiguous with
  another, and the set of all child spans wholly encompasses the parent's
  span. For an index that's partitioned by range:

    CREATE TABLE db.range(i INT PRIMARY KEY, j INT)
    PARTITION BY RANGE (i) (
      PARTITION less_than_five       VALUES FROM (minvalue) TO (5),
      PARTITION between_five_and_ten VALUES FROM (5) TO (10),
      PARTITION greater_than_ten     VALUES FROM (10) TO (maxvalue)
    );

  With a table ID of 106, the parent index span is /Table/106/{1-2}. The
  child spans are /Table/106/1{-/5}, /Table/106/1/{5-10} and
  /Table/106/{1/10-2}. They're contiguous; put together they wholly
  encompass the parent span.

- Non-contiguous links, by contrast, are when child spans are neither
  contiguous with respect to one another, nor do they start and end at
  the parent span's boundaries. For a table with a secondary index:

    CREATE TABLE db.t(i INT PRIMARY KEY, j INT);
    CREATE INDEX idx ON db.t (j);
    DROP INDEX db.t@idx;
    CREATE INDEX idx ON db.t (j);

  With a table ID of 106, the parent table span is /Table/10{6-7}. The
  child spans are /Table/106/{1-2} and /Table/106/{3-4}. Compared to the
  parent span, we're missing /Table/106{-/1}, /Table/106/{2-3},
  /Table/10{6/4-7}.

For N children:
- For a contiguous link, the number of splits equals the number of child
  elements (i.e. N).
- For a non-contiguous link, the number of splits equals N + 1 + N: the N
  child spans themselves, the N - 1 gaps between them, and the 2 gaps at
  the start and end of the parent span. This assumes the N child elements
  aren't further subdivided; if they are, we can compute the count
  recursively, and the formula becomes N + 1 + Σ(grandchild spans). See
  the sketch after this list.
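
To make the counting rule concrete, here's a hedged sketch in plain Go
(node and splits are illustrative names, not the actual splitter code):

    // node models an element of the table -> index -> partition
    // hierarchy; contiguous describes the link to its children: true
    // for {index, list partition} -> range partition, false otherwise.
    type node struct {
      contiguous bool
      children   []node
    }

    // splits counts the spans an element decomposes into. Contiguous
    // links contribute only their children's counts; non-contiguous
    // links also add the N - 1 interior gaps and the 2 boundary gaps
    // (N + 1 extra in total).
    func splits(n node) int {
      if len(n.children) == 0 {
        return 1 // a leaf covers a single contiguous span
      }
      total := 0
      for _, c := range n.children {
        total += splits(c)
      }
      if !n.contiguous {
        total += len(n.children) + 1
      }
      return total
    }

For the examples above: the range-partitioned index is a contiguous
parent with 3 leaf children, so splits == 3; the table with two
secondary indexes is a non-contiguous parent with 2 leaf children, so
splits == 2 + (2 + 1) == 5, matching the N + 1 + N formula.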

Computing split counts this way does come with one downside: we might be
overcounting. When comparing keys, we could for example recognize that
partition-by-list values are adjacent, and that there's therefore no
"gap" between them. We could do the same here by comparing encoded keys
with one another; we just didn't, because it's more annoying code to
write, and over-counting is more than fine for our purposes.

Release justification: non-production code
Release note: None
@irfansharif irfansharif force-pushed the 220225.schema-limiter branch 8 times, most recently from 73583d2 to 65a4d86 on March 11, 2022 03:54
@irfansharif irfansharif marked this pull request as ready for review March 11, 2022 03:55
@irfansharif irfansharif requested a review from a team as a code owner March 11, 2022 03:55
@irfansharif irfansharif requested review from a team March 11, 2022 03:55
@irfansharif irfansharif requested a review from a team as a code owner March 11, 2022 03:55
@irfansharif irfansharif removed the request for review from a team March 11, 2022 03:55
Contributor Author

@irfansharif irfansharif left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)


pkg/sql/catalog/tabledesc/structured.go, line 53 at r1 (raw file):

Previously, ajwerner wrote…

I'd rather you just had the one. If you're sad that ClusterVersion is a field as opposed to a method, feel free to change it to be a method that exposes a value stored in an unexported field. I don't like having both.

Done.

@irfansharif
Contributor Author

CI failure is an unrelated Google Cloud Storage flake; we've been playing around with stuff in TC, I hear.

Contributor

@ajwerner ajwerner left a comment

Can you add a tenant logic test that shows the end-to-end behavior working as expected?

Reviewed 10 of 52 files at r1, 11 of 23 files at r2, 1 of 11 files at r4, 22 of 30 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @irfansharif)


pkg/spanconfig/spanconfig.go, line 494 at r5 (raw file):

}

func isNil(table catalog.TableDescriptor) bool {

This is upsetting; when do you find a value being returned for which the nil-check doesn't work because it's a non-nil concrete value? I guess it's the new method to return the original version? Can we fix that? I'm not cool with this sort of defensiveness creeping in. We should nil-check when dealing in concrete values before returning interfaces. I feel like we should put a linter in play to check for this.


pkg/sql/catalog/tabledesc/structured.go, line 1014 at r5 (raw file):

// ClusterVersion returns the version of the table descriptor read from the
// store, if any.

Can you put a TODO with my name on it here to make this deal in catalog.TableDescriptor? IIRC the use of the raw struct was laziness on my part to not have to change as much code.


pkg/sql/catalog/tabledesc/structured.go, line 1025 at r5 (raw file):

// the mutations.
func (desc *Mutable) OriginalDescriptor() catalog.Descriptor {
	return desc.original

let's avoid the nil value semantics by doing the following:

if desc.original != nil {
    return desc.original
}
return nil // an untyped nil interface, not a nil concrete pointer boxed in one

@irfansharif irfansharif force-pushed the 220225.schema-limiter branch from 54090d5 to d965162 on April 4, 2022 15:44
Contributor Author

@irfansharif irfansharif left a comment

tenant logic test

Just double-checking that you saw the tests under pkg/ccl/spanconfigccl/spanconfiglimiterccl/testdata/*; they're doing what I think we'd do under a tenant logic test, except with directives to re-configure the per-tenant limit dynamically. Are you asking to test something different? With the logic test framework running under a tenant, I think I'd need to introduce special commands/testing knobs to be able to re-configure the limit from the host tenant.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)


pkg/sql/catalog/tabledesc/structured.go, line 1025 at r5 (raw file):

Previously, ajwerner wrote…

let's avoid the nil value semantics by doing the following:

if desc.original != nil {
    return desc.original
}
return nil

Done. Huh, looking at the diff I wouldn't have thought there was any behavioural difference; seems like a Go footgun.

Contributor

@ajwerner ajwerner left a comment

I filed #79426 to make this easier.

What I'm looking for is a test that shows errors when trying to create tables or add partitioning or doing various other activities, and then exercises recovery from these various scenarios. I see one test which exercises the error. It'd be good to have a test that shows that the GC job will recover some quota when it finishes. It'd be good to show that the user can, say, remove partitioning and find that doing so allows them to create tables again.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @irfansharif)


pkg/sql/catalog/descs/txn.go, line 262 at r6 (raw file):

		}
		if shouldLimit {
			return errExceededSpanCountLimit

something, somewhere, should be putting a pgcode around this error.


pkg/sql/catalog/tabledesc/structured.go, line 1025 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done. Huh, looking at the diff I wouldn't have thought there was any behavioural difference; seems like Go footgun.

This is the #1 footgun in the language as far as I'm concerned. The mindset you need to have is that any time you see an interface, it should be thought of as a box with two values [ptr to itable for type][ptr to value]. With the old code, you'd be returning [ptr to catalog.Descriptor itable for *immutable][nil] and now you're returning [nil][nil]. Only the latter compares equally to nil. https://research.swtch.com/interfaces is worth a read every few years.
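
For reference, a minimal runnable distillation of the two-word
representation described above (the types here are illustrative, not
CockroachDB's):

package main

import "fmt"

type Descriptor interface{ GetVersion() int }

type immutable struct{ version int }

func (d *immutable) GetVersion() int { return d.version }

// bad returns [ptr to itable for *immutable][nil]: a non-nil interface
// wrapping a nil pointer.
func bad() Descriptor {
  var d *immutable
  return d
}

// good special-cases the nil concrete value and returns [nil][nil].
func good() Descriptor {
  var d *immutable
  if d != nil {
    return d
  }
  return nil
}

func main() {
  fmt.Println(bad() == nil)  // false: the footgun
  fmt.Println(good() == nil) // true
}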

@irfansharif irfansharif force-pushed the 220225.schema-limiter branch 2 times, most recently from 0a172e2 to 4d45de2 on April 6, 2022 18:23
Contributor Author

@irfansharif irfansharif left a comment

It'd be good to show that the user can, say, remove partitioning and find that that will allow it to create tables again.

Added a few tests that show recovery of quota from repartitioning and dropping indexes. In writing these tests I'm running into something that feels like a bug outside of span configs. Running the following test:

exec-sql tenant=10
CREATE TABLE db.list_partitions(i INT PRIMARY KEY, j INT);
CREATE INDEX idx_i ON db.list_partitions (i);
CREATE INDEX idx_j ON db.list_partitions (j);
ALTER INDEX db.list_partitions@idx_i PARTITION BY LIST (i) (
  PARTITION one_and_five    VALUES IN (1, 5),
  PARTITION everything_else VALUES IN (DEFAULT)
);

This showed that the span count accounting only added a delta for the descriptor version between CREATE INDEX idx_j ON db.list_partitions (j); and ALTER INDEX db.list_partitions@idx_i PARTITION BY LIST (i). Because the block above was being committed in the same txn, the "original descriptor" (which I would've expected to be nil) was one with .GetVersion() == 1. I'm not sure what's going on. Handling .GetVersion() == 1 explicitly doesn't work either, since we need to distinguish between an original descriptor with version == 1 that was newly created in the current txn and an original descriptor with version == 1 created in a previously committed one. @ajwerner mentioned he'd play around with this repro to see if something stands out.

It'd be good to have a test that shows that the GC job will recover some quota when it finishes.

TestDropTableLowersSpanCount.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)


pkg/sql/catalog/descs/txn.go, line 262 at r6 (raw file):

Previously, ajwerner wrote…

something, somewhere, should be putting a pgcode around this error.

Done, just did it here (pgcode.ConfigurationLimitExceeded felt eerily appropriate).

@irfansharif irfansharif force-pushed the 220225.schema-limiter branch 2 times, most recently from 9a55d46 to 3e196df on April 7, 2022 04:35
Contributor Author

@irfansharif irfansharif left a comment

In writing these tests I'm running into something that feels like a bug outside of span configs.

Tests here work just fine after rebasing on top of #79561. I'll merge after that goes in. Is there anything outstanding in this PR?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)

@irfansharif irfansharif force-pushed the 220225.schema-limiter branch from 3e196df to a863132 on April 7, 2022 20:25
Fixes cockroachdb#70555. In order to limit the number of span configs a tenant's
able to install, we introduce a tenant-side spanconfig.Limiter. It
presents the following interface:

    // Limiter is used to limit the number of span configs installed by
    // secondary tenants. It takes in a delta (typically the difference
    // in span configs between the committed and uncommitted state in
    // the txn), uses it to maintain an aggregate counter, and informs
    // the caller if exceeding the prescribed limit.
    type Limiter interface {
      ShouldLimit(
        ctx context.Context, txn *kv.Txn, delta int,
      ) (bool, error)
    }

The delta is computed using a static helper, spanconfig.Delta:

    // Delta considers both the committed and uncommitted state of a
    // table descriptor and computes the difference in the number of
    // spans we can apply a configuration over.
    func Delta(
      ctx context.Context, s Splitter,
      committed, uncommitted catalog.TableDescriptor,
    ) (int, error)
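
As a hedged sketch of how these two pieces compose at the commit-time
integration point (checkSpanCountLimit is an illustrative name, not
necessarily the helper this patch introduces):

    func checkSpanCountLimit(
      ctx context.Context, txn *kv.Txn,
      limiter spanconfig.Limiter, splitter spanconfig.Splitter,
      committed, uncommitted catalog.TableDescriptor,
    ) error {
      // Compute how many more (or fewer) spans the txn's descriptor
      // changes would configure, and run the delta by the limiter.
      delta, err := spanconfig.Delta(ctx, splitter, committed, uncommitted)
      if err != nil {
        return err
      }
      shouldLimit, err := limiter.ShouldLimit(ctx, txn, delta)
      if err != nil {
        return err
      }
      if shouldLimit {
        return pgerror.New(pgcode.ConfigurationLimitExceeded,
          "exceeded limit for number of table spans")
      }
      return nil
    }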

This limiter only applies to secondary tenants. The counter is
maintained in a newly introduced (tenant-only) system table, using the
following schema:

    CREATE TABLE system.span_count (
      singleton  BOOL DEFAULT TRUE,
      span_count INT NOT NULL,
      CONSTRAINT "primary" PRIMARY KEY (singleton),
      CONSTRAINT single_row CHECK (singleton),
      FAMILY "primary" (singleton, span_count)
    );

We need just two integration points for spanconfig.Limiter:
- Right above CheckTwoVersionInvariant, where we're able to hook into
  the committed and to-be-committed descriptor state before txn commit;
- In the GC job, when gc-ing table state. We decrement a table's split
  count when GC-ing the table for good.

The per-tenant span config limit is controlled by a new tenant
read-only cluster setting: spanconfig.tenant_limit. Multi-tenant cluster
settings (cockroachdb#73857) provide the infrastructure for the host
tenant to control this setting cluster-wide, or to target a specific
tenant at a time.

We also need a migration here, to start tracking span counts for
clusters with pre-existing tenants. We introduce a migration that scans
over all table descriptors and seeds system.span_count with the right
value. Given that cluster version gates disseminate asynchronously, we
also need a preliminary version to start tracking incremental changes.

It's useful to introduce the notion of debt. This will be handy if/when
we lower per-tenant limits, and also in the migration above since it's
possible for pre-existing tenants to have committed state in violation
of the prescribed limit. When in debt, schema changes that add new
splits will be rejected (dropping tables/indexes/partitions/etc. will
work just fine).

When attempting a txn that goes over the configured limit, the UX is as
follows:

    > CREATE TABLE db.t42(i INT PRIMARY KEY);
    pq: exceeded limit for number of table spans

Release note: None
Release justification: low risk, high benefit change

@irfansharif irfansharif force-pushed the 220225.schema-limiter branch from a863132 to 90054c6 on April 7, 2022 20:58
Contributor

@ajwerner ajwerner left a comment

Do we have any tools to re-compute the count or to reset it manually? It'd make me feel a lot better if we did; if there were a leak but we had an easy mitigation, that'd be nice. What is the writability of the column by root? What sort of access do serverless tenants have? I think ideally the table would only be writable by node, and then we'd have a builtin or something to recompute the count by scanning all of the descriptors, and another which does both a recompute and a set operation, perhaps also requiring that it's a single-statement transaction. Can you file a follow-up task to build something?

Beyond that, :lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)

Contributor Author

@irfansharif irfansharif left a comment

Filed #79616. I'll bors on green. Thanks for reviewing and the speedy bug fix!

@irfansharif
Contributor Author

bors r+

@craig
Contributor

craig bot commented Apr 8, 2022

Build succeeded:
