
rfc: distributed token bucket RFC #66436

Merged · 1 commit into cockroachdb:master from distbucket-rfc · Sep 16, 2021

Conversation

RaduBerinde
Member

@RaduBerinde RaduBerinde commented Jun 14, 2021

Link to RFC text.


This is an initial draft.

Design of a subsystem relevant in the multi-tenant setting
("serverless") which rate limits tenant KV operations in conformance
to a budget target.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment

Cursory first pass. I haven't internalized the shares yet.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @kernfeld-cockroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):

```sql
BEGIN;
-- Get the latest state (with the largest sequence number).
SELECT * FROM tenant_usage WHERE tenant_id=.. ORDER BY seq DESC LIMIT 1;
```

nit: FOR UPDATE.


docs/RFCS/20210604_distributed_token_bucket.md, line 292 at r1 (raw file):

```sql
SELECT seq FROM tenant_usage WHERE tenant_id=.. ORDER BY seq DESC LIMIT 1 OFFSET 1000;
```

We have reverse scanning; why not go from the bottom up?


docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):

TODO: it's unfortunate that we will have to talk to two ranges on each
operation. Could we achieve this with a single index? We could reserve part of
the UUID space for the sequence number, allowing one index to serve a
dual-purpose, but things would become messy.

Two total indexes do seem in order. However, I don't know that we need to have two indexes over all of the ledger events. At the end of the day, we're going to be totally ordering all of the updates. What if we had two tables, but one of them (the current state) ends up just being a single-row table? That way, with some additional work, it might possibly end up inside the same range, if we don't mandate splits on table boundaries and all of this stays small. I do get that this all, on some level, is begging for interleaving.

Consider a table that has the primary key as just the tenant_id and stores the cumulative stats and then a ledger which is keyed on (tenant_id, op_id) but stores the sequence number. Would that be better? During GC, at least you'd only need to write to one table.


docs/RFCS/20210604_distributed_token_bucket.md, line 325 at r1 (raw file):

```go
   ) (
     updateTimestamp time.Time,
     grantedTokens time.Time,
```

is this type right?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):

Previously, ajwerner wrote…

nit: FOR UPDATE.

We're not updating any existing row; does FOR UPDATE still help in this case?


docs/RFCS/20210604_distributed_token_bucket.md, line 292 at r1 (raw file):

Previously, ajwerner wrote…

We have reverse scanning; why not go from the bottom up?

I guess since these things are sequential, we can just read the latest one and subtract 1000. Updated.
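
A minimal Go sketch of that simplification, assuming an illustrative tenant_usage schema and a hypothetical history size of 1000 entries (not the actual implementation): read the latest per-tenant sequence number and delete everything sufficiently far behind it.

```go
package tokenbucket

import (
	"context"
	"database/sql"
)

// gcTenantUsage deletes ledger entries more than historySize sequence
// numbers behind the latest one, relying on per-tenant seq being dense.
func gcTenantUsage(ctx context.Context, db *sql.DB, tenantID int64) error {
	const historySize = 1000 // illustrative retention depth
	var latest int64
	if err := db.QueryRowContext(ctx,
		`SELECT seq FROM tenant_usage WHERE tenant_id = $1
		 ORDER BY seq DESC LIMIT 1`, tenantID).Scan(&latest); err != nil {
		return err
	}
	_, err := db.ExecContext(ctx,
		`DELETE FROM tenant_usage WHERE tenant_id = $1 AND seq <= $2`,
		tenantID, latest-historySize)
	return err
}
```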


docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):

Previously, ajwerner wrote…
TODO: it's unfortunate that we will have to talk to two ranges on each
operation. Could we achieve this with a single index? We could reserve part of
the UUID space for the sequence number, allowing one index to serve a
dual-purpose, but things would become messy.

Two total indexes do seem in order. However, I don't know that we need to have two indexes over all of the ledger events. At the end of the day, we're going to be totally ordering all of the updates. What if we had two tables, but one of them (the current state) ends up just being a single-row table? That way, with some additional work, it might possibly end up inside the same range, if we don't mandate splits on table boundaries and all of this stays small. I do get that this all, on some level, is begging for interleaving.

Consider a table that has the primary key as just the tenant_id and stores the cumulative stats and then a ledger which is keyed on (tenant_id, op_id) but stores the sequence number. Would that be better? During GC, at least you'd only need to write to one table.

It feels roughly equivalent to me. Updating a single row is no better than inserting new rows (given that we're keeping all versions internally). The GC just happens at another level.

I don't see how that would help with keeping to a single range, though - we have many tenants, so we would need to interleave the two tables. I guess we could create a table per tenant, but that could be a lot of tables (and I think the proposed schema would work fine in that case too).

A variation on your idea would be to use a (tenant_id, op_id) primary key and store the current state in (tenant_id, 0). That would keep things to a single range if any single tenant's data stays small.
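
A hypothetical rendering of that variation, in the style of CockroachDB's Go-defined system schemas; the column names and types are illustrative, not the final schema:

```go
package systemschema

// tenantUsageTableSketch reserves op_id 0 (an all-zero UUID) for the
// tenant's current state, so state and ledger entries share one contiguous
// key span and, while the data stays small, can live in a single range.
const tenantUsageTableSketch = `
CREATE TABLE tenant_usage (
    tenant_id INT8 NOT NULL,
    op_id     UUID NOT NULL,  -- all-zero UUID holds the current state
    seq       INT8 NOT NULL,  -- per-tenant sequence number
    PRIMARY KEY (tenant_id, op_id)
)`
```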

Contributor

@kernfeld-cockroach kernfeld-cockroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 346 at r2 (raw file):

```sql
SELECT crdb_internal(
```

Do we need to give this SQL function a name? Otherwise this function looks good.

@RaduBerinde
Member Author


docs/RFCS/20210604_distributed_token_bucket.md, line 346 at r2 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

Do we need to give this SQL function a name? Otherwise this function looks good.

Oops, done.

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, @kernfeld-cockroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):

Previously, RaduBerinde wrote…

We're not updating any existing row, does FOR UPDATE still help in this case?

Aren't we going to increment the sequence number to generate the next row? It would be a bummer to have two actors read the current max sequence, then both try to write the next sequence; one wins and the other waits, then writes, then refreshes due to the WTO (write-too-old) error, then fails to refresh and restarts.


docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):

A variation on your idea would be to use a (tenant_id, op_id) primary key and store the current state in (tenant_id, 0). That would keep things to a single range if any single tenant's data stays small.

Yes, that makes sense. I was thinking that if the whole table remains small (as I expect it to), then it could still fit in one-ish range. Imagine 10k tenants × 500 entries per tenant × 64 bytes per entry ≈ 320 MB, which is one range. We don't want to rely on that, and those numbers are likely too small. On some level this feels like premature optimization.


docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):

idempotency). The sequence numbers allow locating the latest entry.

Schema for the system table:

One thing to start thinking about is cold startup time for a new pod in a new region. @andy-kimball has been making noise that some time in the year 2022 we'd like to be able to spin up a SQL pod and serve a query without any global round-trips. What would it take to achieve that goal in this subsystem?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):

Previously, ajwerner wrote…

Aren't we going to increment the sequence number to generate the next row? It would be a bummer to have two actors read the current max sequence, then both try to write the next sequence; one wins and the other waits, then writes, then refreshes due to the WTO (write-too-old) error, then fails to refresh and restarts.

I see, so FOR UPDATE would force these two txns to deal with each other earlier. Done.
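
A hedged Go sketch of the resulting transaction shape, with an illustrative schema (not the RFC's exact statements): the locking read makes a concurrent writer queue on the latest row instead of inserting a conflicting next-sequence row and restarting after a failed refresh.

```go
package tokenbucket

import (
	"context"
	"database/sql"
)

func appendLedgerEntry(ctx context.Context, db *sql.DB, tenantID int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if Commit succeeds

	// Lock the latest state row so concurrent actors serialize here.
	var seq int64
	if err := tx.QueryRowContext(ctx,
		`SELECT seq FROM tenant_usage WHERE tenant_id = $1
		 ORDER BY seq DESC LIMIT 1 FOR UPDATE`, tenantID).Scan(&seq); err != nil {
		return err
	}
	// ... compute the new bucket state from the latest row ...
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO tenant_usage (tenant_id, seq) VALUES ($1, $2)`,
		tenantID, seq+1); err != nil {
		return err
	}
	return tx.Commit()
}
```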


docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):

Previously, ajwerner wrote…

A variation on your idea would be to use a (tenant_id, op_id) primary key and store the current state in (tenant_id, 0). That would keep things to a single range if any single tenant's data stays small.

Yes, that makes sense. I was thinking that if the whole table remains small (as I expect it to), then it could still fit in one-ish range. Imagine 10k tenants × 500 entries per tenant × 64 bytes per entry ≈ 320 MB, which is one range. We don't want to rely on that, and those numbers are likely too small. On some level this feels like premature optimization.

I don't know if that would be desirable - it might be a bottleneck since all tenants would be hitting the same range all the time. I'd expect the table to split as necessary based on load (even if it would otherwise fit in one range).


docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):

Previously, ajwerner wrote…

One thing to start thinking about is cold startup time for a new pod in a new region. @andy-kimball has been making noise that some time in the year 2022 we'd like to be able to spin up a SQL pod and serve a query without any global round-trips. What would it take to achieve that goal in this subsystem?

I think we could start with an initial amount of tokens and treat that as debt in the first request.
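
A minimal sketch of this idea with assumed names (nothing here is the actual API): the pod boots with a configured token amount so it can serve queries before any global round-trip, and carries that amount as debt to be settled by its first request to the global bucket.

```go
package tokenbucket

// bootState lets a new SQL pod serve queries before any global round-trip.
type bootState struct {
	tokens float64 // usable immediately on startup
	debt   float64 // owed to the global bucket
}

func newBootState(initialTokens float64) *bootState {
	return &bootState{tokens: initialTokens, debt: initialTokens}
}

// firstRequestAmount is what the pod asks the global bucket for on first
// contact: its normal request plus the boot-time debt.
func (s *bootState) firstRequestAmount(wanted float64) float64 {
	amount := wanted + s.debt
	s.debt = 0
	return amount
}
```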

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):

Previously, RaduBerinde wrote…

I think we could start with an initial amount of tokens and treat that as debt in the first request.

Nice, maybe worthy of a comment somewhere that we can boot instances with some initial tokens to lower latency of startup.

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):

Previously, ajwerner wrote…

Nice, maybe worthy of a comment somewhere that we can boot instances with some initial tokens to lower latency of startup.

Done, mentioned it under the "Initial amount" knob.

Contributor

@cucaroach cucaroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @kernfeld-cockroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 99 at r4 (raw file):

 - if there are sufficient "burst" tokens already in the bucket, the tokens are
   granted immediately;
 - otherwise, a fraction of the global refill rate is granted to the node and

How is the fraction determined? Is it the number of tenants with a current share > 0?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 99 at r4 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

How is the fraction determined? Is it the number of tenants with a current share > 0?

I added to the paragraph below. It's just the ratio of node shares to total shares.
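
In code form, the ratio is simply the following (a sketch with illustrative names, not the RFC's API):

```go
package tokenbucket

// grantedRate returns the slice of the global refill rate granted to a node
// holding nodeShares out of totalShareSum shares.
func grantedRate(globalRefillRate, nodeShares, totalShareSum float64) float64 {
	if totalShareSum <= 0 {
		return 0
	}
	return globalRefillRate * nodeShares / totalShareSum
}
```

For example, a node holding 2 of 10 total shares against a 1000 RU/s global rate would be granted 200 RU/s.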

Contributor

@kernfeld-cockroach kernfeld-cockroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 243 at r4 (raw file):

To clarify, multiple tenant usages can have the same sequence number as long as they have different tenant IDs, right? If so, I'd write it like:

Each change has a unique operation ID. For any given tenant, each change has a unique sequence number, which allows detection of duplicate requests to ensure idempotency. The sequence numbers allow locating the latest entry for each tenant.


docs/RFCS/20210604_distributed_token_bucket.md, line 341 at r4 (raw file):

## Configuration API

What mechanism should be used for initial configuration? Just this same function? What would happen if we didn't call this?


docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):

```sql
SELECT crdb_internal.update_tenant_resource_limits(
```

How quickly would this function return? Does it wait until the new entry has been inserted into the table? Is there any advantage to calling this function in a batch? That way we could perhaps lock the table once, update every bucket, and then release the lock.


docs/RFCS/20210604_distributed_token_bucket.md, line 366 at r4 (raw file):

 - the refill that would have happened in the delta time period.

## Resilience considerations

Is my understanding correct that, if crdb_internal.update_tenant_resource_limits isn't called for a while, the buckets will continue to operate at a constant capacity? I think that's an important property to have in case there is a problem with the code path responsible for continually adjusting the capacity.


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

inaccessible. To achieve this, in the short term each node continues operating
at the previous rate of consumption. Longer term, the rate can decay over time
toward 1/N of the total refill rate, where N is the number of SQL pods.

How are we going to evaluate the number of SQL pods? Something like, the number of SQL pods known to have done operations within the past 60 seconds? CockroachCloud could provide a hint as to how many SQL pods there should be, but I think that would probably be unnecessary complexity.

@RaduBerinde RaduBerinde force-pushed the distbucket-rfc branch 2 times, most recently from d82937e to ebeb800, on June 15, 2021 at 15:53
Member Author

@RaduBerinde RaduBerinde left a comment

Thanks for the comments!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 243 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

To clarify, multiple tenant usages can have the same sequence number as long as they have different tenant IDs, right? If so, I'd write it like:

Each change has a unique operation ID. For any given tenant, each change has a unique sequence number, which allows detection of duplicate requests to ensure idempotency. The sequence numbers allow locating the latest entry for each tenant.

Done.


docs/RFCS/20210604_distributed_token_bucket.md, line 246 at r4 (raw file):

idempotency). The sequence numbers allow locating the latest entry.

Schema for the system table:

To give credit where it is due, this ledger approach was suggested by @andy-kimball.


docs/RFCS/20210604_distributed_token_bucket.md, line 341 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

What mechanism should be used for initial configuration? Just this same function? What would happen if we didn't call this?

Added


docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

How quickly would this function return? Does it wait until the new entry has been inserted into the table? Is there any advantage to calling this function in a batch? That way we could perhaps lock the table once, update every bucket, and then release the lock.

Yes it would return after the entry is inserted.

I think we should plan to call it once per tenant in the first implementation and investigate batching later. It's not immediately obvious to me that changing all tenants in one transaction is better. Depending on what we decide, it may be possible to implement without adding more syntax, by doing

SELECT crdb_internal.update_tenant_resource_limits(a,b,c,d,..) FROM (VALUES (...)) AS v(a,b,c,d..)

docs/RFCS/20210604_distributed_token_bucket.md, line 366 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

Is my understanding correct that, if crdb_internal.update_tenant_resource_limits isn't called for a while, the buckets will continue to operate at a constant capacity? I think that's an important property to have in case there is a problem with the code path responsible for continually adjusting the capacity.

I'm not understanding "constant capacity". The bucket will continue with the last settings; it will keep using RUs as available and it will continue refilling at the same rate.


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

How are we going to evaluate the number of SQL pods? Something like, the number of SQL pods known to have done operations within the past 60 seconds? CockroachCloud could provide a hint as to how many SQL pods there should be, but I think that would probably be unnecessary complexity.

I am assuming here that we will be able to tell from within the SQL pod how many other SQL pods there are in your cluster. I'm betting that we would need this for other things anyway (like DistSQL or showing per-node info in the UI or virtual tables).

Contributor

@kernfeld-cockroach kernfeld-cockroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):

Previously, RaduBerinde wrote…

Yes it would return after the entry is inserted.

I think we should plan to call it once per tenant in the first implementation and investigate batching later. It's not immediately obvious to me that changing all tenants in one transaction is better. Depending on what we decide, it may be possible to implement without adding more syntax, by doing

SELECT crdb_internal.update_tenant_resource_limits(a,b,c,d,..) FROM (VALUES (...)) AS v(a,b,c,d..)

Yep, sounds great. Am I correct that updates for different tenants can be processed largely in parallel?


docs/RFCS/20210604_distributed_token_bucket.md, line 366 at r4 (raw file):

Previously, RaduBerinde wrote…

I'm not understanding "constant capacity". The bucket will continue with the last settings; it will keep using RUs as available and it will continue refilling at the same rate.

Okay great, that's what I was trying to say.


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, RaduBerinde wrote…

I am assuming here that we will be able to tell from within the SQL pod how many other SQL pods there are in your cluster. I'm betting that we would need this for other things anyway (like DistSQL or showing per-node info in the UI or virtual tables).

I agree that this assumption will eventually be true, but I also think it's important to have a strategy for how pods can be counted right now or in the very near future.


docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):

```sql
-- Calculate new state.
-- Set the new state, using the next sequence number.
INSERT INTO tenant_usage (tenant_id, seq, ...) (tenant_id, seq+1, ...);
```

From the CRDB docs, my sense is that generating sequence numbers by selecting and adding 1 is not the most effective way to leverage CRDB's performance, because we might encounter contention on a few ranges. Is there a way to use a strategy from the docs FAQ "How do I generate unique, slowly increasing sequential numbers in CockroachDB?" instead?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

Yep, sounds great. Am I correct that updates for different tenants can be processed largely in parallel?

Yes.


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

I agree that this assumption will eventually be true, but I also think it's important to have a strategy for how pods can be counted right now or in the very near future.

In the near future, we can assume 1 node (the worst consequence would be that we allow too much resource usage while the bucket is unavailable).
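
A hedged sketch of the fallback described in the RFC excerpt quoted above, assuming N=1 for now per this discussion; the exponential form and the one-minute half-life are illustrative choices, not from the RFC:

```go
package tokenbucket

import (
	"math"
	"time"
)

// fallbackRate decays the locally allowed rate from the last observed
// consumption rate toward totalRefillRate/N while the bucket is unreachable.
func fallbackRate(lastRate, totalRefillRate float64, numPods int, outage time.Duration) float64 {
	const halfLifeSeconds = 60.0 // illustrative decay half-life
	if numPods < 1 {
		numPods = 1 // per the discussion, assume one pod for now
	}
	target := totalRefillRate / float64(numPods)
	decay := math.Pow(0.5, outage.Seconds()/halfLifeSeconds)
	return target + (lastRate-target)*decay
}
```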


docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

From the CRDB docs, my sense is that generating sequence numbers by selecting and adding 1 is not the most effective way to leverage CRDB's performance, because we might encounter contention on a few ranges. Is there a way to use a strategy from the docs FAQ "How do I generate unique, slowly increasing sequential numbers in CockroachDB?" instead?

We need to read the current bucket state no matter what.

Contributor

@kernfeld-cockroach kernfeld-cockroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, RaduBerinde wrote…

In the near future, we can assume 1 node (the worst consequence would be that we allow too much resource usage while the bucket is unavailable).

I agree with that. But what about the also pretty near future where there may be multiple SQL pods per tenant? I would be satisfied even if we identified which team is responsible for delivering pod-counting functionality.


docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):

Previously, RaduBerinde wrote…

We need to read the current bucket state no matter what.

That makes sense. I imagine we could get away with reading slightly stale bucket state if we thought that would help. But it also sounds like, at the scale of expected updates, we aren't at all concerned with update performance here, right?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

I agree with that. But what about the also pretty near future where there may be multiple SQL pods per tenant? I would be satisfied even if we identified which team is responsible for delivering pod-counting functionality.

CC @andy-kimball


docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

That makes sense. I imagine we could get away with reading slightly stale bucket state if we thought that would help. But it also sounds like, at the scale of expected updates we aren't at all concerned with update performance here, right?

Hm, no, it's critical that we read the most current state (or we'd lose track of consumed units, among other issues).

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @kernfeld-cockroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 242 at r5 (raw file):

## System table

The system table uses a "ledger" approach; each change to the state of a global

What happens when the rate changes by 10x at all 10 pods of the tenant and they come one at a time and execute their transaction? Will they get different shares based on their arrival order?


docs/RFCS/20210604_distributed_token_bucket.md, line 261 at r5 (raw file):

  ru_current FLOAT NOT NULL,
  current_share_sum FLOAT NOT NULL,

Is each tenant row maintaining a share sum across all tenants? Or maybe there is a total order across all rows, since the earlier text mentioned "ledger" -- I am not seeing how that total order comes about. Is it the seq?


docs/RFCS/20210604_distributed_token_bucket.md, line 281 at r5 (raw file):

```sql
BEGIN;
-- Get the latest state (with the largest sequence number).
SELECT * FROM tenant_usage WHERE tenant_id=.. ORDER BY seq DESC LIMIT 1 FOR UPDATE;
```

Does this tenant not need the state of other tenants to compute its share of the rate?


docs/RFCS/20210604_distributed_token_bucket.md, line 380 at r5 (raw file):

The system must have reasonable behavior if the bucket range becomes temporarily
inaccessible. To achieve this, in the short term each node continues operating

What is "short term" here?
This dependency on a fully functioning transactional system worries me, and even more so for a geo partitioned cluster.

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 21, 2021
This change adds the TokenBucket API proposed in the RFC (cockroachdb#66436), a
stub implementation and client for it, and the corresponding KV
connector interface.

The client and server-side code lives in
ccl/multitenantccl/tenantcostclient and tenantcostserver.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 22, 2021
This change adds the TokenBucket API proposed in the RFC (cockroachdb#66436), a
stub implementation and client for it, and the corresponding KV
connector interface.

The client and server-side code lives in
ccl/multitenantccl/tenantcostclient and tenantcostserver.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 22, 2021
67067: server: require admin role to access node status r=bdarnell a=knz

Release note (security update): The node status retrieval endpoints
over HTTP (`/_status/nodes`, `/_status/nodes/<N>` and the web UI
`/#/reports/nodes`) have been updated to require the `admin` role from
the requesting user. This ensures that operational details such as
network addresses and command-line flags do not leak to unprivileged
users.

67733: colexecbase: extend support of casts r=yuzefovich a=yuzefovich

Addresses: #48135

See individual commits for details. After this PR we only need to add
more casts between natively supported types.

67768: sql, server: add skeleton TokenBucket connector and tenant resource limits configuration APIs r=RaduBerinde a=RaduBerinde

This PR is a scaled back version of #67508 where we don't use the system table at all. It's meant to put some of the infrastructure pieces in place and provide a stub API for reconfiguration.

The plan is to add consumption metrics on top of this soon so that CC can develop in parallel.

---

#### server: add TokenBucket connector API

This change adds the TokenBucket API proposed in the RFC (#66436), a
stub implementation and client for it, and the corresponding KV
connector interface.

The client and server-side code lives in
ccl/multitenantccl/tenantcostclient and tenantcostserver.

Release note: None

#### sql: tenant resource limits configuration API

This commit adds a `crdb_internal.update_tenant_resource_limits`
internal SQL function (to be used by the system tenant) which updates
the token bucket configuration for a specific tenant.

Release note: None


67840: sql: add test for creating stats on tables with expression indexes r=mgartner a=mgartner

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 27, 2021
Add the system table described in the RFC (cockroachdb#66436).

The table is only created for the system tenant.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 28, 2021
Add the system table described in the RFC (cockroachdb#66436).

The table is only created for the system tenant.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 29, 2021
Add the system table described in the RFC (cockroachdb#66436).

The table is only created for the system tenant.

Release note: None
Member Author

@RaduBerinde RaduBerinde left a comment

Updated to use Andrew's idea for the system table (after refining it together a bit more). This solves the problem of cleaning up the shares when an instance goes away; we no longer need the shares to "decay". Also see #68115 where this is prototyped.

I still plan to address the open comments around more examples and calculations for how much KV churn we expect to generate.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @joshimhoff, @knz, @RaduBerinde, @sumeerbhola, and @tbg)

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 30, 2021
Add the system table described in the RFC (cockroachdb#66436).

The table is only created for the system tenant.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 31, 2021
68115: tenantcostserver: use the tenant_usage system table r=RaduBerinde a=RaduBerinde

This PR prototypes a new tenant_usage schema  based on @ajwerner's suggestion in the RFC (#66436). I will update the RFC after getting some initial feedback here.

#### sql: add tenant_usage system table

Add the system table described in the RFC (#66436).

The table is only created for the system tenant.

Release note: None

#### tenantcostserver: use the tenant_usage system table

This change implements most of the interaction with the tenant_usage
system table, with the exception of dead instances detection and
clean-up.

We currently tolerate an empty table, but it would be cleaner to
initialize the tenant state (instance_id=0 row) at tenant creation
time (+ implement a migration). I will explore this in a future
change, when we add some configurable defaults for the refill rate
etc.

Release note: None

Co-authored-by: Radu Berinde <[email protected]>
docs/RFCS/20210604_distributed_token_bucket.md, line 70 at r12 (raw file):

- a limit on how many unused tokens (RUs) we can accumulate as burst. Refill is
essentially paused when the bucket has more tokens than the limit.

Each SQL pod implements a local token bucket and uses it for admission control
Contributor

@vy-ton vy-ton left a comment

I understand that this RFC does not cover overload admission control, but I am interested in the intersection between the 2 subsystems.

  • What happens if a SQL tenant uses tokens, only to be rejected by overload admission control?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @joshimhoff, @knz, @RaduBerinde, @sumeerbhola, @tbg, and @vy-ton)


docs/RFCS/20210604_distributed_token_bucket.md, line 70 at r12 (raw file):

Previously, vy-ton (Vy Ton) wrote…

I understand that this RFC does not cover overload admission control, but I am interested in the intersection between the 2 subsystems.

  • What happens if a SQL tenant uses tokens, only to be rejected by overload admission control?

In its current incarnation, overload control does not reject operations; it just delays them.

But it is a very good question - if an operation requires tokens and later hits an error, we should not report those RUs as consumed.
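
A minimal sketch tying together the refill rule from the quoted RFC text and the refund point above; the names and structure are assumptions, not the actual implementation:

```go
package tokenbucket

import (
	"sync"
	"time"
)

type localBucket struct {
	mu         sync.Mutex
	tokens     float64 // current RU balance
	refillRate float64 // RUs per second granted to this pod
	burstLimit float64 // refill pauses above this balance
}

// refill adds elapsed*refillRate tokens but never pushes the balance above
// the burst limit, matching "refill is essentially paused" in the RFC text.
func (b *localBucket) refill(elapsed time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens >= b.burstLimit {
		return
	}
	b.tokens += b.refillRate * elapsed.Seconds()
	if b.tokens > b.burstLimit {
		b.tokens = b.burstLimit
	}
}

func (b *localBucket) adjust(delta float64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens += delta
}

// runOp deducts the estimated cost up front and refunds it if the operation
// fails, so failed work is not reported as consumed RUs.
func runOp(b *localBucket, cost float64, op func() error) error {
	b.adjust(-cost)
	if err := op(); err != nil {
		b.adjust(+cost)
		return err
	}
	return nil
}
```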

Member Author

@RaduBerinde RaduBerinde left a comment

Added some back-of-the-envelope calculations for the workload on the system table.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @joshimhoff, @knz, @RaduBerinde, @sumeerbhola, @tbg, and @vy-ton)


docs/RFCS/20210604_distributed_token_bucket.md, line 120 at r5 (raw file):

Previously, sumeerbhola wrote…

+1 for including such back-of-the-envelope calculations.
I think the number of versions is limited, since each update uses a new seqnum, but it is worth calling out in the Performance considerations section.

Done.

@RaduBerinde
Member Author

I will merge this for now because I don't have much time to improve it in the immediate future.

bors r+

@craig
Contributor

craig bot commented Sep 16, 2021

Build succeeded.

@craig craig bot merged commit d3a0546 into cockroachdb:master Sep 16, 2021
@RaduBerinde RaduBerinde deleted the distbucket-rfc branch September 21, 2021 20:47