
rfc: distributed token bucket RFC #66436

Merged · 1 commit into cockroachdb:master from distbucket-rfc · Sep 16, 2021

Conversation

RaduBerinde
Member

@RaduBerinde RaduBerinde commented Jun 14, 2021

Link to RFC text.


This is an initial draft.

Design of a subsystem relevant in the multi-tenant setting
("serverless") which rate limits tenant KV operations in conformance
to a budget target.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment

Cursory first pass. I haven't internalized the shares yet.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @kernfeld-cockroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):

```sql
BEGIN;
-- Get the latest state (with the largest sequence number).
SELECT * FROM tenant_usage WHERE tenant_id=.. ORDER BY seq DESC LIMIT 1;
```

nit: FOR UPDATE.


docs/RFCS/20210604_distributed_token_bucket.md, line 292 at r1 (raw file):

```sql
SELECT seq FROM tenant_usage WHERE tenant_id=.. ORDER BY seq DESC LIMIT 1 OFFSET 1000;
```

We have reverse scanning; why not go from the bottom up?


docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):

TODO: it's unfortunate that we will have to talk to two ranges on each
operation. Could we achieve this with a single index? We could reserve part of
the UUID space for the sequence number, allowing one index to serve a
dual-purpose, but things would become messy.

Two total indexes do seem in order. However, I don't know that we need to have two indexes over all of the ledger events. At the end of the day, we're going to be totally ordering all of the updates. What if we had two tables, but one of them (the current state) ends up just being a single-row table? That way, with some additional work, it might possibly end up inside the same range, if we don't mandate splits on table boundaries and all of this stays small. I do get that this all, on some level, is begging for interleaving.

Consider a table that has the primary key as just the tenant_id and stores the cumulative stats and then a ledger which is keyed on (tenant_id, op_id) but stores the sequence number. Would that be better? During GC, at least you'd only need to write to one table.


docs/RFCS/20210604_distributed_token_bucket.md, line 325 at r1 (raw file):

```go
   ) (
     updateTimestamp time.Time,
     grantedTokens time.Time,
```

is this type right?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):

Previously, ajwerner wrote…

nit: FOR UPDATE.

We're not updating any existing row; does FOR UPDATE still help in this case?


docs/RFCS/20210604_distributed_token_bucket.md, line 292 at r1 (raw file):

Previously, ajwerner wrote…

We have reverse scanning; why not go from the bottom up?

I guess since these things are sequential, we can just read the latest one and subtract 1000. Updated.
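
A minimal Go sketch of that simplification, assuming an illustrative tenant_usage schema and a hypothetical history size of 1000 entries (not the actual implementation): read the latest per-tenant sequence number and delete everything sufficiently far behind it.

```go
package tokenbucket

import (
	"context"
	"database/sql"
)

// gcTenantUsage deletes ledger entries more than historySize sequence
// numbers behind the latest one, relying on per-tenant seq being dense.
func gcTenantUsage(ctx context.Context, db *sql.DB, tenantID int64) error {
	const historySize = 1000 // illustrative retention depth
	var latest int64
	if err := db.QueryRowContext(ctx,
		`SELECT seq FROM tenant_usage WHERE tenant_id = $1
		 ORDER BY seq DESC LIMIT 1`, tenantID).Scan(&latest); err != nil {
		return err
	}
	_, err := db.ExecContext(ctx,
		`DELETE FROM tenant_usage WHERE tenant_id = $1 AND seq <= $2`,
		tenantID, latest-historySize)
	return err
}
```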


docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):

Previously, ajwerner wrote…
TODO: it's unfortunate that we will have to talk to two ranges on each
operation. Could we achieve this with a single index? We could reserve part of
the UUID space for the sequence number, allowing one index to serve a
dual-purpose, but things would become messy.

Two total indexes do seem in order. However, I don't know that we need to have two indexes over all of the ledger events. At the end of the day, we're going to be totally ordering all of the updates. What if we had two tables, but one of them (the current state) ends up just being a single-row table? That way, with some additional work, it might possibly end up inside the same range, if we don't mandate splits on table boundaries and all of this stays small. I do get that this all, on some level, is begging for interleaving.

Consider a table that has the primary key as just the tenant_id and stores the cumulative stats and then a ledger which is keyed on (tenant_id, op_id) but stores the sequence number. Would that be better? During GC, at least you'd only need to write to one table.

It feels roughly equivalent to me. Updating a single row is no better than inserting new rows (given that we're keeping all versions internally). The GC just happens at another level.

I don't see how that would help with keeping to a single range, though - we have many tenants, so we would need to interleave the two tables. I guess we could create a table per tenant, but that could be a lot of tables (and I think the proposed schema would work fine in that case too).

A variation on your idea would be to use a (tenant_id, op_id) primary key and store the current state in (tenant_id, 0). That would keep things to a single range if any single tenant's data stays small.
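
A hypothetical rendering of that variation, in the style of CockroachDB's Go-defined system schemas; the column names and types are illustrative, not the final schema:

```go
package systemschema

// tenantUsageTableSketch reserves op_id 0 (an all-zero UUID) for the
// tenant's current state, so state and ledger entries share one contiguous
// key span and, while the data stays small, can live in a single range.
const tenantUsageTableSketch = `
CREATE TABLE tenant_usage (
    tenant_id INT8 NOT NULL,
    op_id     UUID NOT NULL,  -- all-zero UUID holds the current state
    seq       INT8 NOT NULL,  -- per-tenant sequence number
    PRIMARY KEY (tenant_id, op_id)
)`
```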

Contributor

@kernfeld-cockroach kernfeld-cockroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 346 at r2 (raw file):

```sql
SELECT crdb_internal(
```

Do we need to give this SQL function a name? Otherwise this function looks good.

@RaduBerinde
Member Author


docs/RFCS/20210604_distributed_token_bucket.md, line 346 at r2 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

Do we need to give this SQL function a name? Otherwise this function looks good.

Oops, done.

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, @kernfeld-cockroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):

Previously, RaduBerinde wrote…

We're not updating any existing row, does FOR UPDATE still help in this case?

Aren't we going to increment the sequence number to generate the next row? It would be a bummer to have two actors read the current max sequence, then both try to write the next sequence; one wins and the other waits, then writes, then refreshes due to the WTO (write-too-old) error, then fails to refresh and restarts.


docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):

A variation on your idea would be to use a (tenant_id, op_id) primary key and store the current state in (tenant_id, 0). That would keep things to a single range if any single tenant's data stays small.

Yes, that makes sense. I was thinking that if the whole table remains small (as I expect it to), then it could still fit in one-ish range. Imagine 10k tenants × 500 entries per tenant × 64 bytes per entry ≈ 320 MB, which is one range. We don't want to rely on that, and those numbers are likely too small. On some level this feels like premature optimization.


docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):

idempotency). The sequence numbers allow locating the latest entry.

Schema for the system table:

One thing to start thinking about is cold startup time for a new pod in a new region. @andy-kimball has been making noise that some time in the year 2022 we'd like to be able to spin up a SQL pod and serve a query without any global round-trips. What would it take to achieve that goal in this subsystem?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):

Previously, ajwerner wrote…

Aren't we going to increment the sequence number to generate the next row? It would be a bummer to have two actors read the current max sequence, then both try to write the next sequence; one wins and the other waits, then writes, then refreshes due to the WTO (write-too-old) error, then fails to refresh and restarts.

I see, so FOR UPDATE would force these two txns to deal with each other earlier. Done.
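
A hedged Go sketch of the resulting transaction shape, with an illustrative schema (not the RFC's exact statements): the locking read makes a concurrent writer queue on the latest row instead of inserting a conflicting next-sequence row and restarting after a failed refresh.

```go
package tokenbucket

import (
	"context"
	"database/sql"
)

func appendLedgerEntry(ctx context.Context, db *sql.DB, tenantID int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if Commit succeeds

	// Lock the latest state row so concurrent actors serialize here.
	var seq int64
	if err := tx.QueryRowContext(ctx,
		`SELECT seq FROM tenant_usage WHERE tenant_id = $1
		 ORDER BY seq DESC LIMIT 1 FOR UPDATE`, tenantID).Scan(&seq); err != nil {
		return err
	}
	// ... compute the new bucket state from the latest row ...
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO tenant_usage (tenant_id, seq) VALUES ($1, $2)`,
		tenantID, seq+1); err != nil {
		return err
	}
	return tx.Commit()
}
```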


docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):

Previously, ajwerner wrote…

A variation on your idea would be to use a (tenant_id, op_id) primary key and store the current state in (tenant_id, 0). That would keep things to a single range if any single tenant's data stays small.

Yes, that makes sense. I was thinking that if the whole table remains small (as I expect it to), then it could still fit in one-ish range. Imagine 10k tenants × 500 entries per tenant × 64 bytes per entry ≈ 320 MB, which is one range. We don't want to rely on that, and those numbers are likely too small. On some level this feels like premature optimization.

I don't know if that would be desirable - it might be a bottleneck since all tenants would be hitting the same range all the time. I'd expect the table to split as necessary based on load (even if it would otherwise fit in one range).


docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):

Previously, ajwerner wrote…

One thing to start thinking about is cold startup time for a new pod in a new region. @andy-kimball has been making noise that some time in the year 2022 we'd like to be able to spin up a SQL pod and serve a query without any global round-trips. What would it take to achieve that goal in this subsystem?

I think we could start with an initial amount of tokens and treat that as debt in the first request.
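
A minimal sketch of this idea with assumed names (nothing here is the actual API): the pod boots with a configured token amount so it can serve queries before any global round-trip, and carries that amount as debt to be settled by its first request to the global bucket.

```go
package tokenbucket

// bootState lets a new SQL pod serve queries before any global round-trip.
type bootState struct {
	tokens float64 // usable immediately on startup
	debt   float64 // owed to the global bucket
}

func newBootState(initialTokens float64) *bootState {
	return &bootState{tokens: initialTokens, debt: initialTokens}
}

// firstRequestAmount is what the pod asks the global bucket for on first
// contact: its normal request plus the boot-time debt.
func (s *bootState) firstRequestAmount(wanted float64) float64 {
	amount := wanted + s.debt
	s.debt = 0
	return amount
}
```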

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):

Previously, RaduBerinde wrote…

I think we could start with an initial amount of tokens and treat that as debt in the first request.

Nice, maybe worthy of a comment somewhere that we can boot instances with some initial tokens to lower latency of startup.

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):

Previously, ajwerner wrote…

Nice, maybe worthy of a comment somewhere that we can boot instances with some initial tokens to lower latency of startup.

Done, mentioned it under the "Initial amount" knob.

Contributor

@cucaroach cucaroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @kernfeld-cockroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 99 at r4 (raw file):

 - if there are sufficient "burst" tokens already in the bucket, the tokens are
   granted immediately;
 - otherwise, a fraction of the global refill rate is granted to the node and

How is the fraction determined? Is it the number of tenants with a current share > 0?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 99 at r4 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

How is the fraction determined? Is it the number of tenants with a current share > 0?

I added to the paragraph below. It's just the ratio of node shares to total shares.
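
In code form, the ratio is simply the following (a sketch with illustrative names, not the RFC's API):

```go
package tokenbucket

// grantedRate returns the slice of the global refill rate granted to a node
// holding nodeShares out of totalShareSum shares.
func grantedRate(globalRefillRate, nodeShares, totalShareSum float64) float64 {
	if totalShareSum <= 0 {
		return 0
	}
	return globalRefillRate * nodeShares / totalShareSum
}
```

For example, a node holding 2 of 10 total shares against a 1000 RU/s global rate would be granted 200 RU/s.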

Contributor

@kernfeld-cockroach kernfeld-cockroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 243 at r4 (raw file):

To clarify, multiple tenant usages can have the same sequence number as long as they have different tenant IDs, right? If so, I'd write it like:

Each change has a unique operation ID. For any given tenant, each change has a unique sequence number, which allows detection of duplicate requests to ensure idempotency. The sequence numbers allow locating the latest entry for each tenant.


docs/RFCS/20210604_distributed_token_bucket.md, line 341 at r4 (raw file):

## Configuration API

What mechanism should be used for initial configuration? Just this same function? What would happen if we didn't call this?


docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):

```sql
SELECT crdb_internal.update_tenant_resource_limits(
```

How quickly would this function return? Does it wait until the new entry has been inserted into the table? Is there any advantage to calling this function in a batch? That way we could perhaps lock the table once, update every bucket, and then release the lock.


docs/RFCS/20210604_distributed_token_bucket.md, line 366 at r4 (raw file):

 - the refill that would have happened in the delta time period.

## Resilience considerations

Is my understanding correct that, if crdb_internal.update_tenant_resource_limits isn't called for a while, the buckets will continue to operate at a constant capacity? I think that's an important property to have in case there is a problem with the code path responsible for continually adjusting the capacity.


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

inaccessible. To achieve this, in the short term each node continues operating
at the previous rate of consumption. Longer term, the rate can decay over time
toward 1/N of the total refill rate, where N is the number of SQL pods.

How are we going to evaluate the number of SQL pods? Something like, the number of SQL pods known to have done operations within the past 60 seconds? CockroachCloud could provide a hint as to how many SQL pods there should be, but I think that would probably be unnecessary complexity.

@RaduBerinde RaduBerinde force-pushed the distbucket-rfc branch 2 times, most recently from d82937e to ebeb800, on June 15, 2021 at 15:53
Member Author

@RaduBerinde RaduBerinde left a comment

Thanks for the comments!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 243 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

To clarify, multiple tenant usages can have the same sequence number as long as they have different tenant IDs, right? If so, I'd write it like:

Each change has a unique operation ID. For any given tenant, each change has a unique sequence number, which allows detection of duplicate requests to ensure idempotency. The sequence numbers allow locating the latest entry for each tenant.

Done.


docs/RFCS/20210604_distributed_token_bucket.md, line 246 at r4 (raw file):

idempotency). The sequence numbers allow locating the latest entry.

Schema for the system table:

To give credit where it is due, this ledger approach was suggested by @andy-kimball.


docs/RFCS/20210604_distributed_token_bucket.md, line 341 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

What mechanism should be used for initial configuration? Just this same function? What would happen if we didn't call this?

Added


docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

How quickly would this function return? Does it wait until the new entry has been inserted into the table? Is there any advantage to calling this function in a batch? That way we could perhaps lock the table once, update every bucket, and then release the lock.

Yes it would return after the entry is inserted.

I think we should plan to call it once per tenant in the first implementation and investigate batching later. It's not immediately obvious to me that changing all tenants in one transaction is better. Depending on what we decide, it may be possible to implement without adding more syntax, by doing

SELECT crdb_internal.update_tenant_resource_limits(a,b,c,d,..) FROM (VALUES (...)) AS v(a,b,c,d..)

docs/RFCS/20210604_distributed_token_bucket.md, line 366 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

Is my understanding correct that, if crdb_internal.update_tenant_resource_limits isn't called for a while, the buckets will continue to operate at a constant capacity? I think that's an important property to have in case there is a problem with the code path responsible for continually adjusting the capacity.

I'm not understanding "constant capacity". The bucket will continue with the last settings; it will keep using RUs as available and it will continue refilling at the same rate.


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

How are we going to evaluate the number of SQL pods? Something like, the number of SQL pods known to have done operations within the past 60 seconds? CockroachCloud could provide a hint as to how many SQL pods there should be, but I think that would probably be unnecessary complexity.

I am assuming here that we will be able to tell from within the SQL pod how many other SQL pods there are in your cluster. I'm betting that we would need this for other things anyway (like DistSQL or showing per-node info in the UI or virtual tables).

Contributor

@kernfeld-cockroach kernfeld-cockroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):

Previously, RaduBerinde wrote…

Yes it would return after the entry is inserted.

I think we should plan to call it once per tenant in the first implementation and investigate batching later. It's not immediately obvious to me that changing all tenants in one transaction is better. Depending on what we decide, it may be possible to implement without adding more syntax, by doing

SELECT crdb_internal.update_tenant_resource_limits(a,b,c,d,..) FROM (VALUES (...)) AS v(a,b,c,d..)

Yep, sounds great. Am I correct that updates for different tenants can be processed largely in parallel?


docs/RFCS/20210604_distributed_token_bucket.md, line 366 at r4 (raw file):

Previously, RaduBerinde wrote…

I'm not understanding "constant capacity". The bucket will continue with the last settings; it will keep using RUs as available and it will continue refilling at the same rate.

Okay great, that's what I was trying to say.


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, RaduBerinde wrote…

I am assuming here that we will be able to tell from within the SQL pod how many other SQL pods there are in your cluster. I'm betting that we would need this for other things anyway (like DistSQL or showing per-node info in the UI or virtual tables).

I agree that this assumption will eventually be true, but I also think it's important to have a strategy for how pods can be counted right now or in the very near future.


docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):

```sql
-- Calculate new state.
-- Set the new state, using the next sequence number.
INSERT INTO tenant_usage (tenant_id, seq, ...) (tenant_id, seq+1, ...);
```

From the CRDB docs, my sense is that generating sequence numbers by selecting and adding 1 is not the most effective way to leverage CRDB's performance, because we might encounter contention on a few ranges. Is there a way to use a strategy from the docs FAQ "How do I generate unique, slowly increasing sequential numbers in CockroachDB?" instead?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

Yep, sounds great. Am I correct that updates for different tenants can be processed largely in parallel?

Yes.


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

I agree that this assumption will eventually be true, but I also think it's important to have a strategy for how pods can be counted right now or in the very near future.

In the near future, we can assume 1 node (the worst consequence would be that we allow too much resource usage while the bucket is unavailable).
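
A hedged sketch of the fallback described in the RFC excerpt quoted above, assuming N=1 for now per this discussion; the exponential form and the one-minute half-life are illustrative choices, not from the RFC:

```go
package tokenbucket

import (
	"math"
	"time"
)

// fallbackRate decays the locally allowed rate from the last observed
// consumption rate toward totalRefillRate/N while the bucket is unreachable.
func fallbackRate(lastRate, totalRefillRate float64, numPods int, outage time.Duration) float64 {
	const halfLifeSeconds = 60.0 // illustrative decay half-life
	if numPods < 1 {
		numPods = 1 // per the discussion, assume one pod for now
	}
	target := totalRefillRate / float64(numPods)
	decay := math.Pow(0.5, outage.Seconds()/halfLifeSeconds)
	return target + (lastRate-target)*decay
}
```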


docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

From the CRDB docs, my sense is that generating sequence numbers by selecting and adding 1 is not the most effective way to leverage CRDB's performance, because we might encounter contention on a few ranges. Is there a way to use a strategy from the docs FAQ "How do I generate unique, slowly increasing sequential numbers in CockroachDB?" instead?

We need to read the current bucket state no matter what.

Contributor

@kernfeld-cockroach kernfeld-cockroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, RaduBerinde wrote…

In the near future, we can assume 1 node (the worst consequence would be that we allow too much resource usage while the bucket is unavailable).

I agree with that. But what about the also pretty near future where there may be multiple SQL pods per tenant? I would be satisfied even if we identified which team is responsible for delivering pod-counting functionality.


docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):

Previously, RaduBerinde wrote…

We need to read the current bucket state no matter what.

That makes sense. I imagine we could get away with reading slightly stale bucket state if we thought that would help. But it also sounds like, at the scale of expected updates, we aren't at all concerned with update performance here, right?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)


docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

I agree with that. But what about the also pretty near future where there may be multiple SQL pods per tenant? I would be satisfied even if we identified which team is responsible for delivering pod-counting functionality.

CC @andy-kimball


docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):

Previously, kernfeld-cockroach (Paul Kernfeld) wrote…

That makes sense. I imagine we could get away with reading slightly stale bucket state if we thought that would help. But it also sounds like, at the scale of expected updates we aren't at all concerned with update performance here, right?

Hm, no, it's critical that we read the most current state (or we'd lose track of consumed units, among other issues).

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @kernfeld-cockroach, and @RaduBerinde)


docs/RFCS/20210604_distributed_token_bucket.md, line 242 at r5 (raw file):

## System table

The system table uses a "ledger" approach; each change to the state of a global

What happens when the rate changes by 10x at all 10 pods of the tenant and they come one at a time and execute their transaction? Will they get different shares based on their arrival order?


docs/RFCS/20210604_distributed_token_bucket.md, line 261 at r5 (raw file):

  ru_current FLOAT NOT NULL,
  current_share_sum FLOAT NOT NULL,

Is each tenant row maintaining a share sum across all tenants? Or maybe there is a total order across all rows, since the earlier text mentioned "ledger" -- I am not seeing how that total order comes about. Is it the seq?


docs/RFCS/20210604_distributed_token_bucket.md, line 281 at r5 (raw file):

```sql
BEGIN;
-- Get the latest state (with the largest sequence number).
SELECT * FROM tenant_usage WHERE tenant_id=.. ORDER BY seq DESC LIMIT 1 FOR UPDATE;
```

Does this tenant not need the state of other tenants to compute its share of the rate?


docs/RFCS/20210604_distributed_token_bucket.md, line 380 at r5 (raw file):

The system must have reasonable behavior if the bucket range becomes temporarily
inaccessible. To achieve this, in the short term each node continues operating

What is "short term" here?
This dependency on a fully functioning transactional system worries me, and even more so for a geo partitioned cluster.

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 21, 2021
This change adds the TokenBucket API proposed in the RFC (cockroachdb#66436), a
stub implementation and client for it, and the corresponding KV
connector interface.

The client and server-side code lives in
ccl/multitenantccl/tenantcostclient and tenantcostserver.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 22, 2021
This change adds the TokenBucket API proposed in the RFC (cockroachdb#66436), a
stub implementation and client for it, and the corresponding KV
connector interface.

The client and server-side code lives in
ccl/multitenantccl/tenantcostclient and tenantcostserver.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 22, 2021
67067: server: require admin role to access node status r=bdarnell a=knz

Release note (security update): The node status retrieval endpoints
over HTTP (`/_status/nodes`, `/_status/nodes/<N>` and the web UI
`/#/reports/nodes`) have been updated to require the `admin` role from
the requesting user. This ensures that operational details such as
network addresses and command-line flags do not leak to unprivileged
users.

67733: colexecbase: extend support of casts r=yuzefovich a=yuzefovich

Addresses: #48135

See individual commits for details. After this PR we only need to add
more casts between natively supported types.

67768: sql, server: add skeleton TokenBucket connector and tenant resource limits configuration APIs r=RaduBerinde a=RaduBerinde

This PR is a scaled back version of #67508 where we don't use the system table at all. It's meant to put some of the infrastructure pieces in place and provide a stub API for reconfiguration.

The plan is to add consumption metrics on top of this soon so that CC can develop in parallel.

---

#### server: add TokenBucket connector API

This change adds the TokenBucket API proposed in the RFC (#66436), a
stub implementation and client for it, and the corresponding KV
connector interface.

The client and server-side code lives in
ccl/multitenantccl/tenantcostclient and tenantcostserver.

Release note: None

#### sql: tenant resource limits configuration API

This commit adds a `crdb_internal.update_tenant_resource_limits`
internal SQL function (to be used by the system tenant) which updates
the token bucket configuration for a specific tenant.

Release note: None


67840: sql: add test for creating stats on tables with expression indexes r=mgartner a=mgartner

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 27, 2021
Add the system table described in the RFC (cockroachdb#66436).

The table is only created for the system tenant.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 28, 2021
Add the system table described in the RFC (cockroachdb#66436).

The table is only created for the system tenant.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 29, 2021
Add the system table described in the RFC (cockroachdb#66436).

The table is only created for the system tenant.

Release note: None
Member Author

@RaduBerinde RaduBerinde left a comment

Updated to use Andrew's idea for the system table (after refining it together a bit more). This solves the problem of cleaning up the shares when an instance goes away; we no longer need the shares to "decay". Also see #68115 where this is prototyped.

I still plan to address the open comments around more examples and calculations for how much KV churn we expect to generate.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @joshimhoff, @knz, @RaduBerinde, @sumeerbhola, and @tbg)

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jul 30, 2021
Add the system table described in the RFC (cockroachdb#66436).

The table is only created for the system tenant.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 31, 2021
68115: tenantcostserver: use the tenant_usage system table r=RaduBerinde a=RaduBerinde

This PR prototypes a new tenant_usage schema  based on @ajwerner's suggestion in the RFC (#66436). I will update the RFC after getting some initial feedback here.

#### sql: add tenant_usage system table

Add the system table described in the RFC (#66436).

The table is only created for the system tenant.

Release note: None

#### tenantcostserver: use the tenant_usage system table

This change implements most of the interaction with the tenant_usage
system table, with the exception of dead instances detection and
clean-up.

We currently tolerate an empty table, but it would be cleaner to
initialize the tenant state (instance_id=0 row) at tenant creation
time (+ implement a migration). I will explore this in a future
change, when we add some configurable defaults for the refill rate
etc.

Release note: None

Co-authored-by: Radu Berinde <[email protected]>
docs/RFCS/20210604_distributed_token_bucket.md, line 70 at r12 (raw file):

- a limit on how many unused tokens (RUs) we can accumulate as burst. Refill is
essentially paused when the bucket has more tokens than the limit.

Each SQL pod implements a local token bucket and uses it for admission control
Contributor

@vy-ton vy-ton left a comment

I understand that this RFC does not cover overload admission control, but I am interested in the intersection between the 2 subsystems.

  • What happens if a SQL tenant uses tokens, only to be rejected by overload admission control?

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @joshimhoff, @knz, @RaduBerinde, @sumeerbhola, @tbg, and @vy-ton)


docs/RFCS/20210604_distributed_token_bucket.md, line 70 at r12 (raw file):

Previously, vy-ton (Vy Ton) wrote…

I understand that this RFC does not cover overload admission control, but I am interested in the intersection between the 2 subsystems.

  • What happens if a SQL tenant uses tokens, only to be rejected by overload admission control?

In its current incarnation, overload control does not reject operations; it just delays them.

But it is a very good question - if an operation requires tokens and later hits an error, we should not report those RUs as consumed.
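
A minimal sketch tying together the refill rule from the quoted RFC text and the refund point above; the names and structure are assumptions, not the actual implementation:

```go
package tokenbucket

import (
	"sync"
	"time"
)

type localBucket struct {
	mu         sync.Mutex
	tokens     float64 // current RU balance
	refillRate float64 // RUs per second granted to this pod
	burstLimit float64 // refill pauses above this balance
}

// refill adds elapsed*refillRate tokens but never pushes the balance above
// the burst limit, matching "refill is essentially paused" in the RFC text.
func (b *localBucket) refill(elapsed time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens >= b.burstLimit {
		return
	}
	b.tokens += b.refillRate * elapsed.Seconds()
	if b.tokens > b.burstLimit {
		b.tokens = b.burstLimit
	}
}

func (b *localBucket) adjust(delta float64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens += delta
}

// runOp deducts the estimated cost up front and refunds it if the operation
// fails, so failed work is not reported as consumed RUs.
func runOp(b *localBucket, cost float64, op func() error) error {
	b.adjust(-cost)
	if err := op(); err != nil {
		b.adjust(+cost)
		return err
	}
	return nil
}
```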

Member Author

@RaduBerinde RaduBerinde left a comment

Added some back-of-the-envelope calculations for the workload on the system table.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @joshimhoff, @knz, @RaduBerinde, @sumeerbhola, @tbg, and @vy-ton)


docs/RFCS/20210604_distributed_token_bucket.md, line 120 at r5 (raw file):

Previously, sumeerbhola wrote…

+1 for including such back-of-the-envelope calculations.
I think the number of versions is limited, since each update uses a new seqnum, but it is worth calling out in the Performance considerations section.

Done.

@RaduBerinde
Member Author

I will merge this for now because I don't have much time to improve it in the immediate future.

bors r+

@craig
Contributor

craig bot commented Sep 16, 2021

Build succeeded.

@craig craig bot merged commit d3a0546 into cockroachdb:master Sep 16, 2021
@RaduBerinde RaduBerinde deleted the distbucket-rfc branch September 21, 2021 20:47