rfc: distributed token bucket RFC #66436
Conversation
Cursory first pass. I haven't internalized the shares yet.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @kernfeld-cockroach, and @RaduBerinde)
docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):
```sql
BEGIN;
-- Get the latest state (with the largest sequence number).
SELECT * FROM tenant_usage WHERE tenant_id=.. ORDER BY seq DESC LIMIT 1;
```
nit: FOR UPDATE.
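For concreteness, a minimal sketch of the suggested change (the `..` placeholder from the RFC replaced with a hypothetical bind parameter):

```sql
BEGIN;
-- Lock the latest-state row so that concurrent bucket transactions
-- serialize here instead of restarting later on a write-too-old error.
SELECT * FROM tenant_usage
WHERE tenant_id = $1
ORDER BY seq DESC
LIMIT 1
FOR UPDATE;
```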
docs/RFCS/20210604_distributed_token_bucket.md, line 292 at r1 (raw file):
```sql
SELECT seq FROM tenant_usage WHERE tenant_id=.. ORDER BY seq DESC LIMIT 1 OFFSET 1000;
```
We have reverse scanning, why not go from the bottom up?
docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):
> TODO: it's unfortunate that we will have to talk to two ranges on each operation. Could we achieve this with a single index? We could reserve part of the UUID space for the sequence number, allowing one index to serve a dual-purpose, but things would become messy.
Two total indexes do seem in order. However, I don't know that we need two indexes over all of the ledger events. At the end of the day, we're going to be totally ordering all of the updates. What if we had two tables, but one of them (the current state) ends up just being a single-row table? That way, with some work we're not actively doing, it might possibly end up inside the same range, if we don't mandate splits on table boundaries and all of this stays small. I do get that this is all, on some level, begging for interleaving.
Consider a table whose primary key is just the tenant_id and stores the cumulative stats, plus a ledger keyed on (tenant_id, op_id) that stores the sequence number. Would that be better? During GC, at least you'd only need to write to one table.
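A rough sketch of that two-table layout, with assumed names and types (not taken from the RFC):

```sql
-- Current state: one row per tenant, updated in place.
CREATE TABLE tenant_usage_state (
    tenant_id  INT8   NOT NULL PRIMARY KEY,
    seq        INT8   NOT NULL,  -- last applied sequence number
    ru_current FLOAT8 NOT NULL   -- cumulative stats (abbreviated)
);

-- Ledger of applied operations, used for idempotency checks.
CREATE TABLE tenant_usage_ledger (
    tenant_id INT8 NOT NULL,
    op_id     UUID NOT NULL,
    seq       INT8 NOT NULL,     -- sequence number assigned to this op
    PRIMARY KEY (tenant_id, op_id)
);
```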
docs/RFCS/20210604_distributed_token_bucket.md, line 325 at r1 (raw file):
```go
) (
  updateTimestamp time.Time,
  grantedTokens time.Time,
```
is this type right?
(force-pushed 87e9369 to 37927f0)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, and @kernfeld-cockroach)
docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):
Previously, ajwerner wrote…
nit: FOR UPDATE.
We're not updating any existing row; does FOR UPDATE still help in this case?
docs/RFCS/20210604_distributed_token_bucket.md, line 292 at r1 (raw file):
Previously, ajwerner wrote…
We have reverse scanning, why not go from the bottom up?
I guess since these things are sequential, we can just read the latest one and subtract 1000. Updated.
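In other words, something along these lines (a sketch; it assumes each tenant's seq values are dense):

```sql
-- Keep the most recent 1000 entries for the tenant; delete the rest.
DELETE FROM tenant_usage
WHERE tenant_id = $1
  AND seq <= (SELECT max(seq) - 1000 FROM tenant_usage WHERE tenant_id = $1);
```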
docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):
Previously, ajwerner wrote…
TODO: it's unfortunate that we will have to talk to two ranges on each operation. Could we achieve this with a single index? We could reserve part of the UUID space for the sequence number, allowing one index to serve a dual-purpose, but things would become messy.
Two total indexes do seem in order. However, I don't know that we need two indexes over all of the ledger events. At the end of the day, we're going to be totally ordering all of the updates. What if we had two tables, but one of them (the current state) ends up just being a single-row table? That way, with some work we're not actively doing, it might possibly end up inside the same range, if we don't mandate splits on table boundaries and all of this stays small. I do get that this is all, on some level, begging for interleaving.
Consider a table whose primary key is just the tenant_id and stores the cumulative stats, plus a ledger keyed on (tenant_id, op_id) that stores the sequence number. Would that be better? During GC, at least you'd only need to write to one table.
It feels roughly equivalent to me. Updating a single row is no better than inserting new rows (given that we're keeping all versions internally). The GC just happens at another level.
I don't see how that would help with a single range though - we have many tenants, so we would need to interleave the two tables. I guess we could create a table per tenant, but that could be a lot of tables (and I think the proposed schema would work fine in that case too).
A variation on your idea would be to use a (tenant_id, op_id) primary key and store the current state in (tenant_id, 0). That would keep things to a single range if any single tenant's data stays small.
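A sketch of this variation (types assumed; op_id shown as an integer for simplicity, with 0 reserved for the current-state row):

```sql
CREATE TABLE tenant_usage (
    tenant_id  INT8   NOT NULL,
    op_id      INT8   NOT NULL,  -- 0 holds the current state; other values are ledger entries
    seq        INT8   NOT NULL,
    ru_current FLOAT8 NOT NULL,  -- remaining state columns abbreviated
    PRIMARY KEY (tenant_id, op_id)
);
```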
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, and @RaduBerinde)
docs/RFCS/20210604_distributed_token_bucket.md, line 346 at r2 (raw file):
```sql
SELECT crdb_internal(
```
Do we need to give this SQL function a name? Otherwise this function looks good.
(force-pushed 37927f0 to 792a7fb)
docs/RFCS/20210604_distributed_token_bucket.md, line 346 at r2 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
Do we need to give this SQL function a name? Otherwise this function looks good.
Oops, done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, @kernfeld-cockroach, and @RaduBerinde)
docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):
Previously, RaduBerinde wrote…
We're not updating any existing row, does FOR UPDATE still help in this case?
Aren't we going to increment the sequence number to generate the next row? It would be a bummer to have two actors read the current max sequence, then both try to write the next sequence; one wins and the other waits, then writes, then refreshes due to the WTO (write-too-old) error, then fails to refresh and restarts.
docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):
A variation on your idea would be to use a (tenant_id, op_id) primary key and store the current state in (tenant_id, 0). That would keep things to a single range if any single tenant's data stays small.
Yes, that makes sense. I was thinking that if the whole table remains small (as I expect it to), then it could still fit in one-ish range. Imagine 10k tenants × 500 entries per tenant × 64 bytes per entry = ~320 MB, which is one range. We don't want to rely on that, and those numbers are likely too small. On some level this feels like premature optimization.
docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):
idempotency). The sequence numbers allow locating the latest entry. Schema for the system table:
One thing to start thinking about is cold startup time for a new pod in a new region. @andy-kimball has been making noise that some time in the year 2022 we'd like to be able to spin up a SQL pod and serve a query without any global round-trips. What would it take to achieve that goal in this subsystem?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @cucaroach, and @kernfeld-cockroach)
docs/RFCS/20210604_distributed_token_bucket.md, line 277 at r1 (raw file):
Previously, ajwerner wrote…
Aren't we going to increment the sequence number to generate the next row? It would be a bummer to have two actors read the current max sequence, then both try to write the next sequence; one wins and the other waits, then writes, then refreshes due to the WTO (write-too-old) error, then fails to refresh and restarts.
I see, so FOR UPDATE would force these two txns to deal with each other earlier. Done.
docs/RFCS/20210604_distributed_token_bucket.md, line 301 at r1 (raw file):
Previously, ajwerner wrote…
A variation on your idea would be to use a (tenant_id, op_id) primary key and store the current state in (tenant_id, 0). That would keep things to a single range if any single tenant's data stays small.
Yes, that makes sense. I was thinking that if the whole table remains small (as I expect it to), then it could still fit in one-ish range. Imagine 10k tenants × 500 entries per tenant × 64 bytes per entry = ~320 MB, which is one range. We don't want to rely on that, and those numbers are likely too small. On some level this feels like premature optimization.
I don't know if that would be desirable - it might be a bottleneck since all tenants would be hitting the same range all the time. I'd expect the table to split as necessary based on load (even if it would otherwise fit in one range).
docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):
Previously, ajwerner wrote…
One thing to start thinking about is cold startup time for a new pod in a new region. @andy-kimball has been making noise that some time in the year 2022 we'd like to be able to spin up a SQL pod and serve a query without any global round-trips. What would it take to achieve that goal in this subsystem?
I think we could start with an initial amount of tokens and treat that as debt in the first request.
(force-pushed 792a7fb to f48b251)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)
docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):
Previously, RaduBerinde wrote…
I think we could start with an initial amount of tokens and treat that as debt in the first request.
Nice, maybe worthy of a comment somewhere that we can boot instances with some initial tokens to lower latency of startup.
(force-pushed f48b251 to 575a3b1)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)
docs/RFCS/20210604_distributed_token_bucket.md, line 244 at r3 (raw file):
Previously, ajwerner wrote…
Nice, maybe worthy of a comment somewhere that we can boot instances with some initial tokens to lower latency of startup.
Done, mentioned it under the "Initial amount" knob.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @kernfeld-cockroach, and @RaduBerinde)
docs/RFCS/20210604_distributed_token_bucket.md, line 99 at r4 (raw file):
- if there are sufficient "burst" tokens already in the bucket, the tokens are granted immediately;
- otherwise, a fraction of the global refill rate is granted to the node and
How is the fraction determined? Is it the number of tenants with a current share > 0?
(force-pushed 575a3b1 to ab36a35)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)
docs/RFCS/20210604_distributed_token_bucket.md, line 99 at r4 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
How is the fraction determined? Is it the number of tenants with a current share > 0?
I added to the paragraph below. It's just the ratio of node shares to total shares.
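Restating that paragraph as a formula (notation mine): if a node currently holds $s_i$ of a tenant's shares, the rate granted to it is

$$\mathit{rate}_i = R \cdot \frac{s_i}{\sum_j s_j}$$

where $R$ is the tenant's global refill rate and the sum ranges over all nodes' shares.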
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @RaduBerinde)
docs/RFCS/20210604_distributed_token_bucket.md, line 243 at r4 (raw file):
To clarify, multiple tenant usages can have the same sequence number as long as they have different tenant IDs, right? If so, I'd write it like:
Each change has a unique operation ID. For any given tenant, each change has a unique sequence number which allows detection of duplicate requests, to ensure idempotency. The sequence numbers allow locating the latest entry for each tenant.
docs/RFCS/20210604_distributed_token_bucket.md, line 341 at r4 (raw file):
## Configuration API
What mechanism should be used for initial configuration? Just this same function? What would happen if we didn't call this?
docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):
```sql
SELECT crdb_internal.update_tenant_resource_limits(
```
How quickly would this function return? Does it wait until the new entry has been inserted into the table? Is there any advantage to calling this function in a batch? That way we could perhaps lock the table once, update every bucket, and then release the lock.
docs/RFCS/20210604_distributed_token_bucket.md, line 366 at r4 (raw file):
- the refill that would have happened in the delta time period.

## Resilience considerations
Is my understanding correct that, if crdb_internal.update_tenant_resource_limits isn't called for a while, the buckets will continue to operate at a constant capacity? I think that's an important property to have in case there is a problem with the code path responsible for continually adjusting the capacity.
docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):
inaccessible. To achieve this, in the short term each node continues operating at the previous rate of consumption. Longer term, the rate can decay over time toward 1/N of the total refill rate, where N is the number of SQL pods.
How are we going to evaluate the number of SQL pods? Something like, the number of SQL pods known to have done operations within the past 60 seconds? CockroachCloud could provide a hint as to how many SQL pods there should be, but I think that would probably be unnecessary complexity.
(force-pushed d82937e to ebeb800)
Thanks for the comments!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)
docs/RFCS/20210604_distributed_token_bucket.md, line 243 at r4 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
To clarify, multiple tenant usages can have the same sequence number as long as they have different tenant IDs, right? If so, I'd write it like:
Each change has a unique operation ID. For any given tenant, each change has a unique sequence number which allows detection of duplicate requests, to ensure idempotency. The sequence numbers allow locating the latest entry for each tenant.
Done.
docs/RFCS/20210604_distributed_token_bucket.md, line 246 at r4 (raw file):
idempotency). The sequence numbers allow locating the latest entry. Schema for the system table:
To give credit where it is due, this ledger approach was suggested by @andy-kimball.
docs/RFCS/20210604_distributed_token_bucket.md, line 341 at r4 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
What mechanism should be used for initial configuration? Just this same function? What would happen if we didn't call this?
Added
docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
How quickly would this function return? Does it wait until the new entry has been inserted into the table? Is there any advantage to calling this function in a batch? That way we could perhaps lock the table once, update every bucket, and then release the lock.
Yes it would return after the entry is inserted.
I think we should plan to call it once per tenant in the first implementation and investigate batching later. It's not immediately obvious to me that changing all tenants in one transaction is better. Depending on what we decide, it may be possible to implement without adding more syntax, by doing
SELECT crdb_internal.update_tenant_resource_limits(a,b,c,d,..) FROM (VALUES (...)) AS v(a,b,c,d..)
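Spelled out with hypothetical parameter names and values (the actual signature is abbreviated as `a,b,c,d,..` above):

```sql
SELECT crdb_internal.update_tenant_resource_limits(
         tenant_id, available_ru, refill_rate, max_burst_ru)
FROM (VALUES (2, 1000.0, 100.0, 5000.0),
             (3, 2000.0, 200.0, 8000.0)
     ) AS v(tenant_id, available_ru, refill_rate, max_burst_ru);
```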
docs/RFCS/20210604_distributed_token_bucket.md, line 366 at r4 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
Is my understanding correct that, if crdb_internal.update_tenant_resource_limits isn't called for a while, the buckets will continue to operate at a constant capacity? I think that's an important property to have in case there is a problem with the code path responsible for continually adjusting the capacity.
I'm not understanding "constant capacity". The bucket will continue with the last settings; it will keep using RUs as available and it will continue refilling at the same rate.
docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
How are we going to evaluate the number of SQL pods? Something like, the number of SQL pods known to have done operations within the past 60 seconds? CockroachCloud could provide a hint as to how many SQL pods there should be, but I think that would probably be unnecessary complexity.
I am assuming here that we will be able to tell from within the SQL pod how many other SQL pods there are in your cluster. I'm betting that we would need this for other things anyway (like DistSQL or showing per-node info in the UI or virtual tables).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @RaduBerinde)
docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):
Previously, RaduBerinde wrote…
Yes it would return after the entry is inserted.
I think we should plan to call it once per tenant in the first implementation and investigate batching later. It's not immediately obvious to me that changing all tenants in one transaction is better. Depending on what we decide, it may be possible to implement without adding more syntax, by doing
SELECT crdb_internal.update_tenant_resource_limits(a,b,c,d,..) FROM (VALUES (...)) AS v(a,b,c,d..)
Yep, sounds great. Am I correct that updates for different tenants can be processed largely in parallel?
docs/RFCS/20210604_distributed_token_bucket.md, line 366 at r4 (raw file):
Previously, RaduBerinde wrote…
I'm not understanding "constant capacity". The bucket will continue with the last settings; it will keep using RUs as available and it will continue refilling at the same rate.
Okay great, that's what I was trying to say.
docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):
Previously, RaduBerinde wrote…
I am assuming here that we will be able to tell from within the SQL pod how many other SQL pods there are in your cluster. I'm betting that we would need this for other things anyway (like DistSQL or showing per-node info in the UI or virtual tables).
I agree that this assumption will eventually be true, but I also think it's important to have a strategy for how pods can be counted right now or in the very near future.
docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):
```sql
-- Calculate new state.
-- Set the new state, using the next sequence number.
INSERT INTO tenant_usage (tenant_id, seq, ...) VALUES (tenant_id, seq+1, ...);
```
From the CRDB docs, my sense is that generating sequence numbers by selecting and adding 1 is not the most effective way to leverage CRDB's performance, because we might encounter contention on a few ranges. Is there a way to use a strategy from How do I generate unique, slowly increasing sequential numbers in CockroachDB? instead?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)
docs/RFCS/20210604_distributed_token_bucket.md, line 348 at r4 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
Yep, sounds great. Am I correct that updates for different tenants can be processed largely in parallel?
Yes.
docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
I agree that this assumption will eventually be true, but I also think it's important to have a strategy for how pods can be counted right now or in the very near future.
In the near future, we can assume 1 node (the worst consequence would be that we allow too much resource usage while the bucket is unavailable).
docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
From the CRDB docs, my sense is that generating sequence numbers by selecting and adding 1 is not the most effective way to leverage CRDB's performance, because we might encounter contention on a few ranges. Is there a way to use a strategy from How do I generate unique, slowly increasing sequential numbers in CockroachDB? instead?
We need to read the current bucket state no matter what.
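Putting the pieces from this thread together, a minimal sketch of the whole bucket transaction (columns abbreviated, placeholders assumed):

```sql
BEGIN;
-- Read and lock the current bucket state.
SELECT seq, ru_current FROM tenant_usage
WHERE tenant_id = $1
ORDER BY seq DESC LIMIT 1
FOR UPDATE;
-- The server computes the new state (refill since the last update, minus
-- the grant), then appends it under the next sequence number.
INSERT INTO tenant_usage (tenant_id, seq, op_id, ru_current)
VALUES ($1, $2, $3, $4);  -- $2 = previous seq + 1, $3 = request UUID
COMMIT;
```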
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @RaduBerinde)
docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):
Previously, RaduBerinde wrote…
In the near future, we can assume 1 node (the worst consequence would be that we allow too much resource usage while the bucket is unavailable).
I agree with that. But what about the also pretty near future where there may be multiple SQL pods per tenant? I would be satisfied even if we identified which team is responsible for delivering pod-counting functionality.
docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):
Previously, RaduBerinde wrote…
We need to read the current bucket state no matter what.
That makes sense. I imagine we could get away with reading slightly stale bucket state if we thought that would help. But it also sounds like, at the scale of expected updates we aren't at all concerned with update performance here, right?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, and @kernfeld-cockroach)
docs/RFCS/20210604_distributed_token_bucket.md, line 371 at r4 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
I agree with that. But what about the also pretty near future where there may be multiple SQL pods per tenant? I would be satisfied even if we identified which team is responsible for delivering pod-counting functionality.
docs/RFCS/20210604_distributed_token_bucket.md, line 284 at r5 (raw file):
Previously, kernfeld-cockroach (Paul Kernfeld) wrote…
That makes sense. I imagine we could get away with reading slightly stale bucket state if we thought that would help. But it also sounds like, at the scale of expected updates we aren't at all concerned with update performance here, right?
Hm, no, it's critical that we read the most current state (or we'd lose track of consumed units, among other issues).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @kernfeld-cockroach, and @RaduBerinde)
docs/RFCS/20210604_distributed_token_bucket.md, line 242 at r5 (raw file):
## System table

The system table uses a "ledger" approach; each change to the state of a global
What happens when the rate changes by 10x at all 10 pods of the tenant and they come one at a time and execute their transaction? Will they get different shares based on their arrival order?
docs/RFCS/20210604_distributed_token_bucket.md, line 261 at r5 (raw file):
```sql
ru_current FLOAT NOT NULL,
current_share_sum FLOAT NOT NULL,
```
Is each tenant row maintaining a share sum across all tenants? Or maybe there is a total order across all rows, since the earlier text mentioned "ledger" -- I am not seeing how that total order comes about. Is it the seq?
docs/RFCS/20210604_distributed_token_bucket.md, line 281 at r5 (raw file):
```sql
BEGIN;
-- Get the latest state (with the largest sequence number).
SELECT * FROM tenant_usage WHERE tenant_id=.. ORDER BY seq DESC LIMIT 1 FOR UPDATE;
```
does this tenant not need the state of other tenants to compute its share of the rate?
docs/RFCS/20210604_distributed_token_bucket.md, line 380 at r5 (raw file):
The system must have reasonable behavior if the bucket range becomes temporarily inaccessible. To achieve this, in the short term each node continues operating
What is "short term" here?
This dependency on a fully functioning transactional system worries me, and even more so for a geo partitioned cluster.
This change adds the TokenBucket API proposed in the RFC (cockroachdb#66436), a stub implementation and client for it, and the corresponding KV connector interface. The client and server-side code lives in ccl/multitenantccl/tenantcostclient and tenantcostserver. Release note: None
67067: server: require admin role to access node status r=bdarnell a=knz

Release note (security update): The node status retrieval endpoints over HTTP (`/_status/nodes`, `/_status/nodes/<N>` and the web UI `/#/reports/nodes`) have been updated to require the `admin` role from the requesting user. This ensures that operational details such as network addresses and command-line flags do not leak to unprivileged users.

67733: colexecbase: extend support of casts r=yuzefovich a=yuzefovich

Addresses: #48135
See individual commits for details. After this PR we only need to add more casts between natively supported types.

67768: sql, server: add skeleton TokenBucket connector and tenant resource limits configuration APIs r=RaduBerinde a=RaduBerinde

This PR is a scaled back version of #67508 where we don't use the system table at all. It's meant to put some of the infrastructure pieces in place and provide a stub API for reconfiguration. The plan is to add consumption metrics on top of this soon so that CC can develop in parallel.

---

#### server: add TokenBucket connector API

This change adds the TokenBucket API proposed in the RFC (#66436), a stub implementation and client for it, and the corresponding KV connector interface. The client and server-side code lives in ccl/multitenantccl/tenantcostclient and tenantcostserver.

Release note: None

#### sql: tenant resource limits configuration API

This commit adds a `crdb_internal.update_tenant_resource_limits` internal SQL function (to be used by the system tenant) which updates the token bucket configuration for a specific tenant.

Release note: None

67840: sql: add test for creating stats on tables with expression indexes r=mgartner a=mgartner

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Add the system table described in the RFC (cockroachdb#66436). The table is only created for the system tenant. Release note: None
(force-pushed 5e18817 to ef2acf9)
Updated to use Andrew's idea for the system table (after refining it together a bit more). This solves the problem of cleaning up the shares when an instance goes away; we no longer need the shares "decay". Also see #68115 where this is prototyped.
I still plan to address the open comments around more examples and calculations for how much KV churn we expect to generate.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @joshimhoff, @knz, @RaduBerinde, @sumeerbhola, and @tbg)
(force-pushed ef2acf9 to 6c278ea)
68115: tenantcostserver: use the tenant_usage system table r=RaduBerinde a=RaduBerinde

This PR prototypes a new tenant_usage schema based on @ajwerner's suggestion in the RFC (#66436). I will update the RFC after getting some initial feedback here.

#### sql: add tenant_usage system table

Add the system table described in the RFC (#66436). The table is only created for the system tenant.

Release note: None

#### tenantcostserver: use the tenant_usage system table

This change implements most of the interaction with the tenant_usage system table, with the exception of dead instances detection and clean-up. We currently tolerate an empty table, but it would be cleaner to initialize the tenant state (instance_id=0 row) at tenant creation time (+ implement a migration). I will explore this in a future change, when we add some configurable defaults for the refill rate etc.

Release note: None

Co-authored-by: Radu Berinde <[email protected]>
- a limit on how many unused tokens (RUs) we can accumulate as burst. Refill is essentially paused when the bucket has more tokens than the limit.

Each SQL pod implements a local token bucket and uses it for admission control
I understand that this RFC does not cover overload admission control, but I am interested in the intersection between the 2 subsystems.
- What happens if a SQL tenant uses tokens only to be rejected by the overload admission control?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @joshimhoff, @knz, @RaduBerinde, @sumeerbhola, @tbg, and @vy-ton)
docs/RFCS/20210604_distributed_token_bucket.md, line 70 at r12 (raw file):
Previously, vy-ton (Vy Ton) wrote…
I understand that this RFC does not cover overload admission control, but I am interested in the intersection between the 2 subsystems.
- What happens if a SQL tenant uses tokens to only be rejected by the overload admission control?
In its current incarnation, overload control does not reject operations, it just delays them.
But it is a very good question - if an operation required tokens and it later hits an error, we should not report those RUs as consumed.
(force-pushed 6c278ea to 6e9ac40)
Design of a subsystem relevant in the multi-tenant setting ("serverless") which rate limits tenant KV operations in conformance to a budget target. Release note: None
(force-pushed 6e9ac40 to 10748eb)
Added some back-of-the-envelope calculations for the workload on the system table.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @joshimhoff, @knz, @RaduBerinde, @sumeerbhola, @tbg, and @vy-ton)
docs/RFCS/20210604_distributed_token_bucket.md, line 120 at r5 (raw file):
Previously, sumeerbhola wrote…
+1 for including such back of the envelope calculations.
I think the number of versions is limited since each update is using a new seqnum, but it is worth calling out in the Performance considerations section.
Done.
I will merge this for now because I don't have much time to improve it in the immediate future. bors r+

Build succeeded.
Link to RFC text.
This is an initial draft.
Design of a subsystem relevant in the multi-tenant setting
("serverless") which rate limits tenant KV operations in conformance
to a budget target.
Release note: None