
rfcs: multitenant cluster settings #73349

Merged (1 commit) on Dec 15, 2021

Conversation

RaduBerinde
Member

@RaduBerinde RaduBerinde commented Dec 1, 2021

Release note: None


Link to RFC text.

@RaduBerinde RaduBerinde requested a review from a team as a code owner December 1, 2021 21:57

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 45 at r1 (raw file):

 - in certain cases tenant code may need to consult values for cluster settings
   that apply to the host cluster: for example
   `kv.closed_timestamp.follower_reads_enabled` applies to the KV subsystem but

I think that this setting is a good example of something that probably shouldn't be a cluster setting but rather should be a zone configuration. See #70614

This is sort of irrelevant to this proposal; anything we move to zone/span configs, we don't need to think about as settings. However, @irfansharif and @arulajmani, we do need to think about how the tenants can read the defaults coming out of the host tenant that apply to them.

Collaborator

@arulajmani arulajmani left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 45 at r1 (raw file):

we do need to think about how the tenants can read the defaults coming out of the host tenant that apply to them.

This would be a natural extension to what @adityamaru and I are planning to propose in the (soon to be out) protected timestamp RFC. We're proposing a way to allow the host tenant to configure span configuration fields that apply to all or particular secondary tenants. The idea is to introduce a span configuration hierarchy in the host tenant that is hydrated on demand. This could be extended to give the host tenant the ability to set defaults or overrides for particular fields -- it seems like `kv.closed_timestamp.follower_reads_enabled` could use this, if we build a listening mechanism similar in principle to what is described for System visible below.

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

   An alternative approach previously suggested in #68406 would be to store
   these settings on the host side (for all tenants, in a new system table).
   However, this would be considerably more work.

Hm. Considerably more initial work perhaps -- though even then, I'm not sure it's that bad, if we're already extending the TenantConnector above to provide System-visible settings to the tenant, to also hook it up to a scan of tenant-specific settings values. But what I think might be missing from this "considerably more work" determination is that we would have an ongoing cost: maintaining the new requirement that there are keys in the tenant's SQL table keyspace that the tenant cannot write by any means. This seems pretty significant and easy to mess up (e.g. as mentioned above via RESTORE, or who knows what else we might overlook). I'm very apprehensive about this aspect of the proposal.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 152 at r1 (raw file):

4. Tenant

   This class is consistent with the current semantics (it's already

If we're building out the tenant connector to provide setting values from the host cluster anyway, and we extend that implementation as I suggest above to include looking up tenant-specific values, then I think we could also have a really useful feature for settings in this class: the ability for the host cluster to override any of these settings at runtime with a host-set value. Essentially, by creating an override of setting Y for tenant X, the host would temporarily move that setting from this class up to tenant read-only. This seems like it could be quite useful to SREs/operators of MT clusters when responding to incidents, until a code change can be deployed.

Indeed, I almost wonder if these two classes even need to be distinct in the code or if the difference between tenant r/w and tenant r/o is just whether or not a host-cluster override exists.

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…


It's not the initial scan I care about; it's keeping it up-to-date. What's the proposal there? We add a streaming RPC backed by a rangefeed? That's not crazy.

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, ajwerner wrote…


we're already proposing having that streaming RPC for the system-visible ones above, so I'm just proposing it also include tenant-specific values. Presumably that RPC is already backed by a rangefeed on system.settings for the system-visible class, so we could either have a second rangefeed on a new guest-tenant-settings table, or figure out a way to store guest tenant settings overrides in the host's system.settings (name prefix, new column, etc.).

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…


I suppose that that's true. When we first talked about it, I was thinking we could hijack the existing gossip connector, but I want to kill that thing so yeah.

Member Author

@RaduBerinde RaduBerinde left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, ajwerner wrote…


I agree about the ongoing cost.

One aspect to consider is how unified the code and logic for the system and non-system tenants will be. Using the tenant's settings table relies on most of the existing system tenant code. We would probably want to switch the system tenant to use the same mechanism for its settings.

There is something to the idea of somehow extending the settings table, I will give that some thought.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 152 at r1 (raw file):

Previously, dt (David Taylor) wrote…


How is this functionality different from the proposed SET TENANT CLUSTER SETTING? Is it that not only can the host change the setting, but it can also (temporarily) prevent the tenant from modifying it?

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 152 at r1 (raw file):

Previously, RaduBerinde wrote…


right -- I'm proposing the host cluster can set any setting using that syntax, not just those explicitly put in the "tenant read-only" class ahead of time, since the tenant pod would always read the host's exported override before its own local value. Thus SET TENANT CLUSTER SETTING could be used in an incident to override any setting, and it wouldn't matter if the guest tenant had their own value/tried to change it/etc.

Member Author

@RaduBerinde RaduBerinde left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, RaduBerinde wrote…


In this "host-side" proposal - how do tenants change a cluster setting? A tenant connector API?

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, RaduBerinde wrote…


Oh, I was proposing host-side storage only for the host-set values -- tenant-set values would still be written to their local system.settings table and that value would be read into their settings.Values, all just like it is today.

What I was suggesting is if the host wants to force a given setting to have a given value for a specific tenant, it can write that to some table in the host key span, either a special tenant settings table or some prefix of its system.settings table, and then that value flows over the tenant connector to the tenant (just like "host-visible" settings are described as doing above), and this would cause the in-mem setting updater, in the tenant pod, to use that value instead of looking in its local table.

Basically, today the value of a setting is `if (has value in system.settings) { that value } else { default }`, and I'd extend that to `if (got value for this setting from host via connector) { that value } else if (has value in (my) system.settings) { that value } else { default }`.

With that change to the settings updater, we could implement permanent tenant read-only settings (just write a value in the host) that are reliably stored out of mutation reach of the tenant, as well as temporary overrides for any (normally tenant-settable) setting, which seems nice.

Member Author

@RaduBerinde RaduBerinde left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…


Wouldn't it be simpler though to not maintain two separate mechanisms/locations? If we allow the ability to store any setting on the host side, why not store all of them there?

So I guess I'm currently favoring this design:

  • system.settings becomes a table that only exists in / is only used by the system tenant.
  • the table is enhanced to be able to store per-tenant values of certain settings. We could do this e.g. by prepending the tenant ID to the cluster setting name.
  • on each KV node, the system tenant sets up a range feed on the entire table and maintains all setting values (for all tenants).
  • the tenant connector has new APIs to listen for changes to cluster settings, and to set cluster settings.

Thoughts on this?

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, RaduBerinde wrote…


What's the rate limiting/abuse prevention mechanism?

Some reasons to keep two copies are:

  1. there's never any risk of a tenant trying to write to the system tenant
    • Helps with abuse prevention
  2. we can't currently split the settings table, and every time we write to it, it gets gossiped (we do plan to stop this in this release, but it seems bad to couple that with this architectural change)
    • Settings are usually small, so this probably doesn't matter too much.
  3. It makes it possible to tell a user story where the user has had a setting explicitly overridden; there's no possibility of a game where we try to set something and they try to set it back.
    • I worry about cases where we made the wrong choice about whether we want to let the client set it.

@andy-kimball
Contributor


docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, ajwerner wrote…


I think backup/restore would mess this up. Per-tenant settings should be transactional (right?), in which case if I restore to an earlier version of my tenant cluster, I'd expect all per-tenant settings to be reset to what's in the backup. But if we're storing the per-tenant settings in the system tenant, this won't work properly.

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Per-tenant settings should be transactional (right?),

That is a really big deal I forgot about: the bookkeeping for migrations is, at its core, rooted in the cluster setting state for the version key. We need some amount of control over atomicity there.

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, ajwerner wrote…


Hm. I sorta like the separation that two places gives us, in that it means tenant-written things remain confined to the tenant span, while host-written things are never in the tenant span. That separation seems nice for a couple of reasons: a backup of the tenant span captures, for example, all the settings they set along with everything else they've written. It also means we don't need to worry about an abuse angle, where you could store data in settings without it counting towards your total stored data; and confining tenant-initiated writes to tenant spans also sidesteps questions like DoS-ing the host's table with big requests. I'm sure we could mitigate each of these -- teach backup to go find more tenant-owned data outside the tenant span, add rate limits, and such -- but it seems like confining tenant writes to the tenant span as a hard rule is simpler to maintain, at the cost of, yeah, reading two things.

Contributor

@andy-kimball andy-kimball left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 73 at r1 (raw file):

   tenants see the same value for these settings. Changing one of these settings
   (from the system tenant) results in all tenants (asynchronously) getting the
   updated value.

Are these also read-only then? If so, I'd add that explicitly in the text.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 83 at r1 (raw file):

   via a new statement:
 SET TENANT <tenant> CLUSTER SETTING <setting> = <value>

I'm not sure how much control we have, but I think this would be more consistent with how SQL syntax works elsewhere:

SET CLUSTER SETTING <setting> = <value> [FOR TENANT <tenant>]

docs/RFCS/20211106_multitenant_cluster_settings.md, line 85 at r1 (raw file):

     SET TENANT <tenant> CLUSTER SETTING <setting> = <value>

We will also add a SHOW [ALL] TENANT <tenant> CLUSTER SETTINGS statement.

Similarly:

SHOW [ALL] CLUSTER SETTINGS [FOR TENANT <tenant>]

docs/RFCS/20211106_multitenant_cluster_settings.md, line 115 at r1 (raw file):

   There is not much work to do here, other than fully hiding these settings
   from non-system tenants. The cluster settings subsystem will not allow
   accessing these values from a tenant process (it will crash the tenant

This sounds like just trying to get or set these values will crash the process, which doesn't sound right. There shouldn't be a way within SQL to force-crash a process.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 130 at r1 (raw file):

   Values for these settings do not exist in tenant instances of the
   `system.settings` table. They only live in the system tenant's instance, with

How would SHOW ALL CLUSTER SETTINGS work? Would it show "System visible" setting as well as tenant settings? Does it have to union the settings in the tenant's system.settings table with ones it gets from the KV connector?


docs/RFCS/20211106_multitenant_cluster_settings.md, line 180 at r1 (raw file):

infrastructure as for `Tenant read-only` settings. This has the advantage of not
having to implement any new infrastructure, but it would complicate things on
the host side - when changing any such setting, we would have to change the

Is this so bad? How often do we change/add these settings? Could we use long-running migrations when we do (assuming it's very infrequent)?

Contributor

@andy-kimball andy-kimball left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…


I agree that we should keep per-tenant stuff in the tenant keyspace. Not doing so will lead us down a similar dark path to the one we've been trying to escape by separating logical SQL info from physical KV info as part of the multi-tenancy project. System tenant info belongs in the system tenant keyspace; regular tenant info belongs in the regular tenants' keyspaces. Otherwise, we have to do all sorts of nasty fixup when we backup/restore, stream clusters from one host to another, etc.

Member Author

@RaduBerinde RaduBerinde left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 115 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…


They would not exist as far as the user is concerned (i.e. they would get an "invalid cluster setting" error).


docs/RFCS/20211106_multitenant_cluster_settings.md, line 130 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…


Yes, it would union them.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 180 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…


The problem is also that we'd have to keep them in sync, e.g. if a backup restores older values. In any case, I think it's moot: it looks like we're moving in the direction of tenant read-only settings also being stored on the host side.

Contributor

@irfansharif irfansharif left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…


I'm having some trouble following this thread. I've tried summarizing something below, with some questions around how many places a setting's value can be found. Is this what we're thinking?

  1. system.settings exists for all tenants, including the host tenant. It stores all tenant read+writable cluster settings. Example: sql.notices.enabled. For the host tenant, it would also include the "system hidden" settings, such as kv.allocator.qps_rebalance_threshold.

  2. system.tenant_settings (or similar) to store tenant-scoped read-only settings. The table exists only on the host tenant and is writable only by the host tenant. Each tenant pod would maintain a rangefeed over the relevant portion of the table to asynchronously receive such updates. When reading the value of a cluster setting, each tenant pod would stitch the values from (1) and (2) together. Are these settings disjoint? Or do we want the value from the host tenant to act as a break-the-glass override?

  3. system.global_settings (or similar) on the host tenant storing "system visible" settings, visible to all secondary tenants through the connector/rangefeed. For the host tenant, it's just another table to consult in addition to system.settings. Are the settings in this table disjoint from (1) and (2)? Or do we want it to also act as a global override of sorts?

If so, SGTM. It gives us txn-al semantics for all tenants (secondary or otherwise) over the settings they can write to. Backup/restore also works as you'd expect, affecting only the tenant's keyspace. When restoring the host tenant, I'm not sure if we restore secondary tenant metadata (system.tenants etc.) -- we must, right? That too would behave as expected. Fits in well with our threat model.

Member

@dt dt left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, @andy-kimball, @irfansharif, @knz, @otan, @RaduBerinde, and @stevendanna)


docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I'm having some trouble following this thread. I've tried summarizing something below, with some questions around how many places a setting's value can be found. Is this what we're thinking?

  1. system.settings exists for all tenants, including the host tenant. It stores within it all tenant read+writable cluster settings. Example: sql.notices.enabled. For the host tenant, it would also include the "system hidden" settings, such as kv.allocator.qps_rebalance_threshold.

  2. system.tenant_settings (or similar) to store tenant-scoped read-only settings. Table exists only on the host tenant, writable only by the host tenant. Each tenant pod would maintain a rangefeed to the relevant portion of the table to asynchronously receive such updates. In each tenant pod, when reading the value of a cluster setting, it would stitch the values from (1) and (2) together. Are these settings disjoint? Or do we want the value from the host tenant to act as a break-the-glass override?

  3. system.global_settings (or similar) on the host tenant storing "system visible" settings, visible to all secondary tenants through the connector/rangefeed. For the host tenant, it's just another table to consult in addition to system.settings. Are the settings in this table disjoint from (1) and (2)? Or do we want it to also act as a global override of sorts?

If so, SGTM. It gives us txn-al semantics for all tenants (secondary or otherwise) over the settings they can write to. Backup/restores also work as you'd expect, affecting only the tenant's keyspace. When restoring the host tenant, I'm not sure if we restore secondary tenant metadata (system.tenants etc.) -- we must, right? That too would behave as expected. Fits in well with our threat model.

  1. Yes. Every tenant, system included, has a system.settings table that they can read from and write to -- it is theirs.
  2. / 3. I think it's TBD exactly whether or not those are distinct from each other, or even whether they are distinct from the system tenant's system.settings if it learned to store additional copies of settings it wanted to apply to guest tenants. I don't think we need a separate global_settings at least.

My straw man is we make one new system table and it looks something like this:
CREATE TABLE system.tenant_setting_overrides (tenant_id INT NOT NULL, setting STRING NOT NULL, value BYTES NULL, reason STRING, PRIMARY KEY (tenant_id, setting)).

In this table, a tenant_id of 0 would indicate an override which applies to all tenants, while an ID > 0 indicates that it applies to that specific tenant.

The tenant connector RPC then opens two* rangefeeds: one on the host's system.settings, and one on this table.
When a host system.settings row changes, if that row is for a "host visible" setting, its value is emitted over the connector.
When a system.tenant_settings_overrides row changes, if its tenant_id is either 0 or the connected tenant's ID, its new value (or deletion thereof) and whether it was global (id=0) or specific (id=tenant) is emitted over the connector.

  (*) We could make this three rangefeeds if we wanted to separately listen to the id=0 and id=tenant prefixes of the overrides table to ignore irrelevant updates, as a crude form of predicate pushdown.

If we want all tenants to use some new value for a setting, i.e. make it "tenant read-only" but with the same value for all tenants (which differs from host-visible in that that value is not necessarily the host's value, but isn't tenant-specific either), we'd `INSERT INTO system.tenant_settings_overrides (0, some-setting, some-value, 'optionally explain why')`.
If we then want tenant 42 to actually use its own, specific value for that setting, ditto but 42 instead of 0.

Note above I said that when we emit to the tenant the override value for a specific setting, we include whether that override was specific to them or global. We could of course not do that, and just send them the specific value if it exists, or the global one if not. However, by sending them both, if the tenant hangs on to both, then when we clear a specific override we don't have to dig up the all-tenant value, if there is one, for them to fall back to. Instead, the tenant settings updater can just be asked to persist both overrides -- tenant-specific and tenant-global -- separately whenever they are updated, and then always use the specific one, if set, first.

An alternative to system.tenant_setting_overrides would be to extend the schema of system.settings to include an extra column for tenant_id so that in host clusters it also stores guest tenant overrides, e.g. you could make the PK tenant_id, setting but this seems a little messier: say you then store your own settings (i.e. when you run SET CLUSTER SETTING) under your own tenant_id, i.e. as if they were overrides for you. Now this table isn't portable between tenant ids, which is gross for restore. I suppose you could instead say you store your own under 1, since it is reserved, but that's sorta gross too, and you can't use null in a pk. I dunno, I'm sure we could figure it out, but it doesn't seem compelling to me to mix them: system.settings are "my" settings; the ones I set that affect how I run. system.tenant_settings_overrides are values I tell tenants about, but I don't use to control my own behavior. These seem like separate things, so I don't particularly see an advantage to mixing them.
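The lookup order implied by the straw man above -- tenant-specific override, then the id=0 all-tenants override, then the tenant's own setting, then the code default -- can be sketched in a few lines. This is a toy model only (Python purely for illustration; the real implementation would be Go, and every name below is hypothetical):

```python
# Toy model of the straw-man precedence: rows from the hypothetical
# system.tenant_setting_overrides table are keyed by (tenant_id, setting),
# with tenant_id 0 meaning "applies to all tenants". Illustrative names,
# not actual CockroachDB code.

CODE_DEFAULTS = {"sql.notices.enabled": "true"}

def resolve(setting, tenant_id, overrides, local):
    """overrides: {(tenant_id, setting): value}; local: the tenant's own
    system.settings contents. The first layer that is set wins."""
    specific = overrides.get((tenant_id, setting))
    if specific is not None:
        return specific  # host override targeted at this tenant
    all_tenants = overrides.get((0, setting))
    if all_tenants is not None:
        return all_tenants  # host override targeted at every tenant
    if setting in local:
        return local[setting]  # the tenant's own SET CLUSTER SETTING value
    return CODE_DEFAULTS[setting]  # hardcoded default
```

With this shape, deleting the (42, setting) row makes tenant 42 fall back to the (0, setting) row automatically, which is the fallback behavior discussed above.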

Member

@dt dt left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, @andy-kimball, @irfansharif, @knz, @otan, @RaduBerinde, and @stevendanna)


docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…
  1. Yes. Every tenant, system included, has a system.settings table that they can read from and write to -- it is theirs.
  2. / 3. I think it's TBD exactly whether or not those are distinct from each other, or even whether they are distinct from the system tenant's system.settings if it learned to store additional copies of settings it wanted to apply to guest tenants. I don't think we need a separate global_settings at least.

My straw man is we make one new system table and it looks something like this:
CREATE TABLE system.tenant_setting_overrides (tenant_id INT NOT NULL, setting STRING NOT NULL, value BYTES NULL, reason STRING, PRIMARY KEY (tenant_id, setting)).

In this table, a tenant_id of 0 would indicate an override which applies to all tenants, while an ID > 0 indicates that it applies to that specific tenant.

The tenant connector RPC then opens two* rangefeeds: one on the host's system.settings, and one on this table.
When a host system.settings row changes, if that row is for a "host visible" setting, its value is emitted over the connector.
When a system.tenant_settings_overrides row changes, if its tenant_id is either 0 or the connected tenant's ID, its new value (or deletion thereof) and whether it was global (id=0) or specific (id=tenant) is emitted over the connector.

  (*) We could make this three rangefeeds if we wanted to separately listen to the id=0 and id=tenant prefixes of the overrides table to ignore irrelevant updates, as a crude form of predicate pushdown.

If we want all tenants to use some new value for a setting, i.e. make it "tenant read-only" but with the same value for all tenants (which differs from host-visible in that that value is not necessarily the host's value, but isn't tenant-specific either), we'd `INSERT INTO system.tenant_settings_overrides (0, some-setting, some-value, 'optionally explain why')`.
If we then want tenant 42 to actually use its own, specific value for that setting, ditto but 42 instead of 0.

Note above I said that when we emit to the tenant the override value for a specific setting, we include whether that override was specific to them or global. We could of course not do that, and just send them the specific value if it exists, or the global one if not. However, by sending them both, if the tenant hangs on to both, then when we clear a specific override we don't have to dig up the all-tenant value, if there is one, for them to fall back to. Instead, the tenant settings updater can just be asked to persist both overrides -- tenant-specific and tenant-global -- separately whenever they are updated, and then always use the specific one, if set, first.

An alternative to system.tenant_setting_overrides would be to extend the schema of system.settings to include an extra column for tenant_id so that in host clusters it also stores guest tenant overrides, e.g. you could make the PK tenant_id, setting but this seems a little messier: say you then store your own settings (i.e. when you run SET CLUSTER SETTING) under your own tenant_id, i.e. as if they were overrides for you. Now this table isn't portable between tenant ids, which is gross for restore. I suppose you could instead say you store your own under 1, since it is reserved, but that's sorta gross too, and you can't use null in a pk. I dunno, I'm sure we could figure it out, but it doesn't seem compelling to me to mix them: system.settings are "my" settings; the ones I set that affect how I run. system.tenant_settings_overrides are values I tell tenants about, but I don't use to control my own behavior. These seem like separate things, so I don't particularly see an advantage to mixing them.

I sorta handwaved this above, but the consumer of this connector's stream of overrides would be the settingsUpdater in tenant pods.

My straw man there is we extend settings.Values to have a couple layers of atomic value in each slot, e.g. hostOverride, hostGlobal, local. local is just the current value -- whatever my settingsUpdater read from my system.settings, or the code default if not (so it always has a value). Any read of a setting takes the first one that is set. As an optimization, if the two extra atomic reads are too expensive, we could easily keep a flattened currentValue container too.

As described above, when the updater sees an override come in on the connector, it updates the corresponding container. As mentioned above, we keep separate hostOverride and hostGlobal containers so that if the host sends {TenantSpecific, 'setting-foo', &'value-bar'}, then {AllTenants, 'setting-foo', &'value-baz'} and then finally just sends a {TenantSpecific, 'setting-foo', nil} when our override is deleted, we want to be able to just clear hostOverride and then have hostGlobal containing value-baz take over, instead of demanding the host go look up/re-emit value-baz for us.
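The layered containers described above can be modeled as a toy sketch (hostOverride / hostGlobal / local are the straw-man names from this thread, not actual settings.Values fields; Python used purely for illustration of the logic):

```python
# Each setting slot keeps the three straw-man layers; reads take the first
# layer that is set. Applying a deletion (value=None) to the tenant-specific
# layer lets the retained all-tenants value take over without the host
# having to re-emit it. Illustrative only, not CockroachDB code.

TENANT_SPECIFIC, ALL_TENANTS = "TenantSpecific", "AllTenants"

class Slot:
    def __init__(self, code_default):
        self.host_override = None  # per-tenant override from the host
        self.host_global = None    # all-tenants override from the host
        self.local = None          # whatever our settingsUpdater read locally
        self.code_default = code_default

    def apply(self, scope, value):
        # value of None means the override was deleted.
        if scope == TENANT_SPECIFIC:
            self.host_override = value
        else:
            self.host_global = value

    def get(self):
        for v in (self.host_override, self.host_global, self.local):
            if v is not None:
                return v
        return self.code_default
```

Replaying the sequence above -- {TenantSpecific, 'value-bar'}, then {AllTenants, 'value-baz'}, then {TenantSpecific, nil} -- leaves get() returning 'value-baz' without any extra lookup on the host side.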

@RaduBerinde
Member Author


docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I sorta handwaved this above, but the consumer of this connector's stream of overrides would be the settingsUpdater in tenant pods.

My straw man there is we extend settings.Values to have a couple layers of atomic value in each slot, e.g. hostOverride, hostGlobal, local. local is just the current value -- whatever my settingsUpdater read from my system.settings, or the code default if not (so it always has a value). Any read of a setting takes the first one that is set. As an optimization, if the two extra atomic reads are too expensive, we could easily keep a flattened currentValue container too.

As described above, when the updater sees an override come in on the connector, it updates the corresponding container. As mentioned above, we keep separate hostOverride and hostGlobal containers so that if the host sends {TenantSpecific, 'setting-foo', &'value-bar'}, then {AllTenants, 'setting-foo', &'value-baz'} and then finally just sends a {TenantSpecific, 'setting-foo', nil} when our override is deleted, we want to be able to just clear hostOverride and then have hostGlobal containing value-baz take over, instead of demanding the host go look up/re-emit value-baz for us.

@irfansharif on 2, tenant pods can't set up rangefeeds outside of their tenant-prefixed KV span. Each KV node would need to set up this range feed and be ready to serve the values via the tenant connector.

@craig craig bot merged commit 45390e7 into cockroachdb:master Dec 15, 2021
@RaduBerinde
Member Author

I opened #73857 to track the various implementation pieces. Please take a look and let me know if I missed anything.

@RaduBerinde RaduBerinde deleted the settings-rfc branch December 16, 2021 01:39
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 16, 2021
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 17, 2021
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 20, 2021
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 20, 2021
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
craig bot pushed a commit that referenced this pull request Dec 21, 2021
72992: util/log: fix redactability of logging tags r=abarganier a=knz

Fixes #72905.

Some time in the v21.2 cycle, the log entry preparation logic was
refactored and a mistake was introduced: the logging tags were no
longer subject to the redaction logic. The result was that redaction
markers were missing in log tag values, and if a value had contained
unbalanced redaction markers in a value string (say, as part of a SQL
table key), it would have caused log file corruption and possibly a
confidential data leak.

This patch fixes that, by preparing the logging tags in the same way
as the main message for each entry.

Release note (cli change): A bug affecting the redactability of
logging tags in output log entries has been fixed. This bug had
been introduced in the v21.2 release.

73937: setting: introduce setting classes r=RaduBerinde a=RaduBerinde

This commit introduces the three setting classes in the RFC (#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None

73978: opt: fix like escape processing for span constraints r=cucaroach a=cucaroach

Fixes: #44123

Previously no attempt was made to properly handle escape ('\\') sequences
in like patterns being turned into constraints. Refactor code used to
process like at runtime to generate a regexp and use that to properly
handle index constraint generation.

Release note (sql change): Escape character processing was missing from
constraint span generation which resulted in incorrect results when
doing escaped like lookups.

74102: sql: do not fetch virtual columns during backfill r=mgartner a=mgartner

Fixes #73372

Release note (bug fix): A bug has been fixed that caused internal errors
when altering the primary key of a table. The bug was only present if
the table had a partial index with a predicate that referenced a virtual
computed column. This bug was present since virtual computed columns
were added in version 21.1.0.

74110: bazel: require setting `cockroach_cross=y` to opt into cross toolchains r=irfansharif a=rickystewart

With `--incompatible_enable_cc_toolchain_resolution` set in #73819, now
Bazel selects the appropriate toolchain for you. Bazel was selecting the
`cross_linux_toolchain` when building for the host platform on Linux,
resulting in link errors when trying to compile `stress` under `race`.
We update the toolchains to instead require opting into the cross
toolchains by defining `cockroach_cross=y`.

Closes #73997.

Release note: None

74111: bench/rttanalysis: allow roundtrips to be off by 1 r=ajwerner a=ajwerner

If we don't have a range, let the currently estimate be wrong by 1.
We mostly care about the ballpark and the growth rate. I'm sick of
these flakes.

Fixes #73884.

Release note: None

74152: ci: don't delete `test.json.txt` after processing r=tbg a=rickystewart

We've seen some issues where Bazel jobs are failing in `github-post`
(#73841, #74013) with the following output:

    found outstanding output. Considering last test failed:

It's hard to say what the problem is because these scripts haven't kept
the `test.json.txt` in `artifacts`. Here I remove the logic to clean up
the file so we can RC further instances of the problem.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 23, 2021
This change implements the `system-only` semantics in the RFC
(cockroachdb#73349).

All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.

Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.
gustasva pushed a commit to gustasva/cockroach that referenced this pull request Jan 4, 2022
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
Contributor

@vy-ton vy-ton left a comment

Some questions/comments as I make my way through my RFC review backlog from the PM perspective.


New semantics for existing statements for tenants:
- `SHOW [ALL] CLUSTER SETTINGS` shows the `tenant-ro` and `tenant-rw` settings.
`Tenant-ro` settings that have an override from the host side are marked as
Contributor

tenant-rw settings can also have an override and should also be marked differently?

- `ALTER TENANT <id> RESET CLUSTER SETTING <setting>`
- Resets the tenant setting. For `tenant-ro`, the value reverts to the `ALL`
value (if it is set), otherwise to the setting default. For `tenant-rw`,
the value reverts to the `ALL` value (if it is set), otherwise to whatever
Contributor

To check my understanding, we will have 3 possible values for each setting?

  • default value
  • ALL value
  • tenant specific value(s)


### SQL changes

New statements for the system tenant only (which concern only `tenant-ro` and
Contributor

On single-tenant clusters, will these statements return an error then?

PRIMARY KEY (tenant_id, name)
)
```
This table is only used on the system tenant. All-non-system-tenant values
Contributor

All-non-system-tenant values are stored in tenant_id=0.

Why are they not stored under a different tenant_id?

active.

The settings on the system tenant will continue to work. On non-system tenants,
any locally changed settings that are now `system` or `tenant-rw` will revert to
Contributor

This should be system or tenant-ro, no? Since those are the ones non-system tenants can no longer modify

`tenant-rw` setting.

- control settings relevant to tenant-specific internal implementation (like
tenant throttling) that we want to be able to control per-tenant should be
Contributor

Will system settings need to be controlled per tenant? I thought they only applied to the host cluster

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20211106_multitenant_cluster_settings.md, line 101 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

On single-tenant clusters, will these statements return an error then?

The statements that require a valid tenant id would error out since there is no valid id that you can use. ALTER TENANT ALL would work (with no consequence). There is no way to differentiate between a single-tenant cluster and a host cluster that has no tenants created yet.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 117 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

To check my understanding, we will have 3 possible values for each setting?

  • default value
  • ALL value
  • tenant specific value(s)

Conceptually there are four values, as follows:

  1. tenant-specific value set by the host cluster (via ALTER TENANT <id> SET CLUSTER SETTING).
  2. all-tenant value set by the host cluster (via ALTER TENANT ALL SET CLUSTER SETTING).
  3. value set by the tenant (only for tenant-rw!).
  4. default value (hardcoded).

This is the order of precedence - the top-most that is set is the visible value.
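One way to picture how this precedence interacts with the setting classes is a short sketch (hypothetical Python, purely illustrative; it assumes the tenant-set value at layer 3 only participates for tenant-rw settings, since tenants cannot set tenant-ro values):

```python
# Hypothetical illustration of the four-level precedence described above:
# host tenant-specific value, then host ALL value, then the tenant's own
# value (tenant-rw only), then the hardcoded default. Not CockroachDB code.

def visible_value(setting_class, host_specific, host_all, tenant_set, default):
    layers = [host_specific, host_all]
    if setting_class == "tenant-rw":
        layers.append(tenant_set)  # tenant-set values count only for rw
    for v in layers:
        if v is not None:
            return v
    return default
```

For example, a tenant-ro setting with no host-side value shows the hardcoded default even if a stale tenant-written row exists, while a tenant-rw setting with a host override shows the override regardless of the tenant's own value.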


docs/RFCS/20211106_multitenant_cluster_settings.md, line 135 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

tenant-rw settings can also have an override and should also be marked differently?

I think it's a typo; it was meant to say tenant-rw. For tenant-ro, all settings are effectively overrides (but we can make that clear in the statement output too).


docs/RFCS/20211106_multitenant_cluster_settings.md, line 166 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

All-non-system-tenant values are stored in tenant_id=0.

Why are they not stored under different tenant_id?

This is referring to the ALL value (ALTER TENANT ALL SET CLUSTER SETTING). ALL values are stored in the table under tenant_id=0.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 199 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

This should be system or tenant-ro, no? Since those are the ones non-system tenants can no longer modify

Yes, should be tenant-ro.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 217 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

Will system settings need to be controlled per tenant? I thought they only applied to the host cluster

Right, should have been tenant-ro, it has been fixed since (by #73937).

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jan 10, 2022
This change implements the `system-only` semantics in the RFC
(cockroachdb#73349).

All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.

Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.
craig bot pushed a commit that referenced this pull request Jan 10, 2022
74253: setting: prevent use of SystemOnly settings from tenant code r=RaduBerinde a=RaduBerinde

#### setting: clean up slot indexes

The internal setting code passes around slot indexes as bare integers,
and there are confusingly two variants: one is 1-indexed, another is
0-indexed.

This change makes all slot indexes 0-indexed and adds a `slotIdx`
type.

Note: the 1-indexed choice was perhaps useful at the time to help find
code paths where the slot index is not initialized, but now the slot
index is set along with other mandatory fields by `init()`.

Release note: None

#### setting: prevent use of SystemOnly settings from tenant code

This change implements the `system-only` semantics in the RFC
(#73349).

All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.

Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.


74575: build: fix Pebble nightly benchmarks r=nicktrav a=jbowens

The Pebble nightly benchmarks stopped running due to a build error
stemming from the removal of the generated code from the codebase.

Release note: None

74598: scpb and scplan refactor r=postamar a=postamar

This change contains two refactoring commits:
1. `scpb: move Node to screl, introduce scpb.TargetState`. This commit is best reviewed by looking at the changes in scpb first, and then in the protos, and then the rest. The most subsequent complexity is around the event logging and in the stage builder, which benefitted from a deeper rewrite. 
2. `scplan: make child packages internal, including scgraph and scgraphviz`. This just moves stuff around, although there are a few notable changes in `plan.go`.


74639: authors: add gtr to authors r=gtr a=gtr

authors: add gtr to authors

Release note: None

Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: Gerardo Torres <[email protected]>
RajivTS added a commit to RajivTS/cockroach that referenced this pull request Jan 17, 2022
Updated planProjectionOperators and planSelectionOperators to include case for tree.NotExpr

Replaced RunTests with RunTestsWithoutAllNullsInjection

Fixes for build & lint issues

Removed unused field

Addressing PR review comments

backupccl: use SpanGroup.Sub instead of CoveringMerge

Release note: none.

rowenc: reduce reliance on the entire TableDescriptor

This change reduces uses of the entire TableDescriptor.

Release note: None

descpb: add and use NoColumnID constant

This change adds a `descpb.NoColumnID` constant so we don't have to
use `descpb.ColumnID(encoding.NoColumnID)` everywhere.

Release note: None

rowenc: remove DatumAlloc.scratch

This commit removes the `DatumAlloc.scratch` field which is always
nil.

Release note: None

sql: move `DatumAlloc` to sem/tree

This commit moves `DatumAlloc` from `sql/rowenc` to `sql/sem/tree`.
This is a fairly low-level construct and is not used exclusively for
row encoding/decoding.

Release note: None

rowenc: move low-level key encoding functions to subpackage

The `rowenc` package contains a hefty mix of functions operating at
different conceptual levels - some use higher level constructs like
table and index descriptors, others are lower level utilities. The
naming of these functions doesn't help, for example `EncodeTableKey`
sounds like a top-level function but is actually a low level utility
that appends a single value to a key

This commit moves the lower level utilities for encoding datum values
into keys to the `rowenc/keyside` package.

Release note: None

rowenc: minor cleanup of array decoding code

The code around array decoding was confusing; we now have a single
`decodeArray` variant which is the inverse of `encodeArray`.

Release note: None

rowenc: move low-level value encoding functions to subpackage

This commit moves the lower level utilities for encoding datum values
into values to the `rowenc/valueside` package.

Release note: None

valueside: introduce ColumnIDDelta type

When encoding multiple values, we encode the differences between the
ColumnIDs. This difference is passed to `valueside.Encode`, but it is
passed as a `descpb.ColumnID`. This commit introduces a new type to
make this distinction more evident.

Release note: None

valueside: add Decoder helper

We have a few copies of the same logic of decoding multiple SQL values
from a Value, in particular in range feed handling code.

This commit adds a helper type that centralizes this logic, reducing
duplication.

Note that the query execution engines retain their own specialized
implementations.

Release note: None

sql: release BatchFlowCoordinator objects to pool

This commit adds `BatchFlowCoordinator` objects to their `vectorizedFlowCreator`'s
`releasables` slice, which ensures that their `Release` method is called and that
they are properly recycled.

Before this change, heap profiles and instrumentation both indicated that these
objects were not being recycled as intended. For example, heap profiles looked like:
```
      Type: alloc_objects
Time: Dec 30, 2021 at 2:10pm (EST)
Active filters:
   focus=Pool\).Get
Showing nodes accounting for 92189, 0.57% of 16223048 total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                             71505 79.69% |   github.com/cockroachdb/cockroach/pkg/sql/colflow.NewBatchFlowCoordinator /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:218
                                             10923 12.17% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.newLockTableGuardImpl /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table.go:452
                                              2048  2.28% |   github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree.(*Map).maybeInitialize /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree/map.go:112
                                              1170  1.30% |   github.com/cockroachdb/cockroach/pkg/sql/colexec.newMaterializerInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:179
                                              1030  1.15% |   github.com/cockroachdb/pebble.(*Iterator).Clone /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:1228
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/sql.MakeDistSQLReceiver /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:841
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/sql/colfetcher.NewColBatchScan /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colfetcher/colbatch_scan.go:223
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/sql/span.MakeBuilder /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/span/span_builder.go:61
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/storage.mvccGet /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:859
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/storage.mvccScanToBytes /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2357
                                               128  0.14% |   github.com/cockroachdb/pebble.(*DB).newIterInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:765
         0     0%     0%      89729  0.55%                | sync.(*Pool).Get /Users/nathan/Go/go/src/sync/pool.go:148
```
Notice that `sync.(*Pool).Get` is responsible for *0.55%* of heap allocations and
**79.69%** of these are from `colflow.NewBatchFlowCoordinator`.

After this change, that source of allocations goes away and we see the following
impact on micro-benchmarks:
```
name                    old time/op    new time/op    delta
KV/Scan/SQL/rows=1-10     95.1µs ± 7%    95.9µs ± 5%    ~     (p=0.579 n=10+10)
KV/Scan/SQL/rows=10-10     100µs ± 3%     103µs ±12%    ~     (p=0.829 n=8+10)

name                    old alloc/op   new alloc/op   delta
KV/Scan/SQL/rows=10-10    21.7kB ± 0%    21.5kB ± 0%  -0.76%  (p=0.000 n=10+10)
KV/Scan/SQL/rows=1-10     20.1kB ± 0%    19.9kB ± 0%  -0.70%  (p=0.000 n=10+9)

name                    old allocs/op  new allocs/op  delta
KV/Scan/SQL/rows=1-10        245 ± 0%       244 ± 0%  -0.41%  (p=0.000 n=10+10)
KV/Scan/SQL/rows=10-10       280 ± 0%       279 ± 0%  -0.36%  (p=0.001 n=8+9)
```
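
The fix is an instance of the standard `sync.Pool` recycle pattern: objects are only reused if `Release` actually runs. A minimal sketch under assumed, simplified names (the real `BatchFlowCoordinator` plumbing is more involved):

```go
package main

import (
	"fmt"
	"sync"
)

// coordinator stands in for colflow.BatchFlowCoordinator (simplified).
type coordinator struct{ id int }

var coordinatorPool = sync.Pool{
	New: func() interface{} { return &coordinator{} },
}

// newCoordinator returns a possibly-recycled object from the pool.
func newCoordinator() *coordinator {
	return coordinatorPool.Get().(*coordinator)
}

// Release resets the object and returns it to the pool. The bug fixed by
// this commit was, in effect, that Release was never reached, so every
// flow allocated a fresh object.
func (c *coordinator) Release() {
	*c = coordinator{}
	coordinatorPool.Put(c)
}

func main() {
	c := newCoordinator()
	c.id = 42
	c.Release() // without this call, the pool never sees the object again
	fmt.Println("released")
}
```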

sql/catalog: restore fast-path in FullIndexColumnIDs

This commit restores a [fast-path](https://github.com/cockroachdb/cockroach/commit/c9e116e586f24c5f3a831ac653f14fd03f588b93#diff-19625608f4a6e23e6fe0818f3a621e716615765cb338d18fe34b43f0a535f06dL140)
in `FullIndexColumnIDs` which was lost in c9e116e. The fast-path avoided
the allocation of a `ColumnID` slice and a `IndexDescriptor_Direction`
slice in `FullIndexColumnIDs` when given a unique index. In such cases,
these slices are already stored on the `IndexDescriptor`.

```
name                   old time/op    new time/op    delta
KV/Scan/SQL/rows=1-10    94.9µs ±10%    94.9µs ± 8%    ~     (p=0.739 n=10+10)

name                   old alloc/op   new alloc/op   delta
KV/Scan/SQL/rows=1-10    20.1kB ± 0%    20.1kB ± 1%    ~     (p=0.424 n=10+10)

name                   old allocs/op  new allocs/op  delta
KV/Scan/SQL/rows=1-10       245 ± 0%       241 ± 0%  -1.63%  (p=0.000 n=10+8)
```

kv: protect Replica's lastToReplica and lastFromReplica fields with raftMu

This commit moves the Replica's lastToReplica and lastFromReplica from
under the `Replica.mu` mutex to the `Replica.raftMu` mutex. These are
strictly Raft-specific pieces of state, so we don't need fine-grained
locking around them. As a reward, we don't need to grab the `Replica.mu`
exclusively (or at all) when setting the fields in
`Store.withReplicaForRequest`.

The locking in `setLastReplicaDescriptors` showed up in a mutex profile
under a write-heavy workload. It was responsible for **3.44%** of mutex
wait time. Grabbing the mutex was probably also slowing down request
processing, as the exclusive lock acquisition had to wait for read locks
to be dropped.

coldata: operate on Nulls value, not reference

This commit changes `col.Vec.SetNulls` to accept a `Nulls` struct by
value instead of by pointer. This lets us avoid a heap allocation on
each call to `Nulls.Or`.

Release note: None
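
Small structs passed and returned by value can stay on the caller's stack, whereas a pointer result typically escapes to the heap. A sketch of the signature change under simplified, assumed names (the real `coldata.Nulls` stores more state):

```go
package main

import "fmt"

// Nulls is a simplified stand-in for coldata.Nulls: a bitmap of null values.
type Nulls struct{ bits []byte }

// Or returns the union of two null bitmaps by value; the value result can
// be handed straight to SetNulls without the struct escaping to the heap.
func (n Nulls) Or(other Nulls) Nulls {
	res := Nulls{bits: make([]byte, len(n.bits))}
	for i := range n.bits {
		res.bits[i] = n.bits[i] | other.bits[i]
	}
	return res
}

// Vec mirrors the signature change: SetNulls now takes Nulls by value
// instead of *Nulls.
type Vec struct{ nulls Nulls }

func (v *Vec) SetNulls(n Nulls) { v.nulls = n }

func main() {
	a := Nulls{bits: []byte{0b0001}}
	b := Nulls{bits: []byte{0b0100}}
	var v Vec
	v.SetNulls(a.Or(b))
	fmt.Printf("%04b\n", v.nulls.bits[0]) // prints 0101
}
```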

roachtest: fix silly bug in tlp roachtest

I thought that we'd get some basic coverage of roachtests in unit tests
at low duration or something, but I guess not.

Release note: None

sql: support readonly default_with_oids var

This is just one less error when importing PGDUMP.

Release note (sql change): We now support `default_with_oids` which
only accepts being false.

importccl,jobs: truncate all job system errors

In #73303 we truncated the row data in this error message to prevent
problems with large row data preventing the job status from being
saved.

However, truncating the row means that our
"experimental_save_rejected" option does not work as expected. This
feature previously saved entire rows, but now it would save truncated
rows in some cases.

Now, we only truncate the error when producing the error
string. Further, we also add truncation to the job system to prevent
other parts of the system from saving too much data to the job system.

Release note: None

importccl: remove unused argument

Release note: None

sql/logictest: add diff support to expectation output

This adds support for generating a unified diff for expectation
mismatches. This can make it easier to spot the differences between
two results when the result contains many rows.

For example:

```
    ../../sql/logictest/testdata/logic_test/system_namespace:54: SELECT * FROM system.namespace
    expected:
        0   0   defaultdb                        50
        0   0   postgres                         51
        0   0   system                           1
        0   0   test                             52
        1   0   public                           29
        1   29  comments                         24
        1   29  database_role_settings           44
        1   29  descriptor                       3
        1   29  descriptor_id_seq                7
        1   29  eventlog                         12
        1   29  jobs                             15
        1   29  join_tokens                      41
        1   29  lease                            11
        1   29  locations                        21
        1   29  migrations                       40
        1   29  namespace                        30
        1   29  protected_ts_meta                31
        1   29  protected_ts_records             32
        1   29  rangelog                         13
        1   29  replication_constraint_stats     25
        1   29  replication_critical_localities  2
        1   29  replication_stats                27
        1   29  reports_meta                     28
        1   29  role_members                     23
        1   29  role_options                     33
        1   29  scheduled_jobs                   37
        1   29  settings                         6
        1   29  sql_instances                    46
        1   29  sqlliveness                      39
        1   29  statement_bundle_chunks          34
        1   29  statement_diagnostics            36
        1   29  statement_diagnostics_requests   35
        1   29  statement_statistics             42
        1   29  table_statistics                 20
        1   29  transaction_statistics           43
        1   29  ui                               14
        1   29  users                            4
        1   29  web_sessions                     19
        1   29  zones                            5
        50  0   public                           29
        51  0   public                           29
        52  0   public                           29

    but found (query options: "rowsort" -> ignore the following ordering of rows) :
        0   0   defaultdb                        50
        0   0   postgres                         51
        0   0   system                           1
        0   0   test                             52
        1   0   public                           29
        1   29  comments                         24
        1   29  database_role_settings           44
        1   29  descriptor                       3
        1   29  descriptor_id_seq                7
        1   29  eventlog                         12
        1   29  jobs                             15
        1   29  join_tokens                      41
        1   29  lease                            11
        1   29  locations                        21
        1   29  migrations                       40
        1   29  namespace                        30
        1   29  protected_ts_meta                31
        1   29  protected_ts_records             32
        1   29  rangelog                         13
        1   29  replication_constraint_stats     25
        1   29  replication_critical_localities  26
        1   29  replication_stats                27
        1   29  reports_meta                     28
        1   29  role_members                     23
        1   29  role_options                     33
        1   29  scheduled_jobs                   37
        1   29  settings                         6
        1   29  sql_instances                    46
        1   29  sqlliveness                      39
        1   29  statement_bundle_chunks          34
        1   29  statement_diagnostics            36
        1   29  statement_diagnostics_requests   35
        1   29  statement_statistics             42
        1   29  table_statistics                 20
        1   29  transaction_statistics           43
        1   29  ui                               14
        1   29  users                            4
        1   29  web_sessions                     19
        1   29  zones                            5
        50  0   public                           29
        51  0   public                           29
        52  0   public                           29

    Diff:
    --- Expected
    +++ Actual
    @@ -20,3 +20,3 @@
         1   29  replication_constraint_stats     25
    -    1   29  replication_critical_localities  2
    +    1   29  replication_critical_localities  26
         1   29  replication_stats                27
logic.go:3340:
    ../../sql/logictest/testdata/logic_test/system_namespace:100: error while processing
```

The diff output is only added when the `-show-diff` flag is provided
because in some cases it can produce more noise than it is worth.

The diff library used here is the same one already used by the testify
libraries we depend on.

Release note: None

teamcity-trigger: give `kvserver` package higher stress timeout

Closes https://github.com/cockroachdb/cockroach/issues/69519.

Release note: None

backupccl: allow columns of type array in EXPORT PARQUET

Previously, EXPORT PARQUET only supported CRDB relations whose columns were
scalars of type int, string, float, or boolean. This change allows columns of
type array whose values can be int, string, float, or boolean. Following the
CRDB ARRAY documentation, a value in an exported array can be NULL as well.

Informs: #67710

Release note (sql change): EXPORT PARQUET can export columns of type array

backupccl: remove old makeImportSpans

Release note: none.

backupccl: expand some comments on covering code

Release note: none.

security: make the bcrypt cost configurable

Release note (security update): For context, when configuring
passwords for SQL users, if the client presents the password in
cleartext via ALTER/CREATE USER/ROLE WITH PASSWORD, CockroachDB is
responsible for hashing this password before storing it.
By default, this hashing uses CockroachDB's bespoke `crdb-bcrypt`
algorithm, itself based off the standard Bcrypt algorithm.

The cost of this hashing function is now configurable via the new
cluster setting
`server.user_login.password_hashes.default_cost.crdb_bcrypt`.
Its default value is 10, which corresponds to an approximate
password check latency of 50-100ms on modern hardware.

This value should be increased over time to reflect improvements to
CPU performance: the latency should not become so small that it
becomes feasible to bruteforce passwords via repeated login attempts.

Future versions of CockroachDB will likely update the default accordingly.
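
The cost parameter is logarithmic: bcrypt performs 2^cost key-expansion iterations, so each increment of the setting doubles password-check latency. A quick arithmetic sketch (helper name is illustrative):

```go
package main

import "fmt"

// bcryptRounds returns the iteration count implied by a bcrypt cost
// parameter: 2^cost. Each +1 to the cost doubles hashing latency, which
// is why the default should track CPU performance over time.
func bcryptRounds(cost uint) uint64 {
	return 1 << cost
}

func main() {
	for _, cost := range []uint{10, 11, 12} {
		fmt.Printf("cost=%d rounds=%d\n", cost, bcryptRounds(cost))
	}
}
```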

security: make `server.user_login.min_password_length` visible in doc gen

This cluster setting was meant to be exported for visibility in
auto-generated docs (we've documented it before). This was an oversight.

Release note: None

authors: add Fenil Patel to authors

Release note: None

bench: force the index join in BenchmarkIndexJoin

Without using the index hint, we now choose to perform the full scan
over the primary index and put the filter on top of that rather than
performing a limited scan of the secondary followed by an index join.
I confirmed that this is the case on 21.1 and 21.2 binaries, so I'll
backport to the latter (in order to make the comparison between releases
sane).

Release note: None

dev: a few improvements to the script

* Add `set -e` so if the `dev` build fails, it doesn't try to run the
  binary anyway.
* Add a way to force recompilation of `dev` (useful for developers).

Release note: None

changefeedccl: shut down tenant before main server

Shut down the tenant server before stopping the main test server.
This eliminates some of the error messages we see in the test
output when the tenant attempts to connect to ranges which are no
longer accessible.

Release note: None

importccl: fix import pgdump target column bug

Previously, if a COPY FROM statement had fewer columns than
the CREATE TABLE schema defined in the dump file,
we would get a nil pointer exception. This is because
we were not filling the non-targeted columns with a NULL datum.
This change fixes that and aligns behaviour with how INSERT
handles non-targeted columns.

Release note (bug fix): IMPORT TABLE ... PGDUMP with a
COPY FROM statement in the dump file that has fewer target columns
than the CREATE TABLE schema definition no longer results in a
nil pointer exception.

roachtest: add new passing tests for pgjdbc nightly

Release note: None

go.mod: fix ordering

The first `require` is for direct dependencies. The second one for
indirect dependencies. There's no need for a third one.

Release note: None

Makefile: specify that code needs to be generated for `roach{test,prod}`

Release note: None

setting: clean up slot indexes

The internal setting code passes around slot indexes as bare integers,
and there are confusingly two variants: one is 1-indexed, another is
0-indexed.

This change makes all slot indexes 0-indexed and adds a `slotIdx`
type.

Note: the 1-indexed choice was perhaps useful at the time to help find
code paths where the slot index is not initialized, but now the slot
index is set along with other mandatory fields by `init()`.

Release note: None
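
A defined type for the index is the core of the cleanup: bare integers can no longer be silently used where a slot index is expected. A small sketch of the idea (struct and method names are illustrative):

```go
package main

import "fmt"

// slotIdx is a 0-indexed position in the settings registry's slot table.
// Using a defined type instead of a bare int makes accidental mixing with
// other integers a compile-time error.
type slotIdx int32

type registry struct{ slots []string }

func (r *registry) get(i slotIdx) string { return r.slots[i] }

func main() {
	r := registry{slots: []string{"s0", "s1", "s2"}}
	var j int32 = 1
	_ = j
	// r.get(j) would not compile: int32 is not slotIdx.
	fmt.Println(r.get(slotIdx(2)))
}
```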

setting: prevent use of SystemOnly settings from tenant code

This change implements the `system-only` semantics in the RFC
(#73349).

All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.

Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.

build: fix Pebble nightly benchmarks

The Pebble nightly benchmarks stopped running due to a build error
stemming from the removal of the generated code from the codebase.

Release note: None

scpb: move Node to screl, introduce scpb.TargetState

This refactoring commit moves scpb.Node to screl and groups the targets
into a new scpb.TargetState. Separating the quasi-constant target
elements and statuses from the elements' intermediate statuses
ultimately makes for cleaner code in the declarative schema changer.

Release note: None

scplan: make child packages internal, including scgraph and scgraphviz

This commit moves scgraph and scgraphviz under scplan, and makes all its
child packages internal.

Release note: None

authors: add gtr to authors

Release note: None

sql: format statements for the UI

Previously, statements sent to the UI were not formatted, making them difficult
to read. With this change, statements are now formatted using a builtin
function that prettifies statements (using existing pretty-printing logic).
Statements are formatted using pre-determined pretty-printing configuration
options mentioned in this issue: #71544.

Resolves: #71544

Release note (sql change): statements are now formatted prior to being sent to
the UI; this is done using a new builtin function that formats statements.

ui: added formatting to statements on the details pages

Previously, statements displayed on the statement/transaction/index details
pages were not formatted.  Formatting was added to allow for better readability
of statements on these detail pages.  Requested statements are now formatted on
the server using the existing pretty-printing logic. Statements returned from
the statement handler are now formatted.

Formatting is done via a new builtin function 'prettify_statement', a wrapper
function over the existing pretty-printing logic.

Resolves: #71544

Release note (ui change): added formatting to statements on the statement,
transaction and index details pages.

roachpb: InternalServer API for tenant settings

This commit introduces the API that the tenants will use to obtain and
listen for updates to tenant setting overrides. The API was designed to
allow for maximum flexibility on the server side so that it can be
implemented as a range feed without any extra state (if necessary).

Release note: None

schemachanger: fully qualify object names inside event log entries

Previously, the original SQL was displayed inside the declarative schema
changer's event logs, which was missing both redactions and full name resolution.
This was inadequate because the existing schema changer always fully resolved
missing names within its entries. To address this, this patch adds new metadata
that contains the fully resolved and redacted text.

Release note: None

schemachanger: fully resolve names inside the AST for event logs and errors

Previously, the declarative schema changer left the AST
untouched when resolving names. This was inadequate because
both event log entry generation and some generated error messages
expect the fully resolved names inside the AST. To address this,
this patch adds support for copying and altering the AST, including
support for adding annotations. Additionally, all resolved names
are now propagated back into the AST and extra validation is introduced
to make sure that no unresolved names are left.

Release note: None

schemachanger: add statement tag to event log entries

Previously, the tag field inside the event log entries
generated by the declarative schema changer was empty.
This was inadequate because the existing schema changer
always populated this field. To address this, this
patch now populates the statement tag field inside
the element metadata and in event log entries.

Release note: None

rfc: token-based authentication for SQL session revival

Release note: None

vendor: upgrade cockroachdb/apd to v3

This commit picks up the following changes to `cockroachdb/apd`:
- https://github.com/cockroachdb/apd/pull/103
- https://github.com/cockroachdb/apd/pull/104
- https://github.com/cockroachdb/apd/pull/107
- https://github.com/cockroachdb/apd/pull/108
- https://github.com/cockroachdb/apd/pull/109
- https://github.com/cockroachdb/apd/pull/110
- https://github.com/cockroachdb/apd/pull/111

Release note (performance improvement): The memory representation of
DECIMAL datums has been optimized to save space, avoid heap allocations,
and eliminate indirection. This increases the speed of DECIMAL arithmetic
and aggregation by up to 20% on large data sets.

sql/sem: remove EvalContext.getTmpDec

`apd.Decimal` can now be entirely stack allocated during arithmetic, so
there's no longer any need for this.

With https://github.com/cockroachdb/apd/pull/104, this does not introduce
any new heap allocations:
```
➜ (cd pkg/sql/sem/tree && goescape . | grep moved | wc -l)
     328
```

scripts: add sgrep

@aliher1911 showed me this handy awk trick to filter goroutines [here].

I find myself reaching for this frequently, but it always takes me a
minute or two to get it right. This helper will make it a lot easier
and will perhaps enable many others to reap the benefits as well.

[here]: https://github.com/cockroachdb/cockroach/pull/66761#pullrequestreview-691529838

Release note: None

sqlproxyccl: better test output

This moves the logging output to files and enables inspection of the
authentication results in logs crdb-side.

Release note: None

storage/metamorphic: update clearRangeOp to not conflict with in-flight writes

This change updates clearRangeOp to only run on key spans
that do not have any transactional writes in-progress. This more
closely matches behaviour in KV and avoids cases where we'd trample on
intents and other in-flight keys. Also makes a similar adjustment
to mvccClearTimeRangeOp.

Fixes #72476. Speculatively. Reproduction was nearly impossible.

Release note: None.

Revert "roachpb: change `Lease.String()`/`SafeFormat()` to support refs."

This reverts commit 33fe9d15b2660cc40d255a3d448d88bb35c2c39d.

storage,roachpb: Add Key field to WriteTooOldError

Currently, WriteTooOldErrors that happen as part of a
ranged put operation (e.g. MVCCClearTimeRange) can be very
opaque, as we don't know which key we tried to "write too old"
on. This change addresses that by adding a Key field to WriteTooOldError
to store one of the keys the error pertains to.

Informs #72476.

Release note: None.
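
The shape of the change is a contextual field on the error type whose message includes the key when it is set. A simplified sketch (the real roachpb.WriteTooOldError is a protobuf with timestamps and a different message format):

```go
package main

import "fmt"

// writeTooOldError is a simplified stand-in for roachpb.WriteTooOldError.
// The commit adds a Key field so that ranged operations can report which
// key the error pertains to (the message format here is illustrative).
type writeTooOldError struct {
	Key []byte
}

func (e *writeTooOldError) Error() string {
	if len(e.Key) > 0 {
		return fmt.Sprintf("WriteTooOldError on key %q", e.Key)
	}
	return "WriteTooOldError"
}

func main() {
	err := &writeTooOldError{Key: []byte("a")}
	fmt.Println(err.Error())
}
```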

dev: enumerate explicit list of all generated `.go` files when hoisting

We should be able to `find _bazel/bin/go_path -type f -name '*.go'` and
list the generated files that way, but bazelbuild/rules_go#3041 is in
the way and I can't figure out how to resolve it. Instead we have to do
things the hard way: run `bazel aquery` and parse out the paths to all
the generated files. In this way we avoid accidentally hoisting out
stale symlinks from a previous build. If bazelbuild/rules_go#3041 is
resolved upstream then this change can be reverted.

Release note: None

sql: public schema long running migration

Release note: None

sql: insert missing public schema namespace entry

When restoring a database, a namespace entry for the public
schema was not created.

Release note: None

authors: add msirek to authors

Release note: None

sql: Add hints to CREATE/ALTER table errors for MR

Add some hints when trying to create (or alter to) a multi-region table which
indicate that if the database is not multi-region enabled, it should be
converted to a multi-region enabled database via a "ALTER DATABASE ... SET
PRIMARY REGION <region>" statement.

Release note: None

catalog: add Index*Columns methods to table descriptor

These methods return the columns referenced in indexes as
[]catalog.Column slices.

Release note: None

catalog: add Index*ColumnDirections methods to table descriptor

This complements the previous commit which added Index*Column methods to
catalog.TableDescriptor.

Release note: None

catalog: remove FullIndexColumnIDs

Calls to this function can now be replaced with recently-added
IndexFullColumns and IndexFullColumnDirections method calls on the table
descriptor.

Release note: None

tabledesc: improve memory-efficiency of Index*Column* methods

This commit removes IndexCompositeColumns (which wasn't actually used
and is unlikely to ever be) and generates the indexColumnCache with fewer
memory allocations. The previous scheme caused
a statistically-significant performance regression.

Release note: None

sql: allow the 1PC optimization in some cases in the extended protocol

A previous commit (7e2cbf51869fc326974a5665db80da8b29422631) fixed our
pgwire implementation so that it does not auto-commit a statement
executed in the extended protocol until a Sync message is received. That
change also had the undesired effect of disabling the 1PC ("insert fast
path") optimization that is critical for write-heavy workloads.

With this current commit, the 1PC optimization is allowed again, as long
as the statement execution is immediately followed by a Sync message.
This still has the correct bugfix semantics, but allows the optimization
for the common case of how the extended protocol is used.

No release note since this only affects unreleased versions.

Release note: None
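
The rule can be summarized as: auto-commit (the 1PC fast path) is permitted only when the statement's execution is immediately followed by a Sync. A toy sketch of that decision, with message names from the Postgres wire protocol and the logic heavily simplified:

```go
package main

import "fmt"

// canAutoCommit sketches the rule from this commit: in the extended
// protocol, a statement may take the one-phase-commit fast path only if
// the next buffered client message is Sync; otherwise the commit must be
// deferred until Sync arrives.
func canAutoCommit(nextMsg string) bool {
	return nextMsg == "Sync"
}

func main() {
	fmt.Println(canAutoCommit("Sync")) // fast path allowed
	fmt.Println(canAutoCommit("Bind")) // must wait for Sync
}
```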

execgen: remove temporary decimals from the helper

This commit removes two temporary decimals from `execgen.OverloadHelper`
since after the upgrade of the decimal library it can use
stack-allocated temporary objects without them escaping to the heap.

Release note: None

colexec: batch allocate datums for projectTupleOp

This has a profound impact on the amount of garbage generated by the delivery
query in TPC-C.

```
name                    old time/op    new time/op    delta
TPCC/mix=delivery=1-16    38.0ms ± 2%    35.8ms ± 1%   -5.76%  (p=0.000 n=9+8)

name                    old alloc/op   new alloc/op   delta
TPCC/mix=delivery=1-16    8.17MB ± 1%    7.97MB ± 1%   -2.36%  (p=0.000 n=9+10)

name                    old allocs/op  new allocs/op  delta
TPCC/mix=delivery=1-16     80.2k ± 0%     20.3k ± 1%  -74.65%  (p=0.000 n=10+9)
```

Leveraged https://github.com/cockroachdb/cockroach/pull/74443 to find this.

Release note: None

kvserver/loqrecovery: check full key coverage in quorum recovery

Previously, when doing unsafe replica recovery, if some ranges were
missing or represented by stale replicas that were split or merged,
recovery would change the cluster to an inconsistent state with gaps or
overlaps in the keyspace.
This change adds checks for range completeness and adds a
preference for replicas with a higher range applied index.

Release note: None

sql: fix enum hydration in distsql expression evaluation

Fixes: #74442

Previously, in some circumstances we could fail to hydrate enum types
used in join predicate expressions and possibly other situations. Now
types used in ExprHelper are always hydrated during the Init phase when a
distsql type resolver is being used. Also add a test case for the lookup
semi join repro case.

Release note (bug fix): Fix panics possible in some distributed queries
using enums in join predicates.

backupccl: add libgeos to BUILD.bazel

This is required when using randgen so that libgeos can be initialized
when a geometric type is generated.

Fixes #73895

Release note: None

sql: refactor pg_builtin to use actual grant options

Refactor builtins.priv -> privilege.Privilege.

Replace privilege.Kind with privilege.Privilege in functions that need
access to privilege.Privilege.GrantOption.

Release note: None

sql: refactor pg_builtin to use actual grant options

The builtins has_table_privilege, has_column_privilege,
has_any_column_privilege now use privileges.Priv.GrantOption instead
of privileges.Kind.GRANT.

Release note: None

dev: introduce common benchmarking flags

Specifically:
  --bench-mem (to print out memory allocations);
  --bench-time (to control how long each benchmark is run for);
  --count (how many times to run it, also added to `dev test`);
  -v and --show-logs (similar to `dev test`)

We also add support for args-after-double-dash-are-for-bazel within
`dev bench`. This commit is light on testing (read: there isn't any), so
it doesn't bump DEV_VERSION to roll it out to everyone just yet.

Release note: None

execgen: skip encoding/decoding JSON when hashing it

Previously, in order to hash a JSON object we would encode and decode it
(due to the behavior of `coldata.JSONs.Get` and the setup of the
execgen). This commit refactors how the hash functions are handled in
the execgen which allows us to peek inside of `coldata.Bytes` (that is
under the `coldata.JSONs`) in order to get direct access to the
underlying `[]byte`. This should be a minor performance improvement but
also allows us to remove the need for the overload helper in the hashing
context.

It is possible (although I'm not certain) that the hash computation is
now different, so this commit bumps the DistSQL versions to be safe.

Release note: None

colexechash: fix BCEs in some cases

Previously, we were not getting (and not asserting) some of the bounds
check eliminations in `rehash` function because we were accessing
`.Sliceable` property in the wrong context. This is now fixed (by
accessing it via `.Global`) as well as assertions are added correctly.
Additionally, this commit pulls out the call to the cancel checker out
of the templated block to reduce the amount of code duplication.

Release note: None
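
Bounds-check elimination in Go depends on the compiler proving an index is in range. A common pattern, roughly what templated code like this relies on, is a single up-front slice access that covers the whole loop (the function here is a generic illustration, not the `rehash` code):

```go
package main

import "fmt"

// sumWithHint sums a slice after a single up-front bounds check. The
// `_ = b[n-1]` access lets the compiler prove that indexes 0..n-1 are in
// range, eliminating per-iteration bounds checks in the loop body.
func sumWithHint(b []int) int {
	n := len(b)
	if n == 0 {
		return 0
	}
	_ = b[n-1] // bounds-check elimination hint
	s := 0
	for i := 0; i < n; i++ {
		s += b[i]
	}
	return s
}

func main() {
	fmt.Println(sumWithHint([]int{1, 2, 3}))
}
```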

execinfra: prefer leased table descriptors

This commit makes better use of the table descriptor protos in the
DistSQL specs by first checking if they've already been leased. If so,
then we use those instead of re-generating catalog.TableDescriptor
instances.

This has a statistically-significant impact on memory allocations, as
illustrated with this microbenchmark which I ran on my development
machine:

    name                    old time/op    new time/op    delta
    KV/Scan/SQL/rows=1-16      190µs ± 7%     184µs ± 4%   -3.60%  (p=0.016 n=10+8)
    KV/Scan/SQL/rows=10-16     196µs ± 4%     198µs ±12%     ~     (p=0.762 n=8+10)

    name                    old alloc/op   new alloc/op   delta
    KV/Scan/SQL/rows=1-16     19.5kB ± 1%    17.4kB ± 1%  -11.12%  (p=0.000 n=9+10)
    KV/Scan/SQL/rows=10-16    21.0kB ± 1%    18.9kB ± 1%  -10.20%  (p=0.000 n=10+10)

    name                    old allocs/op  new allocs/op  delta
    KV/Scan/SQL/rows=1-16        222 ± 0%       210 ± 1%   -5.59%  (p=0.000 n=7+10)
    KV/Scan/SQL/rows=10-16       256 ± 0%       244 ± 0%   -4.84%  (p=0.000 n=10+7)

This change opens us to the possibility of no longer shipping the whole
table descriptor proto in the DistSQL spec.

Release note: None

sql: add missing error check in GetHydratedZoneConfigForNamedZone

Closes #74606

Release note: None

sql: rename pg_cast_provolatile_dump.csv to pg_cast_dump.csv

Release note: None

sql: remove castInfo

`castMap` now contains all volatility information and `castInfo` is no
longer needed. `castInfo` has been removed.

For backwards compatibility, some invalid casts between VARBIT and
integer types have been added to `castMap`. These casts have unexpected
behavior before and after this commit. They are not supported by
Postgres. They should be disallowed in the future.
See #74577 and #74580.

Casts between tuple and string types are now dynamically handled in
`lookupCast` to support casts with named record types that have different
OIDs than `oid.T_record`.

Casts from OID and REG* types to INT2 are now allowed to maintain
backward compatibility.

Release note: None

sql: update pg_cast_dump.csv

This commit updates the `pg_cast_dump.csv` file with new rows for casts
from the JSONB type (OID 3802). It also adds the Postgres version as a
column to annotate the version of Postgres that the CSV was generated
from. This is important because the `pg_cast` table can change from
version to version. This CSV was generated from Postgres version 13.5.

Release note: None

sql: consistently order pg_cast_dump.csv

Release note: None

sql: check cast contexts against Postgres's pg_cast table

This commit includes the `pg_cast.castcontext` column in
`pg_cast_dump.csv` and uses it to validate the `maxContext` field in
`castMap`. Note that Postgres's `pg_cast` table does not include all
possible casts, such as automatic I/O conversions, so this test is not
comprehensive.

Release note: None

ci: bazelize nightly pebble ycsb, write-throughput benchmarks

We still have to take care of the metamorphic nightly.

Part of #67335.

Release note: None

authors: add bardin to authors

Release note: None

opt: fix corner case with lookup joins and provided orderings

This commit fixes a case where the lookup join was passing through a
provided ordering with unnecessary columns. This was caused by
imperfect FDs at the join level, such that the ordering could not be
simplified at the join level but could be simplified at the level of
its input.

Note that the case causes an internal error in test builds but there
are no known examples of user-visible problems in non-test builds
(hence no release note).

Fixes #73968.

Release note: None

rpc: fix span use-after-Finish in internalClientAdapter

The internalClientAdapter performs some local "RPCs" by directly calling
the server method, without going through gRPC. The streaming RPC calls
are made on a different goroutine, and these goroutines were using the
caller's tracing span. These goroutines could outlive the caller's span,
resulting in a use-after-Finish crash. This patch fixes them by creating
dedicated RPC spans, mimicking what our gRPC interceptors do.

Fixes #74326

Release note: None

vendor: Bump pebble to 3d0ff924d13a3d5fdf6e56a391c5c178c18ff196

Changes pulled in:

```
3d0ff924d13a3d5fdf6e56a391c5c178c18ff196 *: Add trySeekUsingNext optimization to SeekGE
0c503048eb0365981929177c30178add8a56ae3e sstable: add (*sstable.Writer).RangeKey{Set,Unset,Delete} methods
fe52b49cc28df62dce9b00c382a5ce217936be56 tool/logs: aggregate compaction logs by node and store ID
8ab4358bc59dfa62e5e34e4b0e5ce81a68f5fe91 sstable: return err from `(BlockPropertyCollector).FinishTable`
91c18ef0ee999980c2869d11e5ce468410acbe8d internal/keyspan: add FragmentIterator interface
953fdb078ff0585489206ae96e1d80ca9f6f90c7 internal/keyspan: implement SetBounds on Iter
aa376a819bf67cd6766ee827feed4bf0bd508f1f tool: add compaction log event aggregation tool
```

Release note: None.

storage: Use optimized SeekGE in CheckSSTConflicts

This nearly reverts #73514 by moving back to calling SeekGE on
the engine to skip past any empty spans on either the engine
or the SSTable. This is the more optimal approach on average,
and given optimizations in cockroachdb/pebble#1412 which this
change depends on, it also ends up performing better than
a SeekPrefixGE-driven approach and the pre-#73514 approach.

Improvement when running BenchmarkCheckSSTConflicts
against the pre-#73514 revision (vs. this one):

```
name                                                                      old time/op  new time/op  delta
CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=false-24      72.6µs ±11%  66.3µs ± 1%   -8.67%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=true-24       12.2ms ± 1%   1.7ms ± 1%  -86.41%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=false-24     69.8µs ± 2%  67.4µs ± 1%   -3.48%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=true-24      13.3ms ± 3%   2.8ms ± 1%  -78.97%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=false-24     75.8µs ± 3%  63.8µs ± 1%  -15.86%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=true-24      13.0ms ± 1%   1.9ms ± 1%  -85.11%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=false-24    69.8µs ±11%  64.6µs ± 1%   -7.45%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=true-24     14.8ms ± 9%   3.1ms ± 2%  -79.05%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=false-24     66.1µs ± 2%  63.7µs ± 1%   -3.65%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=true-24      14.2ms ± 9%   1.9ms ± 1%  -86.55%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=false-24    72.3µs ±10%  64.5µs ± 0%  -10.77%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=true-24      122ms ± 2%    17ms ± 1%  -86.03%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=false-24    69.0µs ± 9%  62.4µs ± 1%   -9.57%  (p=0.032 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=true-24     14.0ms ± 1%   2.3ms ± 2%  -83.46%  (p=0.016 n=4+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=false-24   69.4µs ± 9%  62.7µs ± 1%   -9.63%  (p=0.016 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=true-24     140ms ± 5%    26ms ± 1%  -81.70%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=false-24    69.2µs ±10%  62.5µs ± 1%   -9.66%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=true-24     15.3ms ±11%   2.3ms ± 1%  -85.21%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=false-24   69.7µs ±12%  63.6µs ± 1%     ~     (p=0.095 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=true-24     148ms ± 6%    28ms ± 2%  -80.90%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=false-24   67.1µs ±10%  61.1µs ± 2%   -8.93%  (p=0.016 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=true-24    14.4ms ± 2%   2.5ms ± 5%  -82.45%  (p=0.016 n=4+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=false-24  68.9µs ±21%  62.2µs ± 1%   -9.76%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=true-24    204ms ±14%    42ms ± 5%  -79.44%  (p=0.008 n=5+5)
```

Fixes #66410.

Release note: None.

execinfrapb: add a helper for index joins based on the JoinReaderSpec

Release note: None

rowexec: refactor the joinReader to not exceed the batch size

The joinReader operates by buffering the input rows until a certain size
limit (which is dependent on the strategy) is reached. Previously, the
buffering would stop only after the size limit was reached or exceeded;
this commit refactors the code to not exceed the limit except in the case
of a single large row. This is what we already do for vectorized index
joins.

Release note: None
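The buffering policy described above can be sketched in a few lines. This is a simplified model (rows are represented only by their byte sizes; `bufferInput` is a hypothetical name, not the actual joinReader code):

```go
package main

import "fmt"

// bufferInput admits rows while staying within limit, stopping *before*
// the limit would be exceeded -- except that a single row larger than the
// whole limit is still admitted on its own so progress is always made.
func bufferInput(rowSizes []int64, limit int64) []int64 {
	var buffered []int64
	var used int64
	for _, size := range rowSizes {
		if used+size > limit {
			if len(buffered) == 0 {
				// Degenerate case: one row exceeds the limit by itself.
				return append(buffered, size)
			}
			break
		}
		used += size
		buffered = append(buffered, size)
	}
	return buffered
}

func main() {
	fmt.Println(bufferInput([]int64{4, 4, 4}, 10)) // [4 4]: third row would exceed the limit
	fmt.Println(bufferInput([]int64{20, 1}, 10))   // [20]: single large row admitted alone
}
```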

sql,kv: introduce Streamer API and use it for index joins in some cases

This commit introduces the Streamer API (see
https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210617_index_lookups_memory_limits.md)
as well as its implementation for the simplest case - when requests are
unique and can be served in any order. It additionally hooks up the
implementation to be used by the index joins in both execution engines.

There are three main pieces that this commit adds:
1. the `Streamer` struct itself. It is largely the same as described in
the RFC. Some notable changes are:
- `Cancel` is renamed to `Close` and is made blocking to ensure that all
goroutines kicked off by the `Streamer` exit before `Close` returns.
- `Shrink` has been removed from the `budget` struct (see below for
more details).
- furthermore, the `budget` interface has been unexported and the
`streamer` has been tightly coupled with the `budget`'s implementation.
- the TODO about collecting DistSQL metadata is removed because we are
collecting the LeafTxnFinalState already when using the LeafTxn.
2. the limited `Streamer` implementation - only `OutOfOrder` mode is
supported when the requests are unique. Notably, buffering and caching
of the results is not implemented yet.
3. `TxnKVStreamer` component that sits below the SQL fetchers, uses the
`Streamer`, and is an adapter from `BatchResponse`s to key/value pairs
that fetchers understand. Although not used at the moment,
`TxnKVStreamer` is written under the assumption that a single result can
satisfy multiple requests.

The memory budget of the `Streamer` is utilized lazily. The RFC was
calling for making a large reservation upfront and then shrinking the
budget if we see that we don't need that large reservation; however,
while working on this I realized that lazy reservations are a better fit
for this. The `Streamer` can reserve up to the specified limit
(determined by `distsql_workmem` variable) against the root monitor (in
the degenerate case of a single large row more memory will be reserved).
The reservation then never shrinks under the assumption that if the
reservation has gotten large, it means it was needed for higher
concurrency (or large responses), and it is likely to be needed for the
same reasons in the future.

The layout of the main components of the `Streamer` implementation:
- in `Enqueue` we have a logic similar to what DistSender does in order
to split each request (that might span multiple ranges) into
single-range requests. Those sub-requests are batched together to be
evaluated by a single `BatchRequest`.
- `workerCoordinator.mainLoop` is responsible for taking single-range
batches, estimating the corresponding response size, and issuing
requests to be evaluated in parallel while adhering to the provided
memory budget.
- `workerCoordinator.performRequestAsync` is responsible for populating
the `BatchRequest` and then processing the results while updating the
memory budget.

Current known limitations that will be addressed in the follow-up work:
- at the moment a single row can be split across multiple BatchResponses
when TargetBytes limit is reached when the table has multiple column
families; therefore, we use the streamer only for single column family
cases. We will expand the KV API shortly to not split the rows in
multiple column family cases.
- manual refresh of spans when `ReadWithinUncertaintyIntervalError` is
encountered by a single streamer in a single flow is not implemented. It
is an optimization that is considered a must for the final
implementation in order to not regress on simple cases in terms of
retriable errors. This will be implemented shortly as a follow-up.
- I'm thinking that eventually we probably want to disable the batch
splitting done by the DistSender to eliminate unnecessary blocking when
the streamer's splitting was incorrect. This would give us some
performance improvements in face of range boundary changes, but it
doesn't seem important enough for the initial implementation.

Release note: None
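The lazy-reservation behavior of the budget described above can be modeled compactly. This is a hypothetical simplification, not the kvstreamer `budget` type: it grows its reservation on demand up to the limit and never shrinks it, on the theory that a high-water mark reached once is likely to be needed again:

```go
package main

import (
	"errors"
	"fmt"
)

// budgetSketch models lazy reservations: consume() grows the high-water
// reservation as needed, release() frees usage but leaves the reservation
// at its peak (there is no Shrink, matching the commit's design change).
type budgetSketch struct {
	limit    int64
	reserved int64 // high-water reservation; never decreases
	inUse    int64
}

func (b *budgetSketch) consume(n int64) error {
	if b.inUse+n > b.limit {
		return errors.New("budget exceeded")
	}
	b.inUse += n
	if b.inUse > b.reserved {
		b.reserved = b.inUse // grow reservation lazily
	}
	return nil
}

func (b *budgetSketch) release(n int64) {
	b.inUse -= n // reservation stays at its high-water mark
}

func main() {
	b := &budgetSketch{limit: 100}
	_ = b.consume(60)
	b.release(60)
	fmt.Println(b.reserved, b.inUse) // 60 0: the reservation survives release
	fmt.Println(b.consume(200))      // budget exceeded
}
```

In the real implementation the degenerate single-large-row case can push the reservation past the `distsql_workmem` limit; that wrinkle is omitted here.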

sql: check equivalent constraint when creating hash index

Fixes #68031

Previously, we only tried to create the constraint for a shard column if
the column was newly created. We check for a duplicate constraint on the
shard column during `ALTER PRIMARY KEY` and `CREATE INDEX`, but the
check was simply a name check. This PR adds logic to check for an
equivalent constraint by checking whether the formatted expression
strings are the same. With this logic we can try to create the
constraint whether or not the shard column is newly created.
With this fix, we also don't need to expose the constraint through the
`SHOW CREATE TABLE` result, since we make sure the constraint is
created, or skipped if one already exists.
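The equivalence check described above compares formatted expression strings rather than constraint names. A minimal sketch of that idea (the real code formats parsed SQL expressions; the trivial whitespace/case normalization and the function names here are illustrative only):

```go
package main

import (
	"fmt"
	"strings"
)

// normalize stands in for formatting a parsed expression: it produces a
// canonical string so that textually different but equivalent expressions
// compare equal.
func normalize(expr string) string {
	return strings.Join(strings.Fields(strings.ToLower(expr)), " ")
}

// constraintExists reports whether an equivalent check constraint already
// exists, by comparing formatted expression strings instead of names.
func constraintExists(existing []string, newExpr string) bool {
	want := normalize(newExpr)
	for _, e := range existing {
		if normalize(e) == want {
			return true
		}
	}
	return false
}

func main() {
	existing := []string{"crdb_internal_id_shard_8 IN (0,1,2,3,4,5,6,7)"}
	fmt.Println(constraintExists(existing, "crdb_internal_id_shard_8 in (0,1,2,3,4,5,6,7)")) // true
	fmt.Println(constraintExists(existing, "crdb_internal_id_shard_16 IN (0,1)"))            // false
}
```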

Release note (sql change): Before this change, the check constraint
on the shard column used by a hash sharded index was printed in the
corresponding `SHOW CREATE TABLE`. The constraint had been shown
because cockroach lacked logic to ensure that shard columns which
are part of hash sharded indexes always have the check constraint
that the optimizer relies on to achieve properly optimized plans
on hash sharded indexes. We no longer display this constraint in
`SHOW CREATE TABLE`, as it is now implied by the `USING HASH` clause
on the relevant index.

colexec: clean up the usage of the binary overload helper

Previously, for projection and selection operators we would always pass
in `execgen.OverloadHelper`. Up until recently it served several
purposes, but now it is only used in order to pass down the binary
function (as well as the eval context) for the cases when we fallback
to the row-by-row computation.

This commit takes advantage of this observation and cleans up the
situation: now the helper is only passed when it is needed which allows
us to remove a lot of redundant code. Additionally, the helper itself
has been renamed from `OverloadHelper` to `BinaryOverloadHelper`.

Release note: None

build: Add PATH to .bazelrc for dev builds.

Release note: none

opt: intern provided physical properties

This commit changes the provided physical properties to be referenced
with a pointer inside bestProps, in preparation for adding more fields
to physical.Provided.

Release note: None

opt: add Distribution physical property and Distribute enforcer

This commit adds a new physical property called "Distribution" to
expressions in the optimizer. Currently, Distribution is just the
list of regions that are touched by the expression.

This commit also adds a new enforcer called "Distribute", which enforces
a particular Distribution on its input. This is similar to the way
Sort enforces a particular Ordering. Currently, Distribute is only
used to enforce that the results of a query end up in the same region
as the gateway node.

The Distribution property and Distribute enforcer will enable us to
perform better costing of query plans in future commits. This commit
adds a tiny cost for the Distribute enforcer, but otherwise does not
yet take advantage of distribution for costing. In the future, we can use
Distribution to better cost distributed joins and aggregations, and
accurately cost the Distribute enforcer so as to avoid scanning data
in remote regions if possible.

This commit represents a departure from the approach taken in the earlier
prototype in #43831. Instead of using a "neighborhood" abstraction to
represent arbitrary collections of nodes, this commit sticks to the
existing abstractions of region and locality. If we need a new abstraction
in the future, we should be able to simply modify the representation of
Distribution.

Additionally, this commit does not make an attempt to represent exactly
how data is partitioned across regions (e.g., which column is the
partitioning key, whether the data is hash or range partitioned, etc.);
it only tracks *which* regions contain data. This greatly simplifies the
implementation, and I don't think it will significantly reduce the
accuracy of costing.

Informs #47226

Release note: None
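The Distribution property described above is "just the list of regions" an expression touches, and the Distribute enforcer kicks in when a child's provided distribution differs from the required one (e.g. the gateway's region). A minimal sketch under that reading -- these are not the actual optimizer types:

```go
package main

import (
	"fmt"
	"sort"
)

// distribution models the new physical property: a sorted set of region
// names touched by an expression.
type distribution []string

// union combines the distributions of two inputs (e.g. for a join).
func union(a, b distribution) distribution {
	seen := map[string]struct{}{}
	for _, r := range append(append(distribution{}, a...), b...) {
		seen[r] = struct{}{}
	}
	out := make(distribution, 0, len(seen))
	for r := range seen {
		out = append(out, r)
	}
	sort.Strings(out)
	return out
}

// needsDistribute reports whether a Distribute enforcer is required to
// coerce the provided distribution into the required one, analogous to how
// a Sort enforcer coerces an ordering.
func needsDistribute(provided, required distribution) bool {
	if len(provided) != len(required) {
		return true
	}
	for i := range provided {
		if provided[i] != required[i] {
			return true
		}
	}
	return false
}

func main() {
	join := union(distribution{"us-east1"}, distribution{"europe-west1"})
	fmt.Println(join)                                            // [europe-west1 us-east1]
	fmt.Println(needsDistribute(join, distribution{"us-east1"})) // true: results must move to the gateway region
}
```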

cli: add mt cert create-tenant-signing command

This key/cert will be used to generate session revival tokens.

No release note since this is only intended to be used internally.

Release note: None

sql/sem/catid: make a low-level package which defines ID types

I'm not sure where exactly in the tree this belongs.

Release note: None

sql/schemachanger/scpb: adopt catid

Release note: None

sql/catalog/catpb: add package below scpb and descpb

This doesn't go all the way to being principled about what is in catpb
and what is in descpb, but it pulls the relevant symbols referenced in
scpb down into the new package.

Release note: None

kvserver: fix bug causing spurious rebalance attempts by the store rebalancer

When rebalancing a range, the store rebalancer also tries to move the lease to
the voter with the lowest QPS. However, previously, the store rebalancer was
doing this even when it found no better replacement targets for the existing
stores. This meant that we were redundantly transferring leases over to the
coldest voter, even when we weren't rebalancing the range. This was likely
contributing to the thrashing observed in
https://github.com/cockroachdb/cockroach/issues/69817.

Release note (bug fix): A bug that could previously cause redundant lease
transfers has now been fixed.

schemachanger: delete comments when dropping schemas

Previously, the declarative schema changer did not clean up
comments for schema objects when dropping them. This was
inadequate because the system.comment table would be left
with rows for dropped schema objects when using
the declarative schema changer, unlike the legacy schema
changer. To address this, this patch removes rows
from the system.comment table for schema objects that
are dropped.

Release note: None

sql: adopt CommentUpdater from declarative schema changer.

Previously, there was duplicate code for adding/deleting comments
on schema objects inside the legacy schema changer, so each
comment type had similar code for setting up
upsert/delete statements. This patch adopts the CommentUpdater
interface in the legacy schema changer so that the logic to
update/delete comments can be shared with the declarative
schema changer, and to simplify code in the legacy schema
changer.

Release note: None

kvserver: don't use `ClearRange` point deletes with estimated MVCC stats

`ClearRange` avoids dropping a Pebble range tombstone if the amount of
data that's deleted is small (<=512 KB), instead dropping point
deletions. It uses MVCC statistics to determine this. However, when
clearing an entire range, it will rely on the existing range MVCC stats
rather than computing them.

These range statistics can be highly inaccurate -- in some cases so
inaccurate that they even become negative. This in turn can cause
`ClearRange` to submit a huge write batch, which gets rejected by Raft
with `command too large`.

This patch avoids dropping point deletes if the statistics are estimated
(which is only the case when clearing an entire range). Alternatively,
it could do a full stats recomputation in this case, but entire range
deletions seem likely to be large and/or rare enough that dropping a
range tombstone is fine.

Release note (bug fix): Fixed a bug where deleting data via schema
changes (e.g. when dropping an index or table) could fail with a
"command too large" error.

backup: add memory monitor to manifest loading

Release note: none.

sql: add assignment casts for UPSERTs

Assignment casts are now added to query plans for upserts, including
`UPSERT`, `INSERT .. ON CONFLICT DO NOTHING`, and
`INSERT .. ON CONFLICT DO UPDATE ..` statements.

Assignment casts are a more general form of the logic for rounding
decimal values, so the use of `round_decimal_values` in mutations is no
longer needed. This logic has been removed.

Fixes #67083

There is no release note because the behavior of upserts should not
change with this commit.

Release note: None

sql: add logic tests for assignment casts of ON UPDATE expressions

Release note: None

sql: schema changer not to validate shard column constraint

Fixes #67613
The shard column constraint is created internally and should
be automatically upheld, so there is no need to verify it when
backfilling hash sharded indexes.

Release note: None

schemachanger: fix error messages for drop dependencies.

1) When a view depends on another view the message text generated was
   incorrect.
2) When a sequence is OWNED BY a table, it could be dropped even
   if there were other dependencies.
3) When RESTRICT was not specified DROP DATABASE would generate the wrong
   error.
4) If the database name is empty we would generate the wrong error.

Release note: None

sql: modify message for drop schema privilege errors

Previously, the legacy schema changer used a less clear message
when a DROP SCHEMA failed due to a privilege error. This
patch improves the generated message by using the
message from the declarative schema changer.
The new message has the form "must be owner
of schema <schema name>".

Release note: None

schemachanger: avoid leased descriptors for event logging

Previously, when fully resolving descriptor names, we would
fetch leased descriptors which could cause the schema change
transaction to get pushed out. This was inadequate because in
certain scenarios we would perpetually retry transactions,
since attempts to generate event log entries acquire leases
pushing out the schema change transaction. To address this,
this patch intentionally avoids leasing descriptors for event logging.

Release note: None

build: put regular file at bazel-out by default

Release note: none.

bulk,backup,import: add Write-at-now settings for RESTORE and IMPORT

This adds a new argument to BulkAdderOptions and MakeSSTBatcher to control
the WriteAtBatchTS field in the AddSSTableRequests it sends.

This flag is then optionally set in RESTORE and IMPORT based on the
new hidden cluster settings 'bulkio.restore_at_current_time.enabled' and
'bulkio.import_at_current_time.enabled' respectively.

These settings default to off for now, for testing usage only, as setting
this flag is known to _significantly_ reduce the performance of those
jobs until further work is done.

Release note: none.

sql/row: remove an allocation for Get responses

```
name                    old time/op    new time/op    delta
IndexJoin/Cockroach-24    8.02ms ± 9%    7.92ms ± 2%    ~     (p=1.000 n=9+10)

name                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24    1.68MB ± 1%    1.62MB ± 1%  -3.83%  (p=0.000 n=10+9)

name                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24     11.1k ± 1%     10.1k ± 0%  -8.85%  (p=0.000 n=9+9)
```

Release note: None

kvstreamer: refactor memory tokens to reduce allocations

```
name                    old time/op    new time/op    delta
IndexJoin/Cockroach-24    7.72ms ± 7%    7.51ms ± 3%   -2.75%  (p=0.001 n=10+9)

name                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24    1.61MB ± 1%    1.60MB ± 1%   -0.91%  (p=0.001 n=10+9)

name                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24     10.1k ± 1%      9.1k ± 1%  -10.24%  (p=0.000 n=10+10)
```

Release note: None

spanconfig: drop 'experimental_' suffix from settings

We're going to start to use this infrastructure for realsies soon.

Release note: None

spanconfig/testcluster: plumb in testing knobs for tenants

We were previously using an empty one with no control at the caller to
override specific values. This comes in handy when looking to control
knobs for all tenants in these tests.

Release note: None

migrationsccl: disable reconciliation job creation in tests

In a future commit we'll enable the span configs infrastructure by
default for all crdb unit tests. Doing so will surface the need to have
the reconciliation job disabled for the specific migration tests that
assert on the contents of `system.span_configurations` (also written to
by the reconciliation job/reconciler).

Release note: None

spanconfig/sqltranslator: deflake datadriven test

This is a follow up to #73531, there we forgot to update package tests
to also use consistent reads when looking up descriptors. Looking at the
surrounding commentary in these datadriven files and comparing against
the actual results, there was a mismatch (now no longer so).

Release note: None

sql: future-proof TestTruncatePreservesSplitPoints

This test previously relied on range split decisions to happen
near-instantaneously (especially for the single node variant). This was
a fair assumption with the gossiped SystemConfigSpan, but is no longer
true with the span configs infrastructure where
(i)  updated descriptors/zone configs make their way to
     `system.span_configurations` asynchronously, and
(ii) KV learns about `system.span_configurations` updates
     asynchronously.

We update the test to be agnostic to either subsystem (tl;dr - throw in
a SucceedsSoon block at the right places).

Release note: None

sql: future-proof TestScatterResponse

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that no longer
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of just expecting it.

Release note: None

kvserver: future-proof TestElectionAfterRestart

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.

Release note: None

multiregionccl: future-proof TestEnsureLocalReadsOnGlobalTables

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.

Release note: None

server: future-proof TestAdminAPIDatabaseDetails

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.

Release note: None

changefeedccl: future-proof TestChangefeedProtectedTimestamps

The low gc.ttlseconds in this test, which applies to
system.{descriptor,zones}, runs into errors introduced in #73086
when run with span configs enabled (in a future commit). The span configs
infrastructure makes use of rangefeeds against these tables within the
spanconfig.SQLWatcher process. These rangefeeds error out if the
timestamp they're started with is already GC-ed, something that's very
likely with low GC TTLs. To accommodate, we simply bump the TTL to a
more reasonable 100s.

Release note: None

kvfollowerreadsccl: future-proof TestBoundedStalenessDataDriven

This is Yet Another Test that made timing assumptions on how
instantaneously range config decisions are applied, assumptions that
don't hold under the span configs infrastructure. Adding compatibility
is a matter of waiting for the right configs to appear instead of only
expecting it.

Release note: None

changefeedccl: future-proof TestChangefeedBackfillCheckpoint

This testing knob was added in #68374, but I'm not sure that it was
necessary. Brief stress runs with and without this flag did not surface
anything. In a future commit where we enable span configs by default,
we'll actually rely on the reconciliation job running, so we get rid of
this flag now.

Release note: None

liveness: future-proof TestNodeLivenessStatusMap

Instead of using hooks that directly mutate the system config span,
using SQL statements to tweak zone configs future-proofs this test for
compatibility with the span configs infrastructure.

Release note: None

sql: migrate has_database_privilege from evalPrivilegeCheck to
ctx.Planner.HasPrivilege

refs https://github.com/cockroachdb/cockroach/issues/66173

HasPrivilege is required to support WITH GRANT OPTION

Release note: None

roachtest: fix sequelize nightly

Upstream changed how imports are done, so this library had to be
updated.

Release note: None

kvstreamer: fix potential deadlock

We have two different locks in the `Streamer` infrastructure: one is for
the `Streamer` itself and another one for the `budget`. Previously,
there was no contract about which mutex needs to be acquired first which
led to the deadlock detector thinking that there is a potential deadlock
situation. This is now fixed by requiring that the `budget`'s mutex is
acquired first and by releasing the `Streamer`'s mutex in `Enqueue`
early in order to not overlap with the interaction with the `budget`.

I believe that it was a false positive (i.e. the deadlock cannot
actually occur) because without the support of pipelining, `Enqueue`
calls and asynchronous requests evaluation never overlap in time.
Still, it's good to fix the order of mutex acquisitions.

Release note: None

sql: remove PHYSICAL scrub code

The PHYSICAL scrub code is experimental and not considered production
ready. It complicates a lot of code paths involved in normal query
execution (it significantly overloads the semantics of TableReader and
of the Fetcher) and is getting in the way of some improvements in how
the fetchers work. In particular, we are trying to reduce the amount
of information passed to TableReader/Fetcher (which in the
non-scrubbing case should be a lot less than the full table
descriptor).

There are some proposals for a better design floating around, e.g.
provide a facility for returning KVs as results from DistSQL and have
some higher-level code run the scrub checks.

This change removes the code for the PHYSICAL scrub for now.

Release note (sql change): the experimental SCRUB PHYSICAL is no
longer implemented.

vars: add placeholder session variable for xmloption

Release note (sql change): The `xmloption` session variable is now
accepted, only taking in `content`. Note this does not do anything.

clusterversion: improve a version comment

Gets rid of a squiggly line in Goland.

Release note: None

kvserver: plumb in a context into (*Store).GetConfReader

We'll use it in a future commit.

Release note: None

spanconfig/reconciler: export the checkpoint timestamp

We'll make use of it in a future commit.

Release note: None

spanconfig: get rid of ReconciliationDependencies interface

It was hollow, simply embedding the spanconfig.Reconciler interface. In
a future commit we end up relying on each pod's span con…
craig bot pushed a commit that referenced this pull request Feb 10, 2022
75890: backupccl: add more exportable crdb data types to exportParquet r=stevendanna a=msbutler

Previously, EXPORT PARQUET could only export tables with a small subset of CRDB
data types. In this PR, I add support for all CRDB data types that avro
changefeeds support. Specifically, data types in the OID and Void families are
excluded, as the former represents internal db objects and the latter cannot be
parsed. Further CRDB data types can be supported on an as-needed
basis.

This PR also adds a test that validates that EXPORT PARQUET properly encodes
data in arbitrary CRDB tables with supported data types.

Fixes  #67710

Release note (sql change): EXPORT PARQUET now supports all data types that avro
changefeeds support. Below is the data type conversion from CRDB to
Parquet.

To learn more about parquet data representation,
see https://github.com/apache/parquet-format

CRDB Type Family ->  Parquet Type | Parquet Logical Type
Bool -> boolean | nil
String -> byte array | string
Collated String -> byte array | string
INet -> byte array | string
JSON -> byte array | json
Int -> int64 | nil
Float -> float64 | nil
Decimal -> byte array | decimal (note: scale and precision data are preserved
in the parquet file)
Uuid -> fixed length byte array (16 bytes) | nil
Bytes -> byte array | nil
Bit -> byte array | nil
Enum -> byte array | Enum
Box2d -> byte array | string
Geography -> byte array | nil
Geometry -> byte array | nil
Date -> byte array | string
Time -> int64 | time (note: microseconds since midnight)
TimeTz -> byte array | string
Interval -> byte array | string (specifically represented as ISO8601)
Timestamp -> byte array | string
TimestampTz -> byte array | string
Array -> encoded as a repeated field and each array value gets encoded by
pattern described above.

75897: jobs: enable marking jobs as idle for metrics r=samiskin a=samiskin

In order to allow long-running / indefinite jobs to not continually
block a tenant from being treated as inactive in a serverless
environment, this change adds a MarkIdle API to the jobs registry which
increments or decrements a counter of currently idle jobs. An idle job
should be able to be shut down at any moment. Using the CurrentlyIdle
and CurrentlyRunning job counts, serverless can determine whether all
jobs are idle and the tenant can be safely shut down. For now this is
only kept in memory in the registry object, though we may transition to
persisting it in job state in the future.

Fixes #74747

Release note: None


76174: server: reduce allocations in setupSpanForIncomingRPC r=andreimatei a=andreimatei

A couple of variables were independently escaping because they were
captured by a closure. This patch groups them into a single allocation,
saving two allocations per call.

name                        old time/op    new time/op    delta
SetupSpanForIncomingRPC-32     508ns ± 2%     442ns ± 0%  -12.91%  (p=0.008 n=5+5)
name                        old alloc/op   new alloc/op   delta
SetupSpanForIncomingRPC-32      178B ± 0%      145B ± 0%  -18.54%  (p=0.008 n=5+5)
name                        old allocs/op  new allocs/op  delta
SetupSpanForIncomingRPC-32      4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.008 n=5+5)

Release note: None

76214: distsql: add plumbing for SQL pods to send RPCs to each other r=rharding6373 a=rharding6373

This PR adds initial support for distributed flows in a multi-tenant cluster
with multiple pods. It adds a test that assigns flows to multiple pods
and executes them. In order to support communication between SQL pods in
a multi-tenant environment, this PR adds a new kind of connector that
can provide address resolution for SQL instances, and plumbs a node
dialer that uses this new connector through distsql. If distsql is not
in multi-tenant mode, the same node dialer is used whether SQL instances
are communicating with other SQL instances or KV, as before this change.

Release note: None

76292: server: standardize error handling across all RPC endpoints r=jocrl a=knz

Arguably, there is still a lot of boilerplate in here, so a later
iteration of this work could consider handling errors at a higher
level.

However, before we can do this we want all the API handlers to handle
errors in a consistent manner, so that there's no surprise left during
the later refactoring.

This change also takes care of logging a few errors that were not
logged before.

Release note: None

76311: spanconfigsubscriber: pass ctx to KVSubscriber callbacks r=andreimatei a=andreimatei

One of these callbacks, in store.go, was capturing a ctx and using it
asynchronously. That was a tracing span use-after-finish. This patch
fixes it by plumbing contexts through and avoiding the capture.

Release note: None

76313: settings: introduce system.tenant_settings table r=ajwerner a=ajwerner

Part of implementing the multi-tenant cluster settings RFC #73349. I was planning to post it along with more of the implementation but I'm posting it earlier for visibility, in case we need to make decisions related to system table IDs.

Release note: None

76317: util/tracing: improve a comment r=andreimatei a=andreimatei

Hint that there's no race between dropping a span's lock when it's in
the middle of finishing and the parent inserting that span into the
registry.

Release note: None

76318: util/trace: fix span reset  r=andreimatei a=andreimatei

We were forgetting to reset a field.

Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@RaduBerinde (Member, Author) commented:

In this RFC we initially proposed a fourth class of setting, one that is system-scoped and has only one value, which tenants can read. We decided to drop it.

The main difference between this (let's call it `SystemVisible`) and `TenantReadOnly` is that if you change the setting, you have to also override it for tenants (using `ALTER TENANT ALL`).

This came up in a recent discussion about `kv.closed_timestamp.target_duration` and `kv.protectedts.reconciliation.interval`, where this class would make sense.

Should we try to add this new class or not? And if we do, should you still be able to change the value seen by tenants with `ALTER TENANT`?
