
rfcs: multitenant cluster settings #73349

Merged (1 commit) on Dec 15, 2021

Conversation

RaduBerinde
Member

@RaduBerinde RaduBerinde commented Dec 1, 2021

Release note: None


Link to RFC text.

@RaduBerinde RaduBerinde requested a review from a team as a code owner December 1, 2021 21:57

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 45 at r1 (raw file):

 - in certain cases tenant code may need to consult values for cluster settings
   that apply to the host cluster: for example
   `kv.closed_timestamp.follower_reads_enabled` applies to the KV subsystem but

I think that this setting is a good example of something that probably shouldn't be a cluster setting but rather should be a zone configuration. See #70614

This is sort of irrelevant to this proposal; anything we move to zone/span configs, we don't need to think about as settings. However, @irfansharif and @arulajmani, we do need to think about how the tenants can read the defaults coming out of the host tenant that apply to them.

Collaborator

@arulajmani arulajmani left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 45 at r1 (raw file):

we do need to think about how the tenants can read the defaults coming out of the host tenant that apply to them.

This would be a natural extension to what @adityamaru and I are planning to propose in the (soon to be out) protected timestamp RFC. We're proposing a way to allow the host tenant to configure span configuration fields that apply to all or particular secondary tenants. The idea is to introduce a span configuration hierarchy in the host tenant that is hydrated on demand. This could be extended to give the host tenant the ability to set defaults or overrides for particular fields -- it seems like `kv.closed_timestamp.follower_reads_enabled` could use this, if we build a listening mechanism similar in principle to what is described for System visible below.

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

   An alternative approach previously suggested in #68406 would be to store
   these settings on the host side (for all tenants, in a new system table).
   However, this would be considerably more work.

Hm. Considerably more initial work perhaps -- though even then, I'm not sure it's that bad, if we're already extending the TenantConnector above to provide System-visible settings to the tenant, to also hook it up to a scan of tenant-specific settings values. But what I think might be missing from this "considerably more work" determination is that we would have an ongoing cost: maintaining the new requirement that there are keys in the tenant's SQL table keyspace that the tenant cannot write by any means. This seems pretty significant and easy to mess up (e.g. as mentioned above via RESTORE, or who knows what else we might overlook). I'm very apprehensive about this aspect of the proposal.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 152 at r1 (raw file):

4. Tenant

   This class is consistent with the current semantics (it's already

If we're building out the tenant connector to provide setting values from the host cluster anyway, and we extend that implementation as I suggest above to include looking up tenant-specific values, then I think we could also have a really useful feature for settings in this class: the ability for the host cluster to override any of these settings at runtime with a host-set value. Essentially, by creating an override of setting Y for tenant X, the host would temporarily move that setting from this class up to tenant read-only. This seems like it could be quite useful to SREs/operators of MT clusters when responding to incidents, until a code change can be deployed.

Indeed, I almost wonder if these two classes even need to be distinct in the code or if the difference between tenant r/w and tenant r/o is just whether or not a host-cluster override exists.

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…


It's not the initial scan I care about; it's keeping it up-to-date. What's the proposal there? We add a streaming RPC backed by a rangefeed? That's not crazy.

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, ajwerner wrote…


we're already proposing having that streaming RPC for the system-visible ones above, so I'm just proposing it also include tenant-specific values. Presumably that RPC is already backed by a rangefeed on system.settings for the system-visible class, so we could either have a second rangefeed on a new guest-tenant-settings table, or figure out a way to store guest tenant settings overrides in the host's system.settings (name prefix, new column, etc.).

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…


I suppose that that's true. When we first talked about it, I was thinking we could hijack the existing gossip connector, but I want to kill that thing so yeah.

Member Author

@RaduBerinde RaduBerinde left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, ajwerner wrote…


I agree about the ongoing cost.

One aspect to consider is how unified the code and logic for the system and non-system tenants will be. Using the tenant's settings table relies on most of the existing system tenant code. We would probably want to switch the system tenant to use the same mechanism for its settings.

There is something to the idea of somehow extending the settings table, I will give that some thought.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 152 at r1 (raw file):

Previously, dt (David Taylor) wrote…


How is this functionality different from the proposed SET TENANT CLUSTER SETTING? Is it that not only can the host change the setting, but it can also (temporarily) prevent the tenant from modifying it?

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 152 at r1 (raw file):

Previously, RaduBerinde wrote…


right -- I'm proposing the host cluster can set any setting using that syntax, not just those explicitly put in the "tenant read-only" class ahead of time, since the tenant pod would always read the host's exported override before its own local value. Thus SET TENANT CLUSTER SETTING could be used in an incident to override any setting, and it wouldn't matter if the guest tenant had their own value/tried to change it/etc.

Member Author

@RaduBerinde RaduBerinde left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, RaduBerinde wrote…


In this "host-side" proposal - how do tenants change a cluster setting? A tenant connector API?

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, RaduBerinde wrote…


Oh, I was proposing host-side storage only for the host-set values -- tenant-set values would still be written to their local system.settings table and that value would be read into their settings.Values, all just like it is today.

What I was suggesting is if the host wants to force a given setting to have a given value for a specific tenant, it can write that to some table in the host key span, either a special tenant settings table or some prefix of its system.settings table, and then that value flows over the tenant connector to the tenant (just like "host-visible" settings are described as doing above), and this would cause the in-mem setting updater, in the tenant pod, to use that value instead of looking in its local table.

Basically, today the value of a setting is `if (has value in system.settings) { that value } else { default }`, and I'd extend that to `if (got value for this setting from host via connector) { that value } else if (has value in (my) system.settings) { that value } else { default }`.

With that change to the settings updater, we could implement permanent tenant read-only settings (just write a value in the host) that are reliably stored out of mutation reach of the tenant, as well as temporary overrides for any (normally tenant-settable) setting, which seems nice.

Member Author

@RaduBerinde RaduBerinde left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…


Wouldn't it be simpler though to not maintain two separate mechanisms/locations? If we allow the ability to store any setting on the host side, why not store all of them there?

So I guess I'm currently favoring this design:

  • system.settings becomes a table that only exists in / is only used by the system tenant.
  • the table is enhanced to be able to store per-tenant values of certain settings. We could do this e.g. by prepending the tenant ID to the cluster setting name.
  • on each KV node, the system tenant sets up a range feed on the entire table and maintains all setting values (for all tenants).
  • the tenant connector has new APIs to listen for changes to cluster settings, and to set cluster settings.

Thoughts on this?

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, RaduBerinde wrote…


What's the rate limiting/abuse prevention mechanism?

Some reasons to keep two copies are:

  1. there's never any risk of a tenant trying to write to the system tenant
    • Helps with abuse prevention
  2. we can't currently split the settings table, and every time we write to it, it gets gossiped (we do plan to stop this in this release, but it seems bad to couple that with this architectural change)
    • Settings are usually small, so this probably doesn't matter too much.
  3. It makes it possible to tell a user story where the user has had a setting explicitly overridden; there's no possibility of a game where we try to set something and they try to set it back.
    • I worry about cases where we made the wrong choice about whether we want to let the client set it.

@andy-kimball
Contributor


docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, ajwerner wrote…


I think backup/restore would mess this up. Per-tenant settings should be transactional (right?), in which case if I restore to an earlier version of my tenant cluster, I'd expect all per-tenant settings to be reset to what's in the backup. But if we're storing the per-tenant settings in the system tenant, this won't work properly.

Contributor

@ajwerner ajwerner left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Per-tenant settings should be transactional (right?),

That is a really big deal I forgot about: the bookkeeping for migrations is, at its core, rooted in the cluster setting state for the version key. We need some amount of control over atomicity there.

Member

@dt dt left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, ajwerner wrote…


Hm. I sorta like the separation that two places gives us, in that it means tenant-written things remain confined to the tenant span, while host-written things are never in the tenant span. That separation seems nice for a couple of reasons: a backup of the tenant span captures, for example, all the settings they set along with everything else they've written. It also means we don't need to worry about an abuse angle, where you could store data in settings without it counting towards your total stored data; and confining tenant-initiated writes to tenant spans also sidesteps questions like DoS-ing the host's table with big requests. I'm sure we could mitigate each of these -- teach backup to go find more tenant-owned data outside the tenant span, add rate limits, and such -- but it seems like confining tenant writes to the tenant span as a hard rule is simpler to maintain, at the cost of, yeah, reading two things.

Contributor

@andy-kimball andy-kimball left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 73 at r1 (raw file):

   tenants see the same value for these settings. Changing one of these settings
   (from the system tenant) results in all tenants (asynchronously) getting the
   updated value.

Are these also read-only then? If so, I'd add that explicitly in the text.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 83 at r1 (raw file):

   via a new statement:
 SET TENANT <tenant> CLUSTER SETTING <setting> = <value>

I'm not sure how much control we have, but I think this would be more consistent with how SQL syntax works elsewhere:

SET CLUSTER SETTING <setting> = <value> [FOR TENANT <tenant>]

docs/RFCS/20211106_multitenant_cluster_settings.md, line 85 at r1 (raw file):

     SET TENANT <tenant> CLUSTER SETTING <setting> = <value>

We will also add a SHOW [ALL] TENANT <tenant> CLUSTER SETTINGS statement.

Similarly:

SHOW [ALL] CLUSTER SETTINGS [FOR TENANT <tenant>]

docs/RFCS/20211106_multitenant_cluster_settings.md, line 115 at r1 (raw file):

   There is not much work to do here, other than fully hiding these settings
   from non-system tenants. The cluster settings subsystem will not allow
   accessing these values from a tenant process (it will crash the tenant

This sounds like just trying to get or set these values will crash the process, which doesn't sound right. There shouldn't be a way within SQL to force-crash a process.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 130 at r1 (raw file):

   Values for these settings do not exist in tenant instances of the
   `system.settings` table. They only live in the system tenant's instance, with

How would SHOW ALL CLUSTER SETTINGS work? Would it show "System visible" setting as well as tenant settings? Does it have to union the settings in the tenant's system.settings table with ones it gets from the KV connector?


docs/RFCS/20211106_multitenant_cluster_settings.md, line 180 at r1 (raw file):

infrastructure as for `Tenant read-only` settings. This has the advantage of not
having to implement any new infrastructure, but it would complicate things on
the host side - when changing any such setting, we would have to change the

Is this so bad? How often do we change/add these settings? Could we use long-running migrations when we do (assuming it's very infrequent)?

Contributor

@andy-kimball andy-kimball left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…


I agree that we should keep per-tenant stuff in the tenant keyspace. Not doing so will lead us down a similar dark path to the one we've been trying to escape by separating logical SQL info from physical KV info as part of the multi-tenancy project. System tenant info belongs in the system tenant keyspace; regular tenant info belongs in the regular tenants' keyspaces. Otherwise, we have to do all sorts of nasty fixup when we backup/restore, stream clusters from one host to another, etc.

Member Author

@RaduBerinde RaduBerinde left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 115 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…


They would not exist as far as the user is concerned (i.e. they would get an "invalid cluster setting" error).


docs/RFCS/20211106_multitenant_cluster_settings.md, line 130 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…


Yes, it would union them.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 180 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…


The problem is also that we'd have to keep them in sync, e.g. if a backup restores older values. In any case, I think it's moot: it looks like we're moving in the direction of tenant read-only settings also being stored on the host side.

Contributor

@irfansharif irfansharif left a comment



docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…


I'm having some trouble following this thread. I've tried summarizing something below, with some questions around how many places a setting's value can be found. Is this what we're thinking?

  1. system.settings exists for all tenants, including the host tenant. It stores all tenant read+writable cluster settings. Example: sql.notices.enabled. For the host tenant, it would also include the "system hidden" settings, such as kv.allocator.qps_rebalance_threshold.

  2. system.tenant_settings (or similar) to store tenant-scoped read-only settings. The table exists only on the host tenant and is writable only by the host tenant. Each tenant pod would maintain a rangefeed over the relevant portion of the table to asynchronously receive such updates. When reading the value of a cluster setting, each tenant pod would stitch the values from (1) and (2) together. Are these settings disjoint? Or do we want the value from the host tenant to act as a break-the-glass override?

  3. system.global_settings (or similar) on the host tenant storing "system visible" settings, visible to all secondary tenants through the connector/rangefeed. For the host tenant, it's just another table to consult in addition to system.settings. Are the settings in this table disjoint from (1) and (2)? Or do we want it to also act as a global override of sorts?

If so, SGTM. It gives us txn-al semantics for all tenants (secondary or otherwise) over the settings they can write to. Backup/restore also works as you'd expect, affecting only the tenant's keyspace. When restoring the host tenant, I'm not sure if we restore secondary tenant metadata (system.tenants etc.) -- we must, right? That too would behave as expected. Fits in well with our threat model.

Member

@dt dt left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, @andy-kimball, @irfansharif, @knz, @otan, @RaduBerinde, and @stevendanna)


docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I'm having some trouble following this thread. I've tried summarizing something below, with some questions around how many places a setting's value can be found. Is this what we're thinking?

  1. system.settings exists for all tenants, including the host tenant. It stores within it all tenant read+writable cluster settings. Example: sql.notices.enabled. For the host tenant, it would also include the "system hidden" settings, such as kv.allocator.qps_rebalance_threshold.

  2. system.tenant_settings (or similar) to store tenant-scoped read-only settings. Table exists only on the host tenant, writable only by the host tenant. Each tenant pod would maintain a rangefeed to the relevant portion of the table to asynchronously receive such updates. In each tenant pod, when reading the value of a cluster setting, it would stitch the values from (1) and (2) together. Are these settings disjoint? Or do we want the value from the host tenant to act as a break-the-glass override?

  3. system.global_settings (or similar) on the host tenant storing "system visible" settings, visible to all secondary tenants through the connector/rangefeed. For the host tenant, it's just another table to consult in addition to system.settings. Are the settings in this table disjoint from (1) and (2)? Or do we want it to also act as a global override of sorts?

If so, SGTM. It gives us txn-al semantics for all tenants (secondary or otherwise) over the settings they can write to. Backup/restores also work as you'd expect, affecting only the tenant's keyspace. When restoring the host tenant, I'm not sure if we restore secondary tenant metadata (system.tenants etc.) -- we must, right? That too would behave as expected. Fits in well with our threat model.

  1. Yes. Every tenant, system included, has a system.settings table that they can read from and write to -- it is theirs.
  2. / 3. I think it's TBD exactly whether or not those are distinct from each other, or even whether they are distinct from the system tenant's system.settings if it learned to store additional copies of settings it wanted to apply to guest tenants. I don't think we need a separate global_settings at least.

My straw man is we make one new system table and it looks something like this:
CREATE TABLE system.tenant_setting_overrides (tenant_id INT NOT NULL, setting STRING NOT NULL, value BYTES NULL, reason STRING, PRIMARY KEY (tenant_id, setting)).

In this table, a tenant_id of 0 would indicate an override which applies to all tenants, while an ID > 0 indicates that it applies to that specific tenant.

The tenant connector RPC then opens two* rangefeeds: one on the host's system.settings, and one on this table.
When a host system.settings row changes, if that row is for a "host visible" setting, its value is emitted over the connector.
When a system.tenant_settings_overrides row changes, if its tenant_id is either 0 or the connected tenant's ID, its new value (or deletion thereof) and whether it was global (id=0) or specific (id=tenant) is emitted over the connector.

  (*) We could make this three rangefeeds if we wanted to separately listen to the id=0 and id=tenant prefixes of the overrides table to ignore irrelevant updates, as a crude form of predicate pushdown.

If we want all tenants to use some new value for a setting, i.e. make it "tenant read-only" but with the same value for all tenants (which differs from host-visible in that that value is not necessarily the host's value, but isn't tenant-specific either), we'd `INSERT INTO system.tenant_settings_overrides (0, some-setting, some-value, 'optionally explain why')`.
If we then want tenant 42 to actually use its own, specific value for that setting, ditto but 42 instead of 0.

Note above I said that when we emit to the tenant the override value for a specific setting, we include whether that override was specific to them or global. We could of course not do that, and just send them the specific value if it exists, or the global one if not. However, by sending them both, if the tenant hangs on to both, then when we clear a specific override we don't have to dig up the all-tenant value, if there is one, for them to fall back to. Instead, the tenant settings updater can just be asked to persist both overrides -- tenant-specific and tenant-global -- separately whenever they are updated, and then always use the specific one, if set, first.

An alternative to system.tenant_setting_overrides would be to extend the schema of system.settings to include an extra column for tenant_id so that in host clusters it also stores guest tenant overrides, e.g. you could make the PK tenant_id, setting but this seems a little messier: say you then store your own settings (i.e. when you run SET CLUSTER SETTING) under your own tenant_id, i.e. as if they were overrides for you. Now this table isn't portable between tenant ids, which is gross for restore. I suppose you could instead say you store your own under 1, since it is reserved, but that's sorta gross too, and you can't use null in a pk. I dunno, I'm sure we could figure it out, but it doesn't seem compelling to me to mix them: system.settings are "my" settings; the ones I set that affect how I run. system.tenant_settings_overrides are values I tell tenants about, but I don't use to control my own behavior. These seem like separate things, so I don't particularly see an advantage to mixing them.
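The lookup order implied by the straw man above -- tenant-specific override, then the id=0 all-tenants override, then the tenant's own setting, then the code default -- can be sketched in a few lines. This is a toy model only (Python purely for illustration; the real implementation would be Go, and every name below is hypothetical):

```python
# Toy model of the straw-man precedence: rows from the hypothetical
# system.tenant_setting_overrides table are keyed by (tenant_id, setting),
# with tenant_id 0 meaning "applies to all tenants". Illustrative names,
# not actual CockroachDB code.

CODE_DEFAULTS = {"sql.notices.enabled": "true"}

def resolve(setting, tenant_id, overrides, local):
    """overrides: {(tenant_id, setting): value}; local: the tenant's own
    system.settings contents. The first layer that is set wins."""
    specific = overrides.get((tenant_id, setting))
    if specific is not None:
        return specific  # host override targeted at this tenant
    all_tenants = overrides.get((0, setting))
    if all_tenants is not None:
        return all_tenants  # host override targeted at every tenant
    if setting in local:
        return local[setting]  # the tenant's own SET CLUSTER SETTING value
    return CODE_DEFAULTS[setting]  # hardcoded default
```

With this shape, deleting the (42, setting) row makes tenant 42 fall back to the (0, setting) row automatically, which is the fallback behavior discussed above.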

Member

@dt dt left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, @andy-kimball, @irfansharif, @knz, @otan, @RaduBerinde, and @stevendanna)


docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…
  1. Yes. Every tenant, system included, has a system.settings table that they can read from and write to -- it is theirs.
  2. / 3. I think it's TBD exactly whether or not those are distinct from each other, or even whether they are distinct from the system tenant's system.settings if it learned to store additional copies of settings it wanted to apply to guest tenants. I don't think we need a separate global_settings at least.

My straw man is we make one new system table and it looks something like this:
CREATE TABLE system.tenant_setting_overrides (tenant_id INT NOT NULL, setting STRING NOT NULL, value BYTES NULL, reason STRING, PRIMARY KEY (tenant_id, setting)).

In this table, a tenant_id of 0 would indicate an override which applies to all tenants, while an ID > 0 indicates that it applies to that specific tenant.

The tenant connector RPC then opens two* rangefeeds: one on the host's system.settings, and one on this table.
When a host system.settings row changes, if that row is for a "host visible" setting, its value is emitted over the connector.
When a system.tenant_settings_overrides row changes, if its tenant_id is either 0 or the connected tenant's ID, its new value (or deletion thereof) and whether it was global (id=0) or specific (id=tenant) is emitted over the connector.

  (*) We could make this three rangefeeds if we wanted to separately listen to the id=0 and id=tenant prefixes of the overrides table to ignore irrelevant updates, as a crude form of predicate pushdown.

If we want all tenants to use some new value for a setting, i.e. make it "tenant read-only" but with the same value for all tenants (which differs from host-visible in that that value is not necessarily the host's value, but isn't tenant-specific either), we'd `INSERT INTO system.tenant_settings_overrides (0, some-setting, some-value, 'optionally explain why')`.
If we then want tenant 42 to actually use its own, specific value for that setting, ditto but 42 instead of 0.

Note above I said that when we emit to the tenant the override value for a specific setting, we include whether that override was specific to them or global. We could of course not do that, and just send them the specific value if it exists, or the global one if not. However, by sending them both, if the tenant hangs on to both, then when we clear a specific override we don't have to dig up the all-tenant value, if there is one, for them to fall back to. Instead, the tenant settings updater can just be asked to persist both overrides -- tenant-specific and tenant-global -- separately whenever they are updated, and then always use the specific one, if set, first.

An alternative to system.tenant_setting_overrides would be to extend the schema of system.settings to include an extra column for tenant_id so that in host clusters it also stores guest tenant overrides, e.g. you could make the PK tenant_id, setting but this seems a little messier: say you then store your own settings (i.e. when you run SET CLUSTER SETTING) under your own tenant_id, i.e. as if they were overrides for you. Now this table isn't portable between tenant ids, which is gross for restore. I suppose you could instead say you store your own under 1, since it is reserved, but that's sorta gross too, and you can't use null in a pk. I dunno, I'm sure we could figure it out, but it doesn't seem compelling to me to mix them: system.settings are "my" settings; the ones I set that affect how I run. system.tenant_settings_overrides are values I tell tenants about, but I don't use to control my own behavior. These seem like separate things, so I don't particularly see an advantage to mixing them.

I sorta handwaved this above, but the consumer of this connector's stream of overrides would be the settingsUpdater in tenant pods.

My straw man there is we extend settings.Values to have a couple layers of atomic value in each slot, e.g. hostOverride, hostGlobal, local. local is just the current value -- whatever my settingsUpdater read from my system.settings, or the code default if not (so it always has a value). Any read of a setting takes the first one that is set. As an optimization, if the two extra atomic reads are too expensive, we could easily keep a flattened currentValue container too.

As described above, when the updater sees an override come in on the connector, it updates the corresponding container. As mentioned above, we keep separate hostOverride and hostGlobal containers so that if the host sends {TenantSpecific, 'setting-foo', &'value-bar'}, then {AllTenants, 'setting-foo', &'value-baz'} and then finally just sends a {TenantSpecific, 'setting-foo', nil} when our override is deleted, we want to be able to just clear hostOverride and then have hostGlobal containing value-baz take over, instead of demanding the host go look up/re-emit value-baz for us.
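The layered containers described above can be modeled as a toy sketch (hostOverride / hostGlobal / local are the straw-man names from this thread, not actual settings.Values fields; Python used purely for illustration of the logic):

```python
# Each setting slot keeps the three straw-man layers; reads take the first
# layer that is set. Applying a deletion (value=None) to the tenant-specific
# layer lets the retained all-tenants value take over without the host
# having to re-emit it. Illustrative only, not CockroachDB code.

TENANT_SPECIFIC, ALL_TENANTS = "TenantSpecific", "AllTenants"

class Slot:
    def __init__(self, code_default):
        self.host_override = None  # per-tenant override from the host
        self.host_global = None    # all-tenants override from the host
        self.local = None          # whatever our settingsUpdater read locally
        self.code_default = code_default

    def apply(self, scope, value):
        # value of None means the override was deleted.
        if scope == TENANT_SPECIFIC:
            self.host_override = value
        else:
            self.host_global = value

    def get(self):
        for v in (self.host_override, self.host_global, self.local):
            if v is not None:
                return v
        return self.code_default
```

Replaying the sequence above -- {TenantSpecific, 'value-bar'}, then {AllTenants, 'value-baz'}, then {TenantSpecific, nil} -- leaves get() returning 'value-baz' without any extra lookup on the host side.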

@RaduBerinde
Member Author


docs/RFCS/20211106_multitenant_cluster_settings.md, line 148 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I sorta handwaved this above, but the consumer of this connector's stream of overrides would be the settingsUpdater in tenant pods.

My straw man there is we extend settings.Values to have a couple layers of atomic value in each slot, e.g. hostOverride, hostGlobal, local. local is just the current value -- whatever my settingsUpdater read from my system.settings, or the code default if not (so it always has a value). Any read of a setting takes the first one that is set. As an optimization, if the two extra atomic reads are too expensive, we could easily keep a flattened currentValue container too.

As described above, when the updater sees an override come in on the connector, it updates the corresponding container. As mentioned above, we keep separate hostOverride and hostGlobal containers so that if the host sends {TenantSpecific, 'setting-foo', &'value-bar'}, then {AllTenants, 'setting-foo', &'value-baz'} and then finally just sends a {TenantSpecific, 'setting-foo', nil} when our override is deleted, we want to be able to just clear hostOverride and then have hostGlobal containing value-baz take over, instead of demanding the host go look up/re-emit value-baz for us.

@irfansharif on 2, tenant pods can't set up rangefeeds outside of their tenant-prefixed KV span. Each KV node would need to set up this range feed and be ready to serve the values via the tenant connector.

@craig craig bot merged commit 45390e7 into cockroachdb:master Dec 15, 2021
@RaduBerinde
Member Author

I opened #73857 to track the various implementation pieces. Please take a look and let me know if I missed anything.

@RaduBerinde RaduBerinde deleted the settings-rfc branch December 16, 2021 01:39
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 16, 2021
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 17, 2021
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 20, 2021
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 20, 2021
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
craig bot pushed a commit that referenced this pull request Dec 21, 2021
72992: util/log: fix redactability of logging tags r=abarganier a=knz

Fixes #72905.

Some time in the v21.2 cycle, the log entry preparation logic was
refactored and a mistake was introduced: the logging tags were no
longer subject to the redaction logic. The result was that redaction
markers were missing in log tag values, and if a value had contained
unbalanced redaction markers in a value string (say, as part of a SQL
table key), it would have caused log file corruption and possibly a
confidential data leak.

This patch fixes that, by preparing the logging tags in the same way
as the main message for each entry.

Release note (cli change): A bug affecting the redactability of
logging tags in output log entries has been fixed. This bug had
been introduced in the v21.2 release.

73937: setting: introduce setting classes r=RaduBerinde a=RaduBerinde

This commit introduces the three setting classes in the RFC (#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None

73978: opt: fix like escape processing for span constraints r=cucaroach a=cucaroach

Fixes: #44123

Previously no attempt was made to properly handle escape ('\\') sequences
in like patterns being turned into constraints. Refactor code used to
process like at runtime to generate a regexp and use that to properly
handle index constraint generation.

Release note (sql change): Escape character processing was missing from
constraint span generation which resulted in incorrect results when
doing escaped like lookups.

74102: sql: do not fetch virtual columns during backfill r=mgartner a=mgartner

Fixes #73372

Release note (bug fix): A bug has been fixed that caused internal errors
when altering the primary key of a table. The bug was only present if
the table had a partial index with a predicate that referenced a virtual
computed column. This bug was present since virtual computed columns
were added in version 21.1.0.

74110: bazel: require setting `cockroach_cross=y` to opt into cross toolchains r=irfansharif a=rickystewart

With `--incompatible_enable_cc_toolchain_resolution` set in #73819, now
Bazel selects the appropriate toolchain for you. Bazel was selecting the
`cross_linux_toolchain` when building for the host platform on Linux,
resulting in link errors when trying to compile `stress` under `race`.
We update the toolchains to instead require opting into the cross
toolchains by defining `cockroach_cross=y`.

Closes #73997.

Release note: None

74111: bench/rttanalysis: allow roundtrips to be off by 1 r=ajwerner a=ajwerner

If we don't have a range, let the currently estimate be wrong by 1.
We mostly care about the ballpark and the growth rate. I'm sick of
these flakes.

Fixes #73884.

Release note: None

74152: ci: don't delete `test.json.txt` after processing r=tbg a=rickystewart

We've seen some issues where Bazel jobs are failing in `github-post`
(#73841, #74013) with the following output:

    found outstanding output. Considering last test failed:

It's hard to say what the problem is because these scripts haven't kept
the `test.json.txt` in `artifacts`. Here I remove the logic to clean up
the file so we can RC further instances of the problem.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 23, 2021
This change implements the `system-only` semantics in the RFC
(cockroachdb#73349).

All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.

Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.
gustasva pushed a commit to gustasva/cockroach that referenced this pull request Jan 4, 2022
This commit introduces the three setting classes in the RFC (cockroachdb#73349):
`SystemOnly`, `TenantReadOnly`, and `TenantWritable`. The `SystemOnly`
class replaces the existing `WithSystemOnly()`.

In this change we don't yet implement the advertised semantics. We
mechanically use `TenantWritable` for all settings except those that
were using `WithSystemOnly()` which use `SystemOnly`; this should not
change any existing behavior. The classes will be revisited in a
separate change, after we implement the semantics.

Release note: None
Contributor

@vy-ton vy-ton left a comment

Some questions/comments as I make my way through my RFC review backlog from the PM perspective.


New semantics for existing statements for tenants:
- `SHOW [ALL] CLUSTER SETTINGS` shows the `tenant-ro` and `tenant-rw` settings.
`Tenant-ro` settings that have an override from the host side are marked as
Contributor

tenant-rw settings can also have an override and should also be marked differently?

- `ALTER TENANT <id> RESET CLUSTER SETTING <setting>`
- Resets the tenant setting. For `tenant-ro`, the value reverts to the `ALL`
value (if it is set), otherwise to the setting default. For `tenant-rw`,
the value reverts to the `ALL` value (if it is set), otherwise to whatever
Contributor

To check my understanding, we will have 3 possible values for each setting?

  • default value
  • ALL value
  • tenant specific value(s)


### SQL changes

New statements for the system tenant only (which concern only `tenant-ro` and
Contributor

On single-tenant clusters, will these statements return an error then?

PRIMARY KEY (tenant_id, name)
)
```
This table is only used on the system tenant. All-non-system-tenant values
Contributor

All-non-system-tenant values are stored in tenant_id=0.

Why are they not stored under a different tenant_id?

active.

The settings on the system tenant will continue to work. On non-system tenants,
any locally changed settings that are now `system` or `tenant-rw` will revert to
Contributor

This should be system or tenant-ro, no? Since those are the ones non-system tenants can no longer modify

`tenant-rw` setting.

- control settings relevant to tenant-specific internal implementation (like
tenant throttling) that we want to be able to control per-tenant should be
Contributor

Will system settings need to be controlled per tenant? I thought they only applied to the host cluster

Member Author

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20211106_multitenant_cluster_settings.md, line 101 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

On single-tenant clusters, will these statements return an error then?

The statements that require a valid tenant id would error out since there is no valid id that you can use. ALTER TENANT ALL would work (with no consequence). There is no way to differentiate between a single-tenant cluster and a host cluster that has no tenants created yet.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 117 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

To check my understanding, we will have 3 possible values for each setting?

  • default value
  • ALL value
  • tenant specific value(s)

Conceptually there are four values, as follows:

  1. tenant-specific value set by the host cluster (via ALTER TENANT <id> SET CLUSTER SETTING).
  2. all-tenant value set by the host cluster (via ALTER TENANT ALL SET CLUSTER SETTING).
  3. value set by the tenant (only for tenant-rw!).
  4. default value (hardcoded).

This is the order of precedence - the top-most that is set is the visible value.
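One way to picture how this precedence interacts with the setting classes is a short sketch (hypothetical Python, purely illustrative; it assumes the tenant-set value at layer 3 only participates for tenant-rw settings, since tenants cannot set tenant-ro values):

```python
# Hypothetical illustration of the four-level precedence described above:
# host tenant-specific value, then host ALL value, then the tenant's own
# value (tenant-rw only), then the hardcoded default. Not CockroachDB code.

def visible_value(setting_class, host_specific, host_all, tenant_set, default):
    layers = [host_specific, host_all]
    if setting_class == "tenant-rw":
        layers.append(tenant_set)  # tenant-set values count only for rw
    for v in layers:
        if v is not None:
            return v
    return default
```

For example, a tenant-ro setting with no host-side value shows the hardcoded default even if a stale tenant-written row exists, while a tenant-rw setting with a host override shows the override regardless of the tenant's own value.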


docs/RFCS/20211106_multitenant_cluster_settings.md, line 135 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

tenant-rw settings can also have an override and should also be marked differently?

I think it's a typo; it was meant to say tenant-rw. For tenant-ro, all settings are effectively overrides (but we can make that clear in the statement output too).


docs/RFCS/20211106_multitenant_cluster_settings.md, line 166 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

All-non-system-tenant values are stored in tenant_id=0.

Why are they not stored under different tenant_id?

This is referring to the ALL value (ALTER TENANT ALL SET CLUSTER SETTING). ALL values are stored in the table under tenant_id=0.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 199 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

This should be system or tenant-ro, no? Since those are the ones non-system tenants can no longer modify

Yes, should be tenant-ro.


docs/RFCS/20211106_multitenant_cluster_settings.md, line 217 at r8 (raw file):

Previously, vy-ton (Vy Ton) wrote…

Will system settings need to be controlled per tenant? I thought they only applied to the host cluster

Right, should have been tenant-ro, it has been fixed since (by #73937).

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Jan 10, 2022
This change implements the `system-only` semantics in the RFC
(cockroachdb#73349).

All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.

Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.
craig bot pushed a commit that referenced this pull request Jan 10, 2022
74253: setting: prevent use of SystemOnly settings from tenant code r=RaduBerinde a=RaduBerinde

#### setting: clean up slot indexes

The internal setting code passes around slot indexes as bare integers,
and there are confusingly two variants: one is 1-indexed, another is
0-indexed.

This change makes all slot indexes 0-indexed and adds a `slotIdx`
type.

Note: the 1-indexed choice was perhaps useful at the time to help find
code paths where the slot index is not initialized, but now the slot
index is set along with other mandatory fields by `init()`.

Release note: None

#### setting: prevent use of SystemOnly settings from tenant code

This change implements the `system-only` semantics in the RFC
(#73349).

All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.

Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.


74575: build: fix Pebble nightly benchmarks r=nicktrav a=jbowens

The Pebble nightly benchmarks stopped running due to a build error
stemming from the removal of the generated code from the codebase.

Release note: None

74598: scpb and scplan refactor r=postamar a=postamar

This change contains two refactoring commits:
1. `scpb: move Node to screl, introduce scpb.TargetState`. This commit is best reviewed by looking at the changes in scpb first, and then in the protos, and then the rest. The most subsequent complexity is around the event logging and in the stage builder, which benefitted from a deeper rewrite. 
2. `scplan: make child packages internal, including scgraph and scgraphviz`. This just moves stuff around, although there are a few notable changes in `plan.go`.


74639: authors: add gtr to authors r=gtr a=gtr

authors: add gtr to authors

Release note: None

Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: Gerardo Torres <[email protected]>
RajivTS added a commit to RajivTS/cockroach that referenced this pull request Jan 17, 2022
Updated planProjectionOperators and planSelectionOperators to include case for tree.NotExpr

Replaced RunTests with RunTestsWithoutAllNullsInjection

Fixes for build & lint issues

Removed unused field

Addressing PR review comments

backupccl: use SpanGroup.Sub instead of CoveringMerge

Release note: none.

rowenc: reduce reliance on the entire TableDescriptor

This change reduces uses of the entire TableDescriptor.

Release note: None

descpb: add and use NoColumnID constant

This change adds a `descpb.NoColumnID` constant so we don't have to
use `descpb.ColumnID(encoding.NoColumnID)` everywhere.

Release note: None

rowenc: remove DatumAlloc.scratch

This commit removes the `DatumAlloc.scratch` field which is always
nil.

Release note: None

sql: move `DatumAlloc` to sem/tree

This commit moves `DatumAlloc` from `sql/rowenc` to `sql/sem/tree`.
This is a fairly low-level construct and is not used exclusively for
row encoding/decoding.

Release note: None

rowenc: move low-level key encoding functions to subpackage

The `rowenc` package contains a hefty mix of functions operating at
different conceptual levels - some use higher level constructs like
table and index descriptors, others are lower level utilities. The
naming of these functions doesn't help, for example `EncodeTableKey`
sounds like a top-level function but is actually a low level utility
that appends a single value to a key

This commit moves the lower level utilities for encoding datum values
into keys to the `rowenc/keyside` package.

Release note: None

rowenc: minor cleanup of array decoding code

The code around array decoding was confusing; we now have a single
`decodeArray` variant which is the inverse of `encodeArray`.

Release note: None

rowenc: move low-level value encoding functions to subpackage

This commit moves the lower level utilities for encoding datum values
into values to the `rowenc/valueside` package.

Release note: None

valueside: introduce ColumnIDDelta type

When encoding multiple values, we encode the differences between the
ColumnIDs. This difference is passed to `valueside.Encode`, but it is
passed as a `descpb.ColumnID`. This commit introduces a new type to
make this distinction more evident.

Release note: None

valueside: add Decoder helper

We have a few copies of the same logic of decoding multiple SQL values
from a Value, in particular in range feed handling code.

This commit adds a helper type that centralizes this logic, reducing
duplication.

Note that the query execution engines retain their own specialized
implementations.

Release note: None

sql: release BatchFlowCoordinator objects to pool

This commit adds `BatchFlowCoordinator` objects to their `vectorizedFlowCreator`'s
`releasables` slice, which ensures that their `Release` method is called and that
they are properly recycled.

Before this change, heap profiles and instrumentation both indicated that these
objects were not being recycled as intended. For example, heap profiles looked like:
```
      Type: alloc_objects
Time: Dec 30, 2021 at 2:10pm (EST)
Active filters:
   focus=Pool\).Get
Showing nodes accounting for 92189, 0.57% of 16223048 total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                             71505 79.69% |   github.com/cockroachdb/cockroach/pkg/sql/colflow.NewBatchFlowCoordinator /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:218
                                             10923 12.17% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.newLockTableGuardImpl /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table.go:452
                                              2048  2.28% |   github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree.(*Map).maybeInitialize /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree/map.go:112
                                              1170  1.30% |   github.com/cockroachdb/cockroach/pkg/sql/colexec.newMaterializerInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:179
                                              1030  1.15% |   github.com/cockroachdb/pebble.(*Iterator).Clone /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:1228
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/sql.MakeDistSQLReceiver /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:841
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/sql/colfetcher.NewColBatchScan /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colfetcher/colbatch_scan.go:223
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/sql/span.MakeBuilder /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/span/span_builder.go:61
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/storage.mvccGet /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:859
                                               585  0.65% |   github.com/cockroachdb/cockroach/pkg/storage.mvccScanToBytes /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2357
                                               128  0.14% |   github.com/cockroachdb/pebble.(*DB).newIterInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:765
         0     0%     0%      89729  0.55%                | sync.(*Pool).Get /Users/nathan/Go/go/src/sync/pool.go:148
```
Notice that `sync.(*Pool).Get` is responsible for *0.55%* of heap allocations and
**79.69%** of these are from `colflow.NewBatchFlowCoordinator`.

After this change, that source of allocations goes away and we see the following
impact on micro-benchmarks:
```
name                    old time/op    new time/op    delta
KV/Scan/SQL/rows=1-10     95.1µs ± 7%    95.9µs ± 5%    ~     (p=0.579 n=10+10)
KV/Scan/SQL/rows=10-10     100µs ± 3%     103µs ±12%    ~     (p=0.829 n=8+10)

name                    old alloc/op   new alloc/op   delta
KV/Scan/SQL/rows=10-10    21.7kB ± 0%    21.5kB ± 0%  -0.76%  (p=0.000 n=10+10)
KV/Scan/SQL/rows=1-10     20.1kB ± 0%    19.9kB ± 0%  -0.70%  (p=0.000 n=10+9)

name                    old allocs/op  new allocs/op  delta
KV/Scan/SQL/rows=1-10        245 ± 0%       244 ± 0%  -0.41%  (p=0.000 n=10+10)
KV/Scan/SQL/rows=10-10       280 ± 0%       279 ± 0%  -0.36%  (p=0.001 n=8+9)
```
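
The fix is an instance of the standard `sync.Pool` recycle pattern: objects are only reused if `Release` actually runs. A minimal sketch under assumed, simplified names (the real `BatchFlowCoordinator` plumbing is more involved):

```go
package main

import (
	"fmt"
	"sync"
)

// coordinator stands in for colflow.BatchFlowCoordinator (simplified).
type coordinator struct{ id int }

var coordinatorPool = sync.Pool{
	New: func() interface{} { return &coordinator{} },
}

// newCoordinator returns a possibly-recycled object from the pool.
func newCoordinator() *coordinator {
	return coordinatorPool.Get().(*coordinator)
}

// Release resets the object and returns it to the pool. The bug fixed by
// this commit was, in effect, that Release was never reached, so every
// flow allocated a fresh object.
func (c *coordinator) Release() {
	*c = coordinator{}
	coordinatorPool.Put(c)
}

func main() {
	c := newCoordinator()
	c.id = 42
	c.Release() // without this call, the pool never sees the object again
	fmt.Println("released")
}
```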

sql/catalog: restore fast-path in FullIndexColumnIDs

This commit restores a [fast-path](https://github.com/cockroachdb/cockroach/commit/c9e116e586f24c5f3a831ac653f14fd03f588b93#diff-19625608f4a6e23e6fe0818f3a621e716615765cb338d18fe34b43f0a535f06dL140)
in `FullIndexColumnIDs` which was lost in c9e116e. The fast-path avoided
the allocation of a `ColumnID` slice and a `IndexDescriptor_Direction`
slice in `FullIndexColumnIDs` when given a unique index. In such cases,
these slices are already stored on the `IndexDescriptor`.

```
name                   old time/op    new time/op    delta
KV/Scan/SQL/rows=1-10    94.9µs ±10%    94.9µs ± 8%    ~     (p=0.739 n=10+10)

name                   old alloc/op   new alloc/op   delta
KV/Scan/SQL/rows=1-10    20.1kB ± 0%    20.1kB ± 1%    ~     (p=0.424 n=10+10)

name                   old allocs/op  new allocs/op  delta
KV/Scan/SQL/rows=1-10       245 ± 0%       241 ± 0%  -1.63%  (p=0.000 n=10+8)
```

kv: protect Replica's lastToReplica and lastFromReplica fields with raftMu

This commit moves the Replica's lastToReplica and lastFromReplica from
under the `Replica.mu` mutex to the `Replica.raftMu` mutex. These are
strictly Raft-specific pieces of state, so we don't need fine-grained
locking around them. As a reward, we don't need to grab the `Replica.mu`
exclusively (or at all) when setting the fields in
`Store.withReplicaForRequest`.

The locking in `setLastReplicaDescriptors` showed up in a mutex profile
under a write-heavy workload. It was responsible for **3.44%** of mutex
wait time. Grabbing the mutex was probably also slowing down request
processing, as the exclusive lock acquisition had to wait for read locks
to be dropped.

coldata: operate on Nulls value, not reference

This commit changes `col.Vec.SetNulls` to accept a `Nulls` struct by
value instead of by pointer. This lets us avoid a heap allocation on
each call to `Nulls.Or`.

Release note: None
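
Small structs passed and returned by value can stay on the caller's stack, whereas a pointer result typically escapes to the heap. A sketch of the signature change under simplified, assumed names (the real `coldata.Nulls` stores more state):

```go
package main

import "fmt"

// Nulls is a simplified stand-in for coldata.Nulls: a bitmap of null values.
type Nulls struct{ bits []byte }

// Or returns the union of two null bitmaps by value; the value result can
// be handed straight to SetNulls without the struct escaping to the heap.
func (n Nulls) Or(other Nulls) Nulls {
	res := Nulls{bits: make([]byte, len(n.bits))}
	for i := range n.bits {
		res.bits[i] = n.bits[i] | other.bits[i]
	}
	return res
}

// Vec mirrors the signature change: SetNulls now takes Nulls by value
// instead of *Nulls.
type Vec struct{ nulls Nulls }

func (v *Vec) SetNulls(n Nulls) { v.nulls = n }

func main() {
	a := Nulls{bits: []byte{0b0001}}
	b := Nulls{bits: []byte{0b0100}}
	var v Vec
	v.SetNulls(a.Or(b))
	fmt.Printf("%04b\n", v.nulls.bits[0]) // prints 0101
}
```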

roachtest: fix silly bug in tlp roachtest

I thought that we'd get some basic coverage of roachtests in unit tests
at low duration or something, but I guess not.

Release note: None

sql: support readonly default_with_oids var

This is just one less error when importing PGDUMP.

Release note (sql change): We now support `default_with_oids` which
only accepts being false.

importccl,jobs: truncate all job system errors

In #73303 we truncated the row data in this error message to prevent
problems with large row data preventing the job status from being
saved.

However, truncating the row means that our
"experimental_save_rejected" option does not work as expected. This
feature previously saved entire rows, but now it would save truncated
rows in some cases.

Now, we only truncate the error when producing the error
string. Further, we also add truncation to the job system to prevent
other parts of the system from saving too much data to the job system.

Release note: None

importccl: remove unused argument

Release note: None

sql/logictest: add diff support to expectation output

This adds support for generating a unified diff for expectation
mismatches. This can make it easier to spot the differences between
two results when the result contains many rows.

For example:

```
    ../../sql/logictest/testdata/logic_test/system_namespace:54: SELECT * FROM system.namespace
    expected:
        0   0   defaultdb                        50
        0   0   postgres                         51
        0   0   system                           1
        0   0   test                             52
        1   0   public                           29
        1   29  comments                         24
        1   29  database_role_settings           44
        1   29  descriptor                       3
        1   29  descriptor_id_seq                7
        1   29  eventlog                         12
        1   29  jobs                             15
        1   29  join_tokens                      41
        1   29  lease                            11
        1   29  locations                        21
        1   29  migrations                       40
        1   29  namespace                        30
        1   29  protected_ts_meta                31
        1   29  protected_ts_records             32
        1   29  rangelog                         13
        1   29  replication_constraint_stats     25
        1   29  replication_critical_localities  2
        1   29  replication_stats                27
        1   29  reports_meta                     28
        1   29  role_members                     23
        1   29  role_options                     33
        1   29  scheduled_jobs                   37
        1   29  settings                         6
        1   29  sql_instances                    46
        1   29  sqlliveness                      39
        1   29  statement_bundle_chunks          34
        1   29  statement_diagnostics            36
        1   29  statement_diagnostics_requests   35
        1   29  statement_statistics             42
        1   29  table_statistics                 20
        1   29  transaction_statistics           43
        1   29  ui                               14
        1   29  users                            4
        1   29  web_sessions                     19
        1   29  zones                            5
        50  0   public                           29
        51  0   public                           29
        52  0   public                           29

    but found (query options: "rowsort" -> ignore the following ordering of rows) :
        0   0   defaultdb                        50
        0   0   postgres                         51
        0   0   system                           1
        0   0   test                             52
        1   0   public                           29
        1   29  comments                         24
        1   29  database_role_settings           44
        1   29  descriptor                       3
        1   29  descriptor_id_seq                7
        1   29  eventlog                         12
        1   29  jobs                             15
        1   29  join_tokens                      41
        1   29  lease                            11
        1   29  locations                        21
        1   29  migrations                       40
        1   29  namespace                        30
        1   29  protected_ts_meta                31
        1   29  protected_ts_records             32
        1   29  rangelog                         13
        1   29  replication_constraint_stats     25
        1   29  replication_critical_localities  26
        1   29  replication_stats                27
        1   29  reports_meta                     28
        1   29  role_members                     23
        1   29  role_options                     33
        1   29  scheduled_jobs                   37
        1   29  settings                         6
        1   29  sql_instances                    46
        1   29  sqlliveness                      39
        1   29  statement_bundle_chunks          34
        1   29  statement_diagnostics            36
        1   29  statement_diagnostics_requests   35
        1   29  statement_statistics             42
        1   29  table_statistics                 20
        1   29  transaction_statistics           43
        1   29  ui                               14
        1   29  users                            4
        1   29  web_sessions                     19
        1   29  zones                            5
        50  0   public                           29
        51  0   public                           29
        52  0   public                           29

    Diff:
    --- Expected
    +++ Actual
    @@ -20,3 +20,3 @@
         1   29  replication_constraint_stats     25
    -    1   29  replication_critical_localities  2
    +    1   29  replication_critical_localities  26
         1   29  replication_stats                27
logic.go:3340:
    ../../sql/logictest/testdata/logic_test/system_namespace:100: error while processing
```

The diff output is only added when the `-show-diff` flag is provided
because in some cases it can produce more noise than it is worth.

The diff library used here is the same one already used by the testify
libraries we depend on.

Release note: None

teamcity-trigger: give `kvserver` package higher stress timeout

Closes https://github.com/cockroachdb/cockroach/issues/69519.

Release note: None

backupccl: allow columns of type array in EXPORT PARQUET

Previously, EXPORT PARQUET only supported CRDB relations whose columns were
scalars of type int, string, float, or boolean. This change allows columns of
type array whose values can be int, string, float, or boolean. Following the
CRDB ARRAY documentation, a value in an exported array can be NULL as well.

Informs: #67710

Release note (sql change): EXPORT PARQUET can export columns of type array

backupccl: remove old makeImportSpans

Release note: none.

backupccl: expand some comments on covering code

Release note: none.

security: make the bcrypt cost configurable

Release note (security update): For context, when configuring
passwords for SQL users, if the client presents the password in
cleartext via ALTER/CREATE USER/ROLE WITH PASSWORD, CockroachDB is
responsible for hashing this password before storing it.
By default, this hashing uses CockroachDB's bespoke `crdb-bcrypt`
algorithm, itself based off the standard Bcrypt algorithm.

The cost of this hashing function is now configurable via the new
cluster setting
`server.user_login.password_hashes.default_cost.crdb_bcrypt`.
Its default value is 10, which corresponds to an approximate
password check latency of 50-100ms on modern hardware.

This value should be increased over time to reflect improvements to
CPU performance: the latency should not become so small that it
becomes feasible to bruteforce passwords via repeated login attempts.

Future versions of CockroachDB will likely update the default accordingly.
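
The cost parameter is logarithmic: bcrypt performs 2^cost key-expansion iterations, so each increment of the setting doubles password-check latency. A quick arithmetic sketch (helper name is illustrative):

```go
package main

import "fmt"

// bcryptRounds returns the iteration count implied by a bcrypt cost
// parameter: 2^cost. Each +1 to the cost doubles hashing latency, which
// is why the default should track CPU performance over time.
func bcryptRounds(cost uint) uint64 {
	return 1 << cost
}

func main() {
	for _, cost := range []uint{10, 11, 12} {
		fmt.Printf("cost=%d rounds=%d\n", cost, bcryptRounds(cost))
	}
}
```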

security: make `server.user_login.min_password_length` visible in doc gen

This cluster setting was meant to be exported for visibility in
auto-generated docs (we've documented it before). This was an oversight.

Release note: None

authors: add Fenil Patel to authors

Release note: None

bench: force the index join in BenchmarkIndexJoin

Without using the index hint, we now choose to perform the full scan
over the primary index and put the filter on top of that rather than
performing a limited scan of the secondary followed by an index join.
I confirmed that this is the case on 21.1 and 21.2 binaries, so I'll
backport to the latter (in order to make the comparison between releases
sane).

Release note: None

dev: a few improvements to the script

* Add `set -e` so if the `dev` build fails, it doesn't try to run the
  binary anyway.
* Add a way to force recompilation of `dev` (useful for developers).

Release note: None

changefeedccl: shut down tenant before main server

Shut down the tenant server before stopping the main test server.
This eliminates some of the error messages we see in the test
output when the tenant attempts to connect to ranges which are no
longer accessible.

Release note: None

importccl: fix import pgdump target column bug

Previously, if a COPY FROM statement had fewer columns than
the CREATE TABLE schema defined in the dump file,
we would get a nil pointer exception. This is because
we were not filling the non-targeted columns with a NULL datum.
This change fixes that and aligns behaviour with how INSERT
handles non-targeted columns.

Release note (bug fix): IMPORT TABLE ... PGDUMP with a
COPY FROM statement in the dump file that has fewer target columns
than the CREATE TABLE schema definition no longer results in a
nil pointer exception.

roachtest: add new passing tests for pgjdbc nightly

Release note: None

go.mod: fix ordering

The first `require` is for direct dependencies. The second one for
indirect dependencies. There's no need for a third one.

Release note: None

Makefile: specify that code needs to be generated for `roach{test,prod}`

Release note: None

setting: clean up slot indexes

The internal setting code passes around slot indexes as bare integers,
and there are confusingly two variants: one is 1-indexed, another is
0-indexed.

This change makes all slot indexes 0-indexed and adds a `slotIdx`
type.

Note: the 1-indexed choice was perhaps useful at the time to help find
code paths where the slot index is not initialized, but now the slot
index is set along with other mandatory fields by `init()`.

Release note: None
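
A defined type for the index is the core of the cleanup: bare integers can no longer be silently used where a slot index is expected. A small sketch of the idea (struct and method names are illustrative):

```go
package main

import "fmt"

// slotIdx is a 0-indexed position in the settings registry's slot table.
// Using a defined type instead of a bare int makes accidental mixing with
// other integers a compile-time error.
type slotIdx int32

type registry struct{ slots []string }

func (r *registry) get(i slotIdx) string { return r.slots[i] }

func main() {
	r := registry{slots: []string{"s0", "s1", "s2"}}
	var j int32 = 1
	_ = j
	// r.get(j) would not compile: int32 is not slotIdx.
	fmt.Println(r.get(slotIdx(2)))
}
```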

setting: prevent use of SystemOnly settings from tenant code

This change implements the `system-only` semantics in the RFC
(#73349).

All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.

Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.

build: fix Pebble nightly benchmarks

The Pebble nightly benchmarks stopped running due to a build error
stemming from the removal of the generated code from the codebase.

Release note: None

scpb: move Node to screl, introduce scpb.TargetState

This refactoring commit moves scpb.Node to screl and groups the targets
into a new scpb.TargetState. Separating the quasi-constant target
elements and statuses from the elements' intermediate statuses
ultimately makes for cleaner code in the declarative schema changer.

Release note: None

scplan: make child packages internal, including scgraph and scgraphviz

This commit moves scgraph and scgraphviz under scplan, and makes all its
child packages internal.

Release note: None

authors: add gtr to authors

Release note: None

sql: format statements for the UI

Previously, statements sent to the UI were not formatted, making them difficult
to read. With this change, statements are now formatted using a builtin
function that prettifies statements (using existing pretty-printing logic).
Statements are formatted using pre-determined pretty-printing configuration
options mentioned in this issue: #71544.

Resolves: #71544

Release note (sql change): statements are now formatted prior to being sent to
the UI; this is done using a new builtin function that formats statements.

ui: added formatting to statements on the details pages

Previously, statements displayed on the statement/transaction/index details
pages were not formatted.  Formatting was added to allow for better readability
of statements on these detail pages.  Requested statements are now formatted on
the server using the existing pretty-printing logic. Statements returned from
the statement handler are now formatted.

Formatting is done via a new builtin function 'prettify_statement', a wrapper
function over the existing pretty-printing logic.

Resolves: #71544

Release note (ui change): added formatting to statements on the statement,
transaction and index details pages.

roachpb: InternalServer API for tenant settings

This commit introduces the API that the tenants will use to obtain and
listen for updates to tenant setting overrides. The API was designed to
allow for maximum flexibility on the server side so that it can be
implemented as a range feed without any extra state (if necessary).

Release note: None

schemachanger: fully qualify object names inside event log entries

Previously, the original SQL was displayed inside the declarative schema
changer's event logs, which was missing both redactions and full name resolution.
This was inadequate because the existing schema changer always fully resolved
missing names within its entries. To address this, this patch adds new metadata
that contains the fully resolved and redacted text.

Release note: None

schemachanger: fully resolve names inside the AST for event logs and errors

Previously, the declarative schema changer left the AST
untouched when resolving names. This was inadequate because
both event log entry generation and some generated error messages
expect the fully resolved names inside the AST. To address this,
this patch adds support for copying and altering the AST, including
support for adding annotations. Additionally, all resolved names
are now propagated back into the AST and extra validation is introduced
to make sure that no unresolved names are left.

Release note: None

schemachanger: add statement tag to event log entries

Previously, the tag field inside the event log entries
generated by the declarative schema changer was empty.
This was inadequate because the existing schema changer
always populated this field. To address this, this
patch now populates the statement tag field inside
the element metadata and in event log entries.

Release note: None

rfc: token-based authentication for SQL session revival

Release note: None

vendor: upgrade cockroachdb/apd to v3

This commit picks up the following changes to `cockroachdb/apd`:
- https://github.com/cockroachdb/apd/pull/103
- https://github.com/cockroachdb/apd/pull/104
- https://github.com/cockroachdb/apd/pull/107
- https://github.com/cockroachdb/apd/pull/108
- https://github.com/cockroachdb/apd/pull/109
- https://github.com/cockroachdb/apd/pull/110
- https://github.com/cockroachdb/apd/pull/111

Release note (performance improvement): The memory representation of
DECIMAL datums has been optimized to save space, avoid heap allocations,
and eliminate indirection. This increases the speed of DECIMAL arithmetic
and aggregation by up to 20% on large data sets.

sql/sem: remove EvalContext.getTmpDec

`apd.Decimal` can now be entirely stack allocated during arithmetic, so
there's no longer any need for this.

With https://github.com/cockroachdb/apd/pull/104, this does not introduce
any new heap allocations:
```
➜ (cd pkg/sql/sem/tree && goescape . | grep moved | wc -l)
     328
```

scripts: add sgrep

@aliher1911 showed me this handy awk trick to filter goroutines [here].

I find myself reaching for this frequently, but it always takes me a
minute or two to get it right. This helper will make it a lot easier
and will perhaps enable many others to reap the benefits as well.

[here]: https://github.com/cockroachdb/cockroach/pull/66761#pullrequestreview-691529838

Release note: None

sqlproxyccl: better test output

This moves the logging output to files and enables inspection of the
authentication results in logs crdb-side.

Release note: None

storage/metamorphic: update clearRangeOp to not conflict with in-flight writes

This change updates clearRangeOp to only run on key spans
that do not have any transactional writes in-progress. This more
closely matches behaviour in KV and avoids cases where we'd trample on
intents and other in-flight keys. Also makes a similar adjustment
to mvccClearTimeRangeOp.

Fixes #72476. Speculatively. Reproduction was nearly impossible.

Release note: None.

Revert "roachpb: change `Lease.String()`/`SafeFormat()` to support refs."

This reverts commit 33fe9d15b2660cc40d255a3d448d88bb35c2c39d.

storage,roachpb: Add Key field to WriteTooOldError

Currently, WriteTooOldErrors that happen as part of a
ranged put operation (e.g. MVCCClearTimeRange) can be very
opaque, as we don't know which key we tried to "write too old"
on. This change addresses that by adding a Key field to WriteTooOldError
to store one of the keys the error pertains to.

Informs #72476.

Release note: None.
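
The shape of the change is a contextual field on the error type whose message includes the key when it is set. A simplified sketch (the real roachpb.WriteTooOldError is a protobuf with timestamps and a different message format):

```go
package main

import "fmt"

// writeTooOldError is a simplified stand-in for roachpb.WriteTooOldError.
// The commit adds a Key field so that ranged operations can report which
// key the error pertains to (the message format here is illustrative).
type writeTooOldError struct {
	Key []byte
}

func (e *writeTooOldError) Error() string {
	if len(e.Key) > 0 {
		return fmt.Sprintf("WriteTooOldError on key %q", e.Key)
	}
	return "WriteTooOldError"
}

func main() {
	err := &writeTooOldError{Key: []byte("a")}
	fmt.Println(err.Error())
}
```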

dev: enumerate explicit list of all generated `.go` files when hoisting

We should be able to `find _bazel/bin/go_path -type f -name '*.go'` and
list the generated files that way, but bazelbuild/rules_go#3041 is in
the way and I can't figure out how to resolve it. Instead we have to do
things the hard way: run `bazel aquery` and parse out the paths to all
the generated files. In this way we avoid accidentally hoisting out
stale symlinks from a previous build. If bazelbuild/rules_go#3041 is
resolved upstream then this change can be reverted.

Release note: None

sql: public schema long running migration

Release note: None

sql: insert missing public schema namespace entry

When restoring a database, a namespace entry for the public
schema was not created.

Release note: None

authors: add msirek to authors

Release note: None

sql: Add hints to CREATE/ALTER table errors for MR

Add some hints when trying to create (or alter to) a multi-region table which
indicate that if the database is not multi-region enabled, it should be
converted to a multi-region enabled database via a "ALTER DATABASE ... SET
PRIMARY REGION <region>" statement.

Release note: None

catalog: add Index*Columns methods to table descriptor

These methods return the columns referenced in indexes as
[]catalog.Column slices.

Release note: None

catalog: add Index*ColumnDirections methods to table descriptor

This complements the previous commit which added Index*Column methods to
catalog.TableDescriptor.

Release note: None

catalog: remove FullIndexColumnIDs

Calls to this function can now be replaced with recently-added
IndexFullColumns and IndexFullColumnDirections method calls on the table
descriptor.

Release note: None

tabledesc: improve memory-efficiency of Index*Column* methods

This commit removes IndexCompositeColumns (which wasn't actually used
and is unlikely to ever be) and generates the indexColumnCache with fewer
memory allocations. The previous scheme caused
a statistically-significant performance regression.

Release note: None

sql: allow the 1PC optimization in some cases in the extended protocol

A previous commit (7e2cbf51869fc326974a5665db80da8b29422631) fixed our
pgwire implementation so that it does not auto-commit a statement
executed in the extended protocol until a Sync message is received. That
change also had the undesired effect of disabling the 1PC ("insert fast
path") optimization that is critical for write-heavy workloads.

With this current commit, the 1PC optimization is allowed again, as long
as the statement execution is immediately followed by a Sync message.
This still has the correct bugfix semantics, but allows the optimization
for the common case of how the extended protocol is used.

No release note since this only affects unreleased versions.

Release note: None
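
The rule can be summarized as: auto-commit (the 1PC fast path) is permitted only when the statement's execution is immediately followed by a Sync. A toy sketch of that decision, with message names from the Postgres wire protocol and the logic heavily simplified:

```go
package main

import "fmt"

// canAutoCommit sketches the rule from this commit: in the extended
// protocol, a statement may take the one-phase-commit fast path only if
// the next buffered client message is Sync; otherwise the commit must be
// deferred until Sync arrives.
func canAutoCommit(nextMsg string) bool {
	return nextMsg == "Sync"
}

func main() {
	fmt.Println(canAutoCommit("Sync")) // fast path allowed
	fmt.Println(canAutoCommit("Bind")) // must wait for Sync
}
```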

execgen: remove temporary decimals from the helper

This commit removes two temporary decimals from `execgen.OverloadHelper`
since after the upgrade of the decimal library it can use
stack-allocated temporary objects without them escaping to the heap.

Release note: None

colexec: batch allocate datums for projectTupleOp

This has a profound impact on the amount of garbage generated by the delivery
query in TPC-C.

```
name                    old time/op    new time/op    delta
TPCC/mix=delivery=1-16    38.0ms ± 2%    35.8ms ± 1%   -5.76%  (p=0.000 n=9+8)

name                    old alloc/op   new alloc/op   delta
TPCC/mix=delivery=1-16    8.17MB ± 1%    7.97MB ± 1%   -2.36%  (p=0.000 n=9+10)

name                    old allocs/op  new allocs/op  delta
TPCC/mix=delivery=1-16     80.2k ± 0%     20.3k ± 1%  -74.65%  (p=0.000 n=10+9)
```

Leveraged https://github.com/cockroachdb/cockroach/pull/74443 to find this.

Release note: None

kvserver/loqrecovery: check full key coverage in quorum recovery

Previously, when doing unsafe replica recovery, if some ranges were
missing or represented by stale replicas that were split or merged,
recovery would change the cluster to an inconsistent state with gaps or
overlaps in the keyspace.
This change adds checks for range completeness and adds a
preference for replicas with a higher range applied index.

Release note: None

sql: fix enum hydration in distsql expression evaluation

Fixes: #74442

Previously, in some circumstances we could fail to hydrate enum types
used in join predicate expressions and possibly other situations. Now
types used in ExprHelper are always hydrated during the Init phase when a
distsql type resolver is being used. Also add a test case for the lookup
semi join repro case.

Release note (bug fix): Fix panics possible in some distributed queries
using enums in join predicates.

backupccl: add libgeos to BUILD.bazel

This is required when using randgen so that libgeos can be initialized
when a geometric type is generated.

Fixes #73895

Release note: None

sql: refactor pg_builtin to use actual grant options

Refactor builtins.priv -> privilege.Privilege.

Replace privilege.Kind with privilege.Privilege in functions that need
access to privilege.Privilege.GrantOption.

Release note: None

sql: refactor pg_builtin to use actual grant options

The builtins has_table_privilege, has_column_privilege,
has_any_column_privilege now use privileges.Priv.GrantOption instead
of privileges.Kind.GRANT.

Release note: None

dev: introduce common benchmarking flags

Specifically:
  --bench-mem (to print out memory allocations);
  --bench-time (to control how long each benchmark is run for);
  --count (how many times to run it, also added to `dev test`);
  -v and --show-logs (similar to `dev test`)

We also add support for args-after-double-dash-are-for-bazel within
`dev bench`. This commit is light on testing (read: there isn't any), so
it doesn't bump DEV_VERSION to roll it out to everyone just yet.

Release note: None

execgen: skip encoding/decoding JSON when hashing it

Previously, in order to hash a JSON object we would encode and decode it
(due to the behavior of `coldata.JSONs.Get` and the setup of the
execgen). This commit refactors how the hash functions are handled in
the execgen which allows us to peek inside of `coldata.Bytes` (that is
under the `coldata.JSONs`) in order to get direct access to the
underlying `[]byte`. This should be a minor performance improvement but
also allows us to remove the need for the overload helper in the hashing
context.

It is possible (although I'm not certain) that the hash computation is
now different, so this commit bumps the DistSQL versions to be safe.

Release note: None

colexechash: fix BCEs in some cases

Previously, we were not getting (and not asserting) some of the bounds
check eliminations in `rehash` function because we were accessing
`.Sliceable` property in the wrong context. This is now fixed (by
accessing it via `.Global`) as well as assertions are added correctly.
Additionally, this commit pulls out the call to the cancel checker out
of the templated block to reduce the amount of code duplication.

Release note: None
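
Bounds-check elimination in Go depends on the compiler proving an index is in range. A common pattern, roughly what templated code like this relies on, is a single up-front slice access that covers the whole loop (the function here is a generic illustration, not the `rehash` code):

```go
package main

import "fmt"

// sumWithHint sums a slice after a single up-front bounds check. The
// `_ = b[n-1]` access lets the compiler prove that indexes 0..n-1 are in
// range, eliminating per-iteration bounds checks in the loop body.
func sumWithHint(b []int) int {
	n := len(b)
	if n == 0 {
		return 0
	}
	_ = b[n-1] // bounds-check elimination hint
	s := 0
	for i := 0; i < n; i++ {
		s += b[i]
	}
	return s
}

func main() {
	fmt.Println(sumWithHint([]int{1, 2, 3}))
}
```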

execinfra: prefer leased table descriptors

This commit makes better use of the table descriptor protos in the
DistSQL specs by first checking if they've already been leased. If so,
then we use those instead of re-generating catalog.TableDescriptor
instances.

This has a statistically-significant impact on memory allocations, as
illustrated with this microbenchmark which I ran on my development
machine:

    name                    old time/op    new time/op    delta
    KV/Scan/SQL/rows=1-16      190µs ± 7%     184µs ± 4%   -3.60%  (p=0.016 n=10+8)
    KV/Scan/SQL/rows=10-16     196µs ± 4%     198µs ±12%     ~     (p=0.762 n=8+10)

    name                    old alloc/op   new alloc/op   delta
    KV/Scan/SQL/rows=1-16     19.5kB ± 1%    17.4kB ± 1%  -11.12%  (p=0.000 n=9+10)
    KV/Scan/SQL/rows=10-16    21.0kB ± 1%    18.9kB ± 1%  -10.20%  (p=0.000 n=10+10)

    name                    old allocs/op  new allocs/op  delta
    KV/Scan/SQL/rows=1-16        222 ± 0%       210 ± 1%   -5.59%  (p=0.000 n=7+10)
    KV/Scan/SQL/rows=10-16       256 ± 0%       244 ± 0%   -4.84%  (p=0.000 n=10+7)

This change opens us to the possibility of no longer shipping the whole
table descriptor proto in the DistSQL spec.

Release note: None

sql: add missing error check in GetHydratedZoneConfigForNamedZone

Closes #74606

Release note: None

sql: rename pg_cast_provolatile_dump.csv to pg_cast_dump.csv

Release note: None

sql: remove castInfo

`castMap` now contains all volatility information and `castInfo` is no
longer needed. `castInfo` has been removed.

For backwards compatibility, some invalid casts between VARBIT and
integer types have been added to `castMap`. These casts have unexpected
behavior before and after this commit. They are not supported by
Postgres. They should be disallowed in the future.
See #74577 and #74580.

Casts between tuple and string types are now dynamically handled in
`lookupCast` to support casts with named record types that have different
OIDs than `oid.T_record`.

Casts from OID and REG* types to INT2 are now allowed to maintain
backward compatibility.

Release note: None

sql: update pg_cast_dump.csv

This commit updates the `pg_cast_dump.csv` file with new rows for casts
from the JSONB type (OID 3802). It also adds the Postgres version as a
column to annotate the version of Postgres that the CSV was generated
from. This is important because the `pg_cast` table can change from
version to version. This CSV was generated from Postgres version 13.5.

Release note: None

sql: consistently order pg_cast_dump.csv

Release note: None

sql: check cast contexts against Postgres's pg_cast table

This commit includes the `pg_cast.castcontext` column in
`pg_cast_dump.csv` and uses it to validate the `maxContext` field in
`castMap`. Note that Postgres's `pg_cast` table does not include all
possible casts, such as automatic I/O conversions, so this test is not
comprehensive.

Release note: None

ci: bazelize nightly pebble ycsb, write-throughput benchmarks

We still have to take care of the metamorphic nightly.

Part of #67335.

Release note: None

authors: add bardin to authors

Release note: None

opt: fix corner case with lookup joins and provided orderings

This commit fixes a case where the lookup join was passing through a
provided ordering with unnecessary columns. This was caused by
imperfect FDs at the join level, such that the ordering could not be
simplified at the join level but could be simplified at the level of
its input.

Note that the case causes an internal error in test builds but there
are no known examples of user-visible problems in non-test builds
(hence no release note).

Fixes #73968.

Release note: None

rpc: fix span use-after-Finish in internalClientAdapter

The internalClientAdapter performs some local "RPCs" by directly calling
the server method, without going through gRPC. The streaming RPC calls
are made on a different goroutine, and these goroutines were using the
caller's tracing span. These goroutines could outlive the caller's span,
resulting in a use-after-Finish crash. This patch fixes them by creating
dedicated RPC spans, mimicking what our gRPC interceptors do.

Fixes #74326

Release note: None

vendor: Bump pebble to 3d0ff924d13a3d5fdf6e56a391c5c178c18ff196

Changes pulled in:

```
3d0ff924d13a3d5fdf6e56a391c5c178c18ff196 *: Add trySeekUsingNext optimization to SeekGE
0c503048eb0365981929177c30178add8a56ae3e sstable: add (*sstable.Writer).RangeKey{Set,Unset,Delete} methods
fe52b49cc28df62dce9b00c382a5ce217936be56 tool/logs: aggregate compaction logs by node and store ID
8ab4358bc59dfa62e5e34e4b0e5ce81a68f5fe91 sstable: return err from `(BlockPropertyCollector).FinishTable`
91c18ef0ee999980c2869d11e5ce468410acbe8d internal/keyspan: add FragmentIterator interface
953fdb078ff0585489206ae96e1d80ca9f6f90c7 internal/keyspan: implement SetBounds on Iter
aa376a819bf67cd6766ee827feed4bf0bd508f1f tool: add compaction log event aggregation tool
```

Release note: None.

storage: Use optimized SeekGE in CheckSSTConflicts

This nearly reverts #73514 by moving back to calling SeekGE on
the engine to skip past any empty spans on either the engine
or the SSTable. This is the more optimal approach on average,
and given optimizations in cockroachdb/pebble#1412 which this
change depends on, it also ends up performing better than
a SeekPrefixGE-driven approach and the pre-#73514 approach.

Improvement when running BenchmarkCheckSSTConflicts
against the pre-#73514 revision (vs. this one):

```
name                                                                      old time/op  new time/op  delta
CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=false-24      72.6µs ±11%  66.3µs ± 1%   -8.67%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=true-24       12.2ms ± 1%   1.7ms ± 1%  -86.41%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=false-24     69.8µs ± 2%  67.4µs ± 1%   -3.48%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=true-24      13.3ms ± 3%   2.8ms ± 1%  -78.97%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=false-24     75.8µs ± 3%  63.8µs ± 1%  -15.86%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=true-24      13.0ms ± 1%   1.9ms ± 1%  -85.11%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=false-24    69.8µs ±11%  64.6µs ± 1%   -7.45%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=true-24     14.8ms ± 9%   3.1ms ± 2%  -79.05%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=false-24     66.1µs ± 2%  63.7µs ± 1%   -3.65%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=true-24      14.2ms ± 9%   1.9ms ± 1%  -86.55%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=false-24    72.3µs ±10%  64.5µs ± 0%  -10.77%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=true-24      122ms ± 2%    17ms ± 1%  -86.03%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=false-24    69.0µs ± 9%  62.4µs ± 1%   -9.57%  (p=0.032 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=true-24     14.0ms ± 1%   2.3ms ± 2%  -83.46%  (p=0.016 n=4+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=false-24   69.4µs ± 9%  62.7µs ± 1%   -9.63%  (p=0.016 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=true-24     140ms ± 5%    26ms ± 1%  -81.70%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=false-24    69.2µs ±10%  62.5µs ± 1%   -9.66%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=true-24     15.3ms ±11%   2.3ms ± 1%  -85.21%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=false-24   69.7µs ±12%  63.6µs ± 1%     ~     (p=0.095 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=true-24     148ms ± 6%    28ms ± 2%  -80.90%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=false-24   67.1µs ±10%  61.1µs ± 2%   -8.93%  (p=0.016 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=true-24    14.4ms ± 2%   2.5ms ± 5%  -82.45%  (p=0.016 n=4+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=false-24  68.9µs ±21%  62.2µs ± 1%   -9.76%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=true-24    204ms ±14%    42ms ± 5%  -79.44%  (p=0.008 n=5+5)
```

Fixes #66410.

Release note: None.

execinfrapb: add a helper for index joins based on the JoinReaderSpec

Release note: None

rowexec: refactor the joinReader to not exceed the batch size

The joinReader operates by buffering the input rows until a certain size
limit (which is dependent on the strategy) is reached. Previously, the
buffering would stop only after the size limit was reached or exceeded;
this commit refactors the code to not exceed the limit except in the case
of a single large row. This is what we already do for vectorized index
joins.

Release note: None
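The buffering policy described above can be sketched in a few lines. This is a simplified model (rows are represented only by their byte sizes; `bufferInput` is a hypothetical name, not the actual joinReader code):

```go
package main

import "fmt"

// bufferInput admits rows while staying within limit, stopping *before*
// the limit would be exceeded -- except that a single row larger than the
// whole limit is still admitted on its own so progress is always made.
func bufferInput(rowSizes []int64, limit int64) []int64 {
	var buffered []int64
	var used int64
	for _, size := range rowSizes {
		if used+size > limit {
			if len(buffered) == 0 {
				// Degenerate case: one row exceeds the limit by itself.
				return append(buffered, size)
			}
			break
		}
		used += size
		buffered = append(buffered, size)
	}
	return buffered
}

func main() {
	fmt.Println(bufferInput([]int64{4, 4, 4}, 10)) // [4 4]: third row would exceed the limit
	fmt.Println(bufferInput([]int64{20, 1}, 10))   // [20]: single large row admitted alone
}
```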

sql,kv: introduce Streamer API and use it for index joins in some cases

This commit introduces the Streamer API (see
https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210617_index_lookups_memory_limits.md)
as well as its implementation for the simplest case - when requests are
unique and can be served in any order. It additionally hooks up the
implementation to be used by the index joins in both execution engines.

There are three main pieces that this commit adds:
1. the `Streamer` struct itself. It is largely the same as described in
the RFC. Some notable changes are:
- `Cancel` is renamed to `Close` and is made blocking to ensure that all
goroutines kicked off by the `Streamer` exit before `Close` returns.
- `Shrink` has been removed from the `budget` struct (see below for
more details).
- furthermore, the `budget` interface has been unexported and the
`streamer` has been tightly coupled with the `budget`'s implementation.
- the TODO about collecting DistSQL metadata is removed because we are
collecting the LeafTxnFinalState already when using the LeafTxn.
2. the limited `Streamer` implementation - only `OutOfOrder` mode is
supported when the requests are unique. Notably, buffering and caching
of the results is not implemented yet.
3. `TxnKVStreamer` component that sits below the SQL fetchers, uses the
`Streamer`, and is an adapter from `BatchResponse`s to key/value pairs
that fetchers understand. Although not used at the moment,
`TxnKVStreamer` is written under the assumption that a single result can
satisfy multiple requests.

The memory budget of the `Streamer` is utilized lazily. The RFC was
calling for making a large reservation upfront and then shrinking the
budget if we see that we don't need that large reservation; however,
while working on this I realized that lazy reservations are a better fit
for this. The `Streamer` can reserve up to the specified limit
(determined by `distsql_workmem` variable) against the root monitor (in
the degenerate case of a single large row more memory will be reserved).
The reservation then never shrinks under the assumption that if the
reservation has gotten large, it means it was needed for higher
concurrency (or large responses), and it is likely to be needed for the
same reasons in the future.

The layout of the main components of the `Streamer` implementation:
- in `Enqueue` we have a logic similar to what DistSender does in order
to split each request (that might span multiple ranges) into
single-range requests. Those sub-requests are batched together to be
evaluated by a single `BatchRequest`.
- `workerCoordinator.mainLoop` is responsible for taking single-range
batches, estimating the corresponding response size, and issuing
requests to be evaluated in parallel while adhering to the provided
memory budget.
- `workerCoordinator.performRequestAsync` is responsible for populating
the `BatchRequest` and then processing the results while updating the
memory budget.

Current known limitations that will be addressed in the follow-up work:
- at the moment a single row can be split across multiple BatchResponses
when TargetBytes limit is reached when the table has multiple column
families; therefore, we use the streamer only for single column family
cases. We will expand the KV API shortly to not split the rows in
multiple column family cases.
- manual refresh of spans when `ReadWithinUncertaintyIntervalError` is
encountered by a single streamer in a single flow is not implemented. It
is an optimization that is considered a must for the final
implementation in order to not regress on simple cases in terms of
retriable errors. This will be implemented shortly as a follow-up.
- I'm thinking that eventually we probably want to disable the batch
splitting done by the DistSender to eliminate unnecessary blocking when
the streamer's splitting was incorrect. This would give us some
performance improvements in face of range boundary changes, but it
doesn't seem important enough for the initial implementation.

Release note: None
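The lazy-reservation behavior of the budget described above can be modeled compactly. This is a hypothetical simplification, not the kvstreamer `budget` type: it grows its reservation on demand up to the limit and never shrinks it, on the theory that a high-water mark reached once is likely to be needed again:

```go
package main

import (
	"errors"
	"fmt"
)

// budgetSketch models lazy reservations: consume() grows the high-water
// reservation as needed, release() frees usage but leaves the reservation
// at its peak (there is no Shrink, matching the commit's design change).
type budgetSketch struct {
	limit    int64
	reserved int64 // high-water reservation; never decreases
	inUse    int64
}

func (b *budgetSketch) consume(n int64) error {
	if b.inUse+n > b.limit {
		return errors.New("budget exceeded")
	}
	b.inUse += n
	if b.inUse > b.reserved {
		b.reserved = b.inUse // grow reservation lazily
	}
	return nil
}

func (b *budgetSketch) release(n int64) {
	b.inUse -= n // reservation stays at its high-water mark
}

func main() {
	b := &budgetSketch{limit: 100}
	_ = b.consume(60)
	b.release(60)
	fmt.Println(b.reserved, b.inUse) // 60 0: the reservation survives release
	fmt.Println(b.consume(200))      // budget exceeded
}
```

In the real implementation the degenerate single-large-row case can push the reservation past the `distsql_workmem` limit; that wrinkle is omitted here.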

sql: check equivalent constraint when creating hash index

Fixes #68031

Previously, we only tried to create the constraint for a shard column if
the column was newly created. We check for a duplicate constraint on the
shard column during `ALTER PRIMARY KEY` and `CREATE INDEX`, but the
check was simply a name check. This PR adds logic to check for an
equivalent constraint by checking whether the formatted expression
strings are the same. With this logic we can try to create the
constraint whether or not the shard column is newly created.
With this fix, we also don't need to expose the constraint through the
`SHOW CREATE TABLE` result, since we make sure the constraint is
created, or skipped if one already exists.
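The equivalence check described above compares formatted expression strings rather than constraint names. A minimal sketch of that idea (the real code formats parsed SQL expressions; the trivial whitespace/case normalization and the function names here are illustrative only):

```go
package main

import (
	"fmt"
	"strings"
)

// normalize stands in for formatting a parsed expression: it produces a
// canonical string so that textually different but equivalent expressions
// compare equal.
func normalize(expr string) string {
	return strings.Join(strings.Fields(strings.ToLower(expr)), " ")
}

// constraintExists reports whether an equivalent check constraint already
// exists, by comparing formatted expression strings instead of names.
func constraintExists(existing []string, newExpr string) bool {
	want := normalize(newExpr)
	for _, e := range existing {
		if normalize(e) == want {
			return true
		}
	}
	return false
}

func main() {
	existing := []string{"crdb_internal_id_shard_8 IN (0,1,2,3,4,5,6,7)"}
	fmt.Println(constraintExists(existing, "crdb_internal_id_shard_8 in (0,1,2,3,4,5,6,7)")) // true
	fmt.Println(constraintExists(existing, "crdb_internal_id_shard_16 IN (0,1)"))            // false
}
```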

Release note (sql change): Before this change, the check constraint
on the shard column used by a hash sharded index was printed in the
corresponding `SHOW CREATE TABLE`. The constraint had been shown
because cockroach lacked logic to ensure that shard columns which
are part of hash sharded indexes always have the check constraint
that the optimizer relies on to achieve properly optimized plans
on hash sharded indexes. We no longer display this constraint in
`SHOW CREATE TABLE`, as it is now implied by the `USING HASH` clause
on the relevant index.

colexec: clean up the usage of the binary overload helper

Previously, for projection and selection operators we would always pass
in `execgen.OverloadHelper`. Up until recently it served several
purposes, but now it is only used in order to pass down the binary
function (as well as the eval context) for the cases when we fallback
to the row-by-row computation.

This commit takes advantage of this observation and cleans up the
situation: now the helper is only passed when it is needed which allows
us to remove a lot of redundant code. Additionally, the helper itself
has been renamed from `OverloadHelper` to `BinaryOverloadHelper`.

Release note: None

build: Add PATH to .bazelrc for dev builds.

Release note: none

opt: intern provided physical properties

This commit changes the provided physical properties to be referenced
with a pointer inside bestProps, in preparation for adding more fields
to physical.Provided.

Release note: None

opt: add Distribution physical property and Distribute enforcer

This commit adds a new physical property called "Distribution" to
expressions in the optimizer. Currently, Distribution is just the
list of regions that are touched by the expression.

This commit also adds a new enforcer called "Distribute", which enforces
a particular Distribution on its input. This is similar to the way
Sort enforces a particular Ordering. Currently, Distribute is only
used to enforce that the results of a query end up in the same region
as the gateway node.

The Distribution property and Distribute enforcer will enable us to
perform better costing of query plans in future commits. This commit
adds a tiny cost for the Distribute enforcer, but otherwise does not
yet take advantage of distribution for costing. In the future, we can use
Distribution to better cost distributed joins and aggregations, and
accurately cost the Distribute enforcer so as to avoid scanning data
in remote regions if possible.

This commit represents a departure from the approach taken in the earlier
prototype in #43831. Instead of using a "neighborhood" abstraction to
represent arbitrary collections of nodes, this commit sticks to the
existing abstractions of region and locality. If we need a new abstraction
in the future, we should be able to simply modify the representation of
Distribution.

Additionally, this commit does not make an attempt to represent exactly
how data is partitioned across regions (e.g., which column is the
partitioning key, whether the data is hash or range partitioned, etc.);
it only tracks *which* regions contain data. This greatly simplifies the
implementation, and I don't think it will significantly reduce the
accuracy of costing.

Informs #47226

Release note: None
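The Distribution property described above is "just the list of regions" an expression touches, and the Distribute enforcer kicks in when a child's provided distribution differs from the required one (e.g. the gateway's region). A minimal sketch under that reading -- these are not the actual optimizer types:

```go
package main

import (
	"fmt"
	"sort"
)

// distribution models the new physical property: a sorted set of region
// names touched by an expression.
type distribution []string

// union combines the distributions of two inputs (e.g. for a join).
func union(a, b distribution) distribution {
	seen := map[string]struct{}{}
	for _, r := range append(append(distribution{}, a...), b...) {
		seen[r] = struct{}{}
	}
	out := make(distribution, 0, len(seen))
	for r := range seen {
		out = append(out, r)
	}
	sort.Strings(out)
	return out
}

// needsDistribute reports whether a Distribute enforcer is required to
// coerce the provided distribution into the required one, analogous to how
// a Sort enforcer coerces an ordering.
func needsDistribute(provided, required distribution) bool {
	if len(provided) != len(required) {
		return true
	}
	for i := range provided {
		if provided[i] != required[i] {
			return true
		}
	}
	return false
}

func main() {
	join := union(distribution{"us-east1"}, distribution{"europe-west1"})
	fmt.Println(join)                                            // [europe-west1 us-east1]
	fmt.Println(needsDistribute(join, distribution{"us-east1"})) // true: results must move to the gateway region
}
```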

cli: add mt cert create-tenant-signing command

This key/cert will be used to generate session revival tokens.

No release note since this is only intended to be used internally.

Release note: None

sql/sem/catid: make a low-level package which defines ID types

I'm not sure where exactly in the tree this belongs.

Release note: None

sql/schemachanger/scpb: adopt catid

Release note: None

sql/catalog/catpb: add package below scpb and descpb

This doesn't go all the way to being principled about what is in catpb
and what is in descpb, but it pulls the relevant symbols referenced in
scpb down into the new package.

Release note: None

kvserver: fix bug causing spurious rebalance attempts by the store rebalancer

When rebalancing a range, the store rebalancer also tries to move the lease to
the voter with the lowest QPS. However, previously, the store rebalancer was
doing this even when it found no better replacement targets for the existing
stores. This meant that we were redundantly transferring leases over to the
coldest voter, even when we weren't rebalancing the range. This was likely
contributing to the thrashing observed in
https://github.com/cockroachdb/cockroach/issues/69817.

Release note (bug fix): A bug that could previously cause redundant lease
transfers has now been fixed.

schemachanger: delete comments when dropping schemas

Previously, the declarative schema changer did not clean up
comments for schema objects when dropping them. This was
inadequate because the system.comment table would be left
with rows for dropped schema objects when using
the declarative schema changer, unlike the legacy schema
changer. To address this, this patch removes rows
from the system.comment table for schema objects that
are dropped.

Release note: None

sql: adopt CommentUpdater from declarative schema changer.

Previously, there was duplicate code for adding/deleting comments
on schema objects inside the legacy schema changer, so each
comment type had similar code for setting up
upsert/delete statements. This patch adopts the CommentUpdater
interface in the legacy schema changer so that the logic to
update/delete comments can be shared with the declarative
schema changer, and to simplify code in the legacy schema
changer.

Release note: None

kvserver: don't use `ClearRange` point deletes with estimated MVCC stats

`ClearRange` avoids dropping a Pebble range tombstone if the amount of
data that's deleted is small (<=512 KB), instead dropping point
deletions. It uses MVCC statistics to determine this. However, when
clearing an entire range, it will rely on the existing range MVCC stats
rather than computing them.

These range statistics can be highly inaccurate -- in some cases so
inaccurate that they even become negative. This in turn can cause
`ClearRange` to submit a huge write batch, which gets rejected by Raft
with `command too large`.

This patch avoids dropping point deletes if the statistics are estimated
(which is only the case when clearing an entire range). Alternatively,
it could do a full stats recomputation in this case, but entire range
deletions seem likely to be large and/or rare enough that dropping a
range tombstone is fine.

Release note (bug fix): Fixed a bug where deleting data via schema
changes (e.g. when dropping an index or table) could fail with a
"command too large" error.

backup: add memory monitor to manifest loading

Release note: none.

sql: add assignment casts for UPSERTs

Assignment casts are now added to query plans for upserts, including
`UPSERT`, `INSERT .. ON CONFLICT DO NOTHING`, and
`INSERT .. ON CONFLICT DO UPDATE ..` statements.

Assignment casts are a more general form of the logic for rounding
decimal values, so the use of `round_decimal_values` in mutations is no
longer needed. This logic has been removed.

Fixes #67083

There is no release note because the behavior of upserts should not
change with this commit.

Release note: None

sql: add logic tests for assignment casts of ON UPDATE expressions

Release note: None

sql: schema changer not to validate shard column constraint

Fixes #67613
The shard column constraint is created internally and should
be automatically upheld, so there is no need to verify it when
backfilling hash sharded indexes.

Release note: None

schemachanger: fix error messages for drop dependencies.

1) When a view depends on another view the message text generated was
   incorrect.
2) When a sequence is OWNED BY a table, it could be dropped even
   if there were other dependencies.
3) When RESTRICT was not specified DROP DATABASE would generate the wrong
   error.
4) If the database name is empty we would generate the wrong error.

Release note: None

sql: modify message for drop schema privilege errors

Previously, the legacy schema changer used a less clear message
when a DROP SCHEMA failed due to a privilege error. This
patch improves the generated message by using the
message from the declarative schema changer.
The new message has the form "must be owner
of schema <schema name>".

Release note: None

schemachanger: avoid leased descriptors for event logging

Previously, when fully resolving descriptor names, we would
fetch leased descriptors which could cause the schema change
transaction to get pushed out. This was inadequate because in
certain scenarios we would perpetually retry transactions,
since attempts to generate event log entries acquire leases
pushing out the schema change transaction. To address this,
this patch intentionally avoids leasing descriptors for event logging.

Release note: None

build: put regular file at bazel-out by default

Release note: none.

bulk,backup,import: add Write-at-now settings for RESTORE and IMPORT

This adds a new argument to BulkAdderOptions and MakeSSTBatcher to control
the WriteAtBatchTS field in the AddSSTableRequests it sends.

This flag is then optionally set in RESTORE and IMPORT based on the
new hidden cluster settings 'bulkio.restore_at_current_time.enabled' and
'bulkio.import_at_current_time.enabled' respectively.

These settings default to off for now, for testing usage only, as setting
this flag is known to _significantly_ reduce the performance of those
jobs until further work is done.

Release note: none.

sql/row: remove an allocation for Get responses

```
name                    old time/op    new time/op    delta
IndexJoin/Cockroach-24    8.02ms ± 9%    7.92ms ± 2%    ~     (p=1.000 n=9+10)

name                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24    1.68MB ± 1%    1.62MB ± 1%  -3.83%  (p=0.000 n=10+9)

name                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24     11.1k ± 1%     10.1k ± 0%  -8.85%  (p=0.000 n=9+9)
```

Release note: None

kvstreamer: refactor memory tokens to reduce allocations

```
name                    old time/op    new time/op    delta
IndexJoin/Cockroach-24    7.72ms ± 7%    7.51ms ± 3%   -2.75%  (p=0.001 n=10+9)

name                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24    1.61MB ± 1%    1.60MB ± 1%   -0.91%  (p=0.001 n=10+9)

name                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24     10.1k ± 1%      9.1k ± 1%  -10.24%  (p=0.000 n=10+10)
```

Release note: None

spanconfig: drop 'experimental_' suffix from settings

We're going to start to use this infrastructure for realsies soon.

Release note: None

spanconfig/testcluster: plumb in testing knobs for tenants

We were previously using an empty one with no control at the caller to
override specific values. This comes in handy when looking to control
knobs for all tenants in these tests.

Release note: None

migrationsccl: disable reconciliation job creation in tests

In a future commit we'll enable the span configs infrastructure by
default for all crdb unit tests. Doing so will surface the need to have
the reconciliation job disabled for the specific migration tests that
assert on the contents of `system.span_configurations` (also written to
by the reconciliation job/reconciler).

Release note: None

spanconfig/sqltranslator: deflake datadriven test

This is a follow up to #73531, there we forgot to update package tests
to also use consistent reads when looking up descriptors. Looking at the
surrounding commentary in these datadriven files and comparing against
the actual results, there was a mismatch (now no longer so).

Release note: None

sql: future-proof TestTruncatePreservesSplitPoints

This test previously relied on range split decisions to happen
near-instantaneously (especially for the single node variant). This was
a fair assumption with the gossiped SystemConfigSpan, but is no longer
true with the span configs infrastructure where
(i)  updated descriptors/zone configs make their way to
     `system.span_configurations` asynchronously, and
(ii) KV learns about `system.span_configurations` updates
     asynchronously.

We update the test to be agnostic to either subsystem (tl;dr - throw in
a SucceedsSoon block at the right places).

Release note: None

sql: future-proof TestScatterResponse

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that no longer
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of just expecting it.

Release note: None

kvserver: future-proof TestElectionAfterRestart

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.

Release note: None

multiregionccl: future-proof TestEnsureLocalReadsOnGlobalTables

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.

Release note: None

server: future-proof TestAdminAPIDatabaseDetails

This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.

Release note: None

changefeedccl: future-proof TestChangefeedProtectedTimestamps

The low gc.ttlseconds in this test, which applies to
system.{descriptor,zones}, runs into errors introduced in #73086
when run with span configs enabled (in a future commit). The span configs
infrastructure makes use of rangefeeds against these tables within the
spanconfig.SQLWatcher process. These rangefeeds error out if the
timestamp they're started with is already GC-ed, something that's very
likely with low GC TTLs. To accommodate, we simply bump the TTL to a
more reasonable 100s.

Release note: None

kvfollowerreadsccl: future-proof TestBoundedStalenessDataDriven

This is Yet Another Test that made timing assumptions on how
instantaneously range config decisions are applied, assumptions that
don't hold under the span configs infrastructure. Adding compatibility
is a matter of waiting for the right configs to appear instead of only
expecting it.

Release note: None

changefeedccl: future-proof TestChangefeedBackfillCheckpoint

This testing knob was added in #68374, but I'm not sure that it was
necessary. Brief stress runs with and without this flag did not surface
anything. In a future commit where we enable span configs by default,
we'll actually rely on the reconciliation job running, so we get rid of
this flag now.

Release note: None

liveness: future-proof TestNodeLivenessStatusMap

Instead of using hooks that directly mutate the system config span,
using SQL statements to tweak zone configs future-proofs this test for
compatibility with the span configs infrastructure.

Release note: None

sql: migrate has_database_privilege from evalPrivilegeCheck to
ctx.Planner.HasPrivilege

refs https://github.com/cockroachdb/cockroach/issues/66173

HasPrivilege is required to support WITH GRANT OPTION

Release note: None

roachtest: fix sequelize nightly

Upstream changed how imports are done, so this library had to be
updated.

Release note: None

kvstreamer: fix potential deadlock

We have two different locks in the `Streamer` infrastructure: one is for
the `Streamer` itself and another one for the `budget`. Previously,
there was no contract about which mutex needs to be acquired first which
led to the deadlock detector thinking that there is a potential deadlock
situation. This is now fixed by requiring that the `budget`'s mutex is
acquired first and by releasing the `Streamer`'s mutex in `Enqueue`
early in order to not overlap with the interaction with the `budget`.

I believe that it was a false positive (i.e. the deadlock cannot
actually occur) because without the support of pipelining, `Enqueue`
calls and asynchronous requests evaluation never overlap in time.
Still, it's good to fix the order of mutex acquisitions.

Release note: None

sql: remove PHYSICAL scrub code

The PHYSICAL scrub code is experimental and not considered production
ready. It complicates a lot of code paths involved in normal query
execution (it significantly overloads the semantics of TableReader and
of the Fetcher) and is getting in the way of some improvements in how
the fetchers work. In particular, we are trying to reduce the amount
of information passed to TableReader/Fetcher (which in the
non-scrubbing case should be a lot less than the full table
descriptor).

There are some proposals for a better design floating around, e.g.
provide a facility for returning KVs as results from DistSQL and have
some higher-level code run the scrub checks.

This change removes the code for the PHYSICAL scrub for now.

Release note (sql change): the experimental SCRUB PHYSICAL is no
longer implemented.

vars: add placeholder session variable for xmloption

Release note (sql change): The `xmloption` session variable is now
accepted, only taking in `content`. Note this does not do anything.

clusterversion: improve a version comment

Gets rid of a squiggly line in Goland.

Release note: None

kvserver: plumb in a context into (*Store).GetConfReader

We'll use it in a future commit.

Release note: None

spanconfig/reconciler: export the checkpoint timestamp

We'll make use of it in a future commit.

Release note: None

spanconfig: get rid of ReconciliationDependencies interface

It was hollow, simply embedding the spanconfig.Reconciler interface. In
a future commit we end up relying on each pod's span con…
craig bot pushed a commit that referenced this pull request Feb 10, 2022
75890: backupccl: add more exportable crdb data types to exportParquet r=stevendanna a=msbutler

Previously, EXPORT PARQUET could only export tables with a small subset of CRDB
data types. In this PR, I add support for all CRDB data types that avro
changefeeds support. Specifically, data types in the OID and Void families are
excluded, as the former represents internal db objects and the latter cannot be
parsed. Further CRDB data types can be supported on an as-needed
basis.

This PR also adds a test that validates that EXPORT PARQUET properly encodes
data in arbitrary CRDB tables with supported data types.

Fixes  #67710

Release note (sql change): EXPORT PARQUET now supports all data types that avro
changefeeds support. Below is the data type conversion from CRDB to
Parquet.

To learn more about parquet data representation,
see https://github.com/apache/parquet-format

CRDB Type Family ->  Parquet Type | Parquet Logical Type
Bool -> boolean | nil
String -> byte array | string
Collated String -> byte array | string
INet -> byte array | string
JSON -> byte array | json
Int -> int64 | nil
Float -> float64 | nil
Decimal -> byte array | decimal (note: scale and precision data are preserved
in the parquet file)
Uuid -> fixed length byte array (16 bytes) | nil
Bytes -> byte array | nil
Bit -> byte array | nil
Enum -> byte array | Enum
Box2d -> byte array | string
Geography -> byte array | nil
Geometry -> byte array | nil
Date -> byte array | string
Time -> int64 | time (note: microseconds since midnight)
TimeTz -> byte array | string
Interval -> byte array | string (specifically represented as ISO8601)
Timestamp -> byte array | string
TimestampTz -> byte array | string
Array -> encoded as a repeated field and each array value gets encoded by
pattern described above.

75897: jobs: enable marking jobs as idle for metrics r=samiskin a=samiskin

In order to allow long-running / indefinite jobs to not continually
block a tenant from being treated as inactive in a serverless
environment, this change adds a MarkIdle API to the jobs registry which
increments or decrements a counter of currently idle jobs. An idle job
should be able to be shut down at any moment. Using the CurrentlyIdle
and CurrentlyRunning job counts, serverless can determine whether all
jobs are idle and the tenant can be safely shut down. For now this is
only kept in memory in the registry object, though we may transition to
persisting it in job state in the future.

Fixes #74747

Release note: None


76174: server: reduce allocations in setupSpanForIncomingRPC r=andreimatei a=andreimatei

A couple of variables were independently escaping because they were
captured by a closure. This patch groups them into a single allocation,
saving two allocations per call.

name                        old time/op    new time/op    delta
SetupSpanForIncomingRPC-32     508ns ± 2%     442ns ± 0%  -12.91%  (p=0.008 n=5+5)
name                        old alloc/op   new alloc/op   delta
SetupSpanForIncomingRPC-32      178B ± 0%      145B ± 0%  -18.54%  (p=0.008 n=5+5)
name                        old allocs/op  new allocs/op  delta
SetupSpanForIncomingRPC-32      4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.008 n=5+5)

Release note: None

76214: distsql: add plumbing for SQL pods to send RPCs to each other r=rharding6373 a=rharding6373

This PR adds initial support for distributed flows in a multi-tenant cluster
with multiple pods. It adds a test that assigns flows to multiple pods
and executes them. In order to support communication between SQL pods in
a multi-tenant environment, this PR adds a new kind of connector that
can provide address resolution for SQL instances, and plumbs a node
dialer that uses this new connector through distsql. If distsql is not
in multi-tenant mode, the same node dialer is used whether SQL instances
are communicating with other SQL instances or KV, as before this change.

Release note: None

76292: server: standardize error handling across all RPC endpoints r=jocrl a=knz

Arguably, there is still a lot of boilerplate in here, so a later
iteration of this work could consider handling errors at a higher
level.

However, before we can do this we want all the API handlers to handle
errors in a consistent manner, so that there's no surprise left during
the later refactoring.

This change also takes care of logging a few errors that were not
logged before.

Release note: None

76311: spanconfigsubscriber: pass ctx to KVSubscriber callbacks r=andreimatei a=andreimatei

One of these callbacks, in store.go, was capturing a ctx and using it
asynchronously. That was a tracing span use-after-finish. This patch
fixes it by plumbing contexts through and avoiding the capture.

Release note: None

76313: settings: introduce system.tenant_settings table r=ajwerner a=ajwerner

Part of implementing the multi-tenant cluster settings RFC #73349. I was planning to post it along with more of the implementation but I'm posting it earlier for visibility, in case we need to make decisions related to system table IDs.

Release note: None

76317: util/tracing: improve a comment r=andreimatei a=andreimatei

Hint that there's no race between dropping a span's lock when it's in
the middle of finishing and the parent inserting that span into the
registry.

Release note: None

76318: util/trace: fix span reset  r=andreimatei a=andreimatei

We were forgetting to reset a field.

Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@RaduBerinde (Member, Author) commented:

In this RFC we initially proposed a fourth class of setting, one that is system-scoped and has only one value, which tenants can read. We decided to drop it.

The main difference between this (let's call it `SystemVisible`) and `TenantReadOnly` is that if you change the setting, you have to also override it for tenants (using `ALTER TENANT ALL`).

This came up in a recent discussion about `kv.closed_timestamp.target_duration` and `kv.protectedts.reconciliation.interval`, where this class would make sense.

Should we try to add this new class or not? And if we do, should you still be able to change the value seen by tenants with `ALTER TENANT`?
