
Implement live reconfiguration in the compute_ctl #3923

Closed
wants to merge 4 commits

Conversation

ololobus
Member

@ololobus ololobus commented Mar 31, 2023

Describe your changes

Accept a spec in JSON format and request compute reconfiguration from the configurator thread. If anything goes wrong after we have set the compute state to ConfigurationPending and/or sent the spec to the configurator thread, we basically leave the compute in a potentially wrong state. That said, it's the control-plane's responsibility to watch the compute state after a reconfiguration request and to do a clean restart in case of errors.

With this patch one can start compute with something like

```shell
cargo run --bin compute_ctl -- -i no-compute \
 -p http://localhost:9095 \
 -D compute_pgdata \
 -C "postgresql://[email protected]:5434/postgres" \
 -b ./pg_install/v15/bin/postgres
```

and it will hang waiting for spec.

Then send one spec

```shell
curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure
```

Postgres will be started and configured.

Then reconfigure it with

```shell
curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
```
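For reference, a minimal sketch of how such a request body could be deserialized on the compute_ctl side (the struct and field names here are illustrative, not necessarily the actual types in this PR):

```rust
use serde::Deserialize;

// Hypothetical wrapper matching the curl payload above: the spec itself is
// kept as raw JSON here and interpreted later by the configurator.
#[derive(Deserialize)]
struct ConfigureRequest {
    spec: serde_json::Value,
}

fn parse_configure_request(body: &[u8]) -> Result<ConfigureRequest, serde_json::Error> {
    serde_json::from_slice(body)
}
```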

Issue ticket number and link

Resolves neondatabase/cloud#4433

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Consider adding an ability to start without spec
  • Do some code polishing
  • Review all new code comments
  • Write a PoC control-plane part

@ololobus ololobus requested a review from a team as a code owner March 31, 2023 14:03
@ololobus ololobus requested review from NanoBjorn, hlinnaka and lubennikovaav and removed request for a team March 31, 2023 14:03
@NanoBjorn NanoBjorn removed their request for review March 31, 2023 14:26
@kelvich
Contributor

kelvich commented Mar 31, 2023

Nice! What console behavior do you imagine if configuration fails (let's say due to some escaping mismatch)? IIUC, after receiving the spec, the actual configuration here happens asynchronously. So apply_config should send the spec and then query postgres to validate that it was applied? Wouldn't it be easier to make "/spec" synchronous and retry only the spec-sending http call until we get a good status response?

@ololobus
Member Author

What console behavior do you imagine if configuration fails (let's say due to the some escaping mismatch)?

This PR doesn't change this flow much, so we still rely on the fact that the console / control-plane does all the validations beforehand. If it manages to push an invalid spec, the compute will be stuck. I was thinking about it, and I guess we need to teach compute_ctl to detect unrecoverable errors during postgres configuration and return them as an error with a code. That way the console can notice them and mark the operation as completely failed.

Yet, this brings us to a whole new story of undoing operations, as an operation could be applied partially, i.e. we deleted one role, but failed to rename another, or failed to update the config, and so on. Not sure how to approach it yet.

IIUC after receiving spec actual configuration here happens asynchronously. So apply_config should send spec then query postgres to validate that it was applied?

Yeah, I put a comment somewhere in the code, but the idea is that the console checks that configuration is finished in some reasonable time. If it isn't, or if we cannot reach the compute API, or the API returns an error, or any other problem happens, we just fall back to stop / start again.

Wouldn't it be easier to make "/spec" synchronous and retry only spec sending http call until we get a good status response?

I thought that with a blocking call we can anyway hit some timeout if the compute is slow to respond or if some error happens, so we need a backup plan as mentioned above. A sync handler does solve the issue of doing many calls and can potentially improve latency by the polling interval time, though. It will require another communication channel from the configurator thread back to the http handler; I'll try to look into it. Probably it will be possible to do with a Condvar.

"must specify --control-plane-uri \"{:#?}\" and --compute-id \"{:#?}\"",
control_plane_uri, compute_id
);
panic!("must specify both --control-plane-uri and --compute-id or none");
Contributor

just curious, what does compute-id mean?

Member Author

Right now it's a polymorphic id, i.e. it could be: an operation id, if you want to bind the compute to the spec of some specific operation; an endpoint id, which unblocks live reconfiguration, as it allows fetching the most recent spec for the endpoint; and, finally, it will be the id of the new entity in the control-plane -- compute -- which unblocks pre-creation of a pool of computes that are not yet attached to any endpoint / tenant / timeline.

@ololobus ololobus force-pushed the alexk/compute-configure branch from e0eecb3 to aa8d0b2 Compare April 3, 2023 18:40
@ololobus
Member Author

ololobus commented Apr 3, 2023

Made the API blocking as @kelvich proposed. I think we should extend /status in a similar way later to optionally pass the desired state and make it blocking too. That'd ease polling intensity and may generally improve latency a bit.

Also switched to Mutex + Condvar as I initially wanted, since now I need bi-directional communication between the API handler thread and the configurator thread. It looks even better to me now: we update the compute state under one lock in the API handler and then just wait for a wakeup with the desired state. Same for the configurator thread: it simply waits for a wakeup in the PendingConfiguration state.
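A minimal sketch of that Mutex + Condvar handshake (type and function names are illustrative, not the actual compute_ctl code):

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Clone, Copy, PartialEq, Debug)]
enum ComputeStatus {
    Empty,
    ConfigurationPending,
    Running,
    Failed,
}

struct ComputeNode {
    state: Mutex<ComputeStatus>,
    state_changed: Condvar,
}

// HTTP handler side: request reconfiguration and block until the
// configurator thread reports a new state.
fn request_configuration(compute: &Arc<ComputeNode>) -> ComputeStatus {
    let mut state = compute.state.lock().unwrap();
    *state = ComputeStatus::ConfigurationPending;
    compute.state_changed.notify_all();

    while *state == ComputeStatus::ConfigurationPending {
        // wait() releases the lock while sleeping and re-acquires it on wakeup
        state = compute.state_changed.wait(state).unwrap();
    }
    *state
}

// Configurator thread side: wait for a pending configuration, apply it,
// then publish the new state and wake up any waiters.
fn configurator_loop(compute: Arc<ComputeNode>) {
    loop {
        let mut state = compute.state.lock().unwrap();
        while *state != ComputeStatus::ConfigurationPending {
            state = compute.state_changed.wait(state).unwrap();
        }
        // ... apply the new spec to Postgres here (in real code the lock
        // would be dropped while reconfiguring) ...
        *state = ComputeStatus::Running;
        compute.state_changed.notify_all();
    }
}
```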

ololobus added 4 commits April 4, 2023 16:50
Accept a spec in JSON format and request compute reconfiguration from
the configurator thread. If anything goes wrong after we have set the
compute state to `ConfigurationPending` and/or sent the spec to the
configurator thread, we basically leave the compute in a potentially
wrong state. That said, it's the control-plane's responsibility to
watch the compute state after a reconfiguration request and to do a
clean restart in case of errors.

It still lacks the ability to start up without a spec and some
validations, e.g. that live reconfiguration should only be available
with the `--compute-id` and `--control-plane-uri` options.

Otherwise, it works fine and could be tested by running `compute_ctl`
locally, then sending it a new spec:
```shell
curl -d "$(cat ./compute-spec-new.json)" http://localhost:3080/spec
```

We have one configurator thread and an async http server, so generally
we have a single consumer - multiple producers pattern here. That's why
we use an `mpsc` channel, not `tokio::sync::watch`. Actually, the
concurrency of producers is limited to one by the code logic, but we
still need the ability to potentially pass the `Sender` to several
threads.

Next, we use an async `hyper` + `tokio` http server, but all the other
code is completely synchronous. So we need to send data from async to
sync code, and that's why we use `mpsc::unbounded_channel` here, not
`mpsc::channel`. It doesn't make much sense to rewrite all the code to
be async now, but we can consider doing this in the future.
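A rough sketch of that async -> sync handoff (names are illustrative; this is not the actual compute_ctl code):

```rust
use tokio::sync::mpsc;

// Hypothetical spec type, just to make the sketch self-contained.
#[derive(Debug)]
struct ComputeSpec;

fn spawn_configurator() -> mpsc::UnboundedSender<ComputeSpec> {
    let (tx, mut rx) = mpsc::unbounded_channel::<ComputeSpec>();

    // Plain OS thread running the existing synchronous configuration code.
    std::thread::spawn(move || {
        // blocking_recv() lets sync code wait on a tokio channel.
        while let Some(spec) = rx.blocking_recv() {
            println!("applying new spec: {spec:?}");
            // ... reconfigure Postgres here ...
        }
    });

    // The async hyper handler keeps the sender; UnboundedSender::send()
    // never blocks and needs no .await:
    //     tx.send(spec).expect("configurator thread is gone");
    tx
}
```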

I think that a combination of `Mutex` and `CondVar` would work just fine
too, but as we already have `tokio`, I decided to try something from it.
With this commit one can start compute with something like
```shell
cargo run --bin compute_ctl -- -i no-compute \
 -p http://localhost:9095 \
 -D compute_pgdata \
 -C "postgresql://[email protected]:5434/postgres" \
 -b ./pg_install/v15/bin/postgres
```
and it will hang waiting for spec.

Then send one spec
```shell
curl -d "$(cat ./compute-spec.json)" http://localhost:3080/spec
```
Postgres will be started and configured.

Then reconfigure it with
```shell
curl -d "$(cat ./compute-spec-new.json)" http://localhost:3080/spec
```

Most of the safeguards and comments are added. Some polishing,
especially around the HTTP API, is still needed.
@ololobus ololobus force-pushed the alexk/compute-configure branch from aa8d0b2 to 4e6a23f Compare April 4, 2023 14:51
@ololobus
Member Author

ololobus commented Apr 4, 2023

I've polished the API handler a bit, added proper error responses and refreshed the OpenAPI spec. Played with it locally and it worked just fine for me. I think it's generally ready to go / to be reviewed. Existing e2e tests were passing all the way through, so existing compute flows shouldn't be affected. Going to switch to the control-plane PoC part.

@github-actions

github-actions bot commented Apr 4, 2023

Test results for 18226ac:


debug build: 212 tests run: 202 passed, 0 failed, 10 skipped (full report)

release build: 212 tests run: 202 passed, 0 failed, 10 skipped (full report)


@ololobus
Member Author

ololobus commented Apr 5, 2023

After sleeping on it and talking today with @kelvich, I've finally understood the problem with the combination of sync and async code :)

I used Condvar and it only has a sync wait(), so waiting on it in the http server handler actually blocks one tokio worker thread. Honestly, I don't think it's a real problem due to the way it's intended to be used, i.e. only one or two blocking calls (waiting on the Condvar) to the API at any one moment. So setting worker_threads = 8 solves this completely. Even if we consider the worst case, i.e. all http calls to the compute are stuck, it's the control-plane's responsibility to fall back to stop/start on timeout.

The Condvar semantics are pretty neat: we have some shared state, we only want to update it under a lock and notify potential waiters that the state changed, so they can check whether it is the desired state and proceed with some work. That's why I still think it's rather OK for the first iteration.

What are the other options? I had these two in mind (not 100% sure that each one will work well, though):

  1. Use mpsc::unbounded_channel to communicate from the API handlers to the thread waiting for configuration, as I did in the first commit in this PR; it allows communication from async code to sync code. We will still need a backward communication channel to notify that the configuration changed; there is tokio::sync::watch, but I'm not sure it will work between sync and async code, it needs checking.

  2. Rewrite everything to async and use tokio::sync::Notify, which itself is OK, but less convenient, as one still needs to hold a Mutex / lock to get the new state after being notified (see the sketch at the end of this comment). Also, rewriting essentially sync configuration / startup code to be formally async doesn't sound great to me.

Overall, I'd prefer to go with how it's implemented in the current PR state and do experiments on the control-plane side, but I do understand that the current combination of sync and async code isn't really sane.

@hlinnaka @LizardWizzard What do you think?
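A minimal sketch of option 2 under assumed names (not code from this PR), illustrating why the waiter still has to take the lock and re-check the state after a Notify wakeup:

```rust
use std::sync::Arc;
use tokio::sync::{Mutex, Notify};

#[derive(Clone, Copy, PartialEq, Debug)]
enum ComputeStatus {
    ConfigurationPending,
    Running,
}

struct SharedState {
    status: Mutex<ComputeStatus>,
    changed: Notify,
}

// Waiter side: Notify carries no data, so after each wakeup we still
// need the lock to inspect the actual status.
async fn wait_for_running(shared: Arc<SharedState>) {
    loop {
        // Register the waiter before checking the status, so a
        // notify_waiters() racing with the check is not missed.
        let notified = shared.changed.notified();
        tokio::pin!(notified);
        notified.as_mut().enable();

        if *shared.status.lock().await == ComputeStatus::Running {
            return;
        }
        notified.await;
    }
}

// Notifier side (e.g. the configurator task after applying the spec).
async fn mark_running(shared: &SharedState) {
    *shared.status.lock().await = ComputeStatus::Running;
    shared.changed.notify_waiters();
}
```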

@hlinnaka
Contributor

hlinnaka commented Apr 5, 2023

Consider splitting this in two:

  1. The live reconfiguration.

  2. The mode where it starts without spec, and waits for the control plane to send it.

Both of those make sense to me, but I think they would be easier to review separately.

Let's also have separate APIs for those two operations. They're quite different, so I don't think calling them both "/configure" is a good idea. The control plane keeps track of the state, so it always knows which one to call.

@hlinnaka
Contributor

hlinnaka commented Apr 5, 2023

For the live reconfiguration part:

You cannot change all parts of the spec file at 'reconfigure'. Only the roles, databases, etc. You cannot change the tenant_id for example. In theory, you could change the pageserver and safekeeper connection strings, but that's not implemented here. And I've got a feeling that the compute should get those from the storage broker, rather than the control plane. Let's introduce a new JSON document type for this, which includes only the fields that can be changed.

I'm introducing the new spec file format V2 in PR #3923, but in light of this PR, I wonder if we should change it so that there is a separate section for the things that can be reconfigured. So the spec file would look like:

{
  "format-version": 2,
  "tenant_id": "...",
  "pageserver_connstring": "...",
  ...
  "cluster_config": {
     "roles": { ... },
     "databases": { ... },
     "extensions": { ... },
     "delta_operations": { ... }
  }
}

The /reconfigure endpoint could accept just the "cluster_config" part, then.

Funnily enough, we almost have that separation in the spec V1 format already. I think that was the original intention, but we got confused at some point and added "cluster_id" and "name" fields to the "cluster" part, where they don't really belong.

@hlinnaka
Contributor

hlinnaka commented Apr 5, 2023

I'm glad you're introducing the Empty state where compute_ctl waits for the control plane to push the spec file. That feels like the right approach to me. Can we get rid of the get_spec_from_control_plane() code altogether?

Comment on lines +46 to +52
// We only allow live re- / configuration of the compute node if
// it uses 'pull model', i.e. it can go to control-plane and fetch
// the latest configuration. Otherwise, there could be a case:
// - we start compute with some spec provided as argument
// - we push new spec and it does reconfiguration
// - but then something happens and compute pod / VM is destroyed,
// so k8s controller starts it again with the **old** spec
Contributor

That's a good point.

Perhaps we should remove the roles/databases/etc. from the compute spec altogether, and not try to sync them at compute start. Instead, the 'apply_config' operation would first start up the compute, and then perform the reconfiguration by calling the '/reconfigure' API.

Member Author

I think that these two questions are related

I'm glad you're introducing the Empty state where compute_ctl waits for the control plane to push the spec file. That feels like the right approach to me. Can we get rid of the get_spec_from_control_plane() code altogether?

As mentioned in the code comment, even if we properly start and configure compute node and finish control-plane operation, compute still could be restarted / recreated by k8s without control-plane being aware of that.

It means that compute needs a mechanism of getting a complete up-to-date spec on each start. And that's what get_spec_from_control_plane() provides.

Perhaps we should remove the roles/databases/etc. from the compute spec altogether, and not try to sync them at compute start. Instead, the 'apply_config' operation would first start up the compute, and then perform the reconfiguration by calling the '/reconfigure' API.

We can do that, but the current flow simplifies some things, i.e. we have only one operation / action, apply_config, which does everything. After this PR it will only need to know whether it should do a compute restart or not. If we do the change you propose, the logic for compute start will be more complex: i) if it's the first start, we need both start_compute, which will set all the needed GUCs, and apply_config, which will create the initial roles and dbs; ii) if it's a subsequent start, then only start_compute should be enough.

This could be a good idea, but it's not directly related to the current target: startup latency and fast reconfiguration. I'd address these two first and then think about splitting this logic. Keeping control-plane logic the same for now will help to iterate faster on it, imo

I'm also a bit afraid of how this could interfere with what Sasha does for SQL roles / dbs management, as with his changes we will have a bi-directional flow of roles / dbs between the control-plane and compute, and I'd prefer to let it settle down before we do similar changes.

Contributor

I think that these two questions are related

I'm glad you're introducing the Empty state where compute_ctl waits for the control plane to push the spec file. That feels like the right approach to me. Can we get rid of the get_spec_from_control_plane() code altogether?

As mentioned in the code comment, even if we properly start and configure compute node and finish control-plane operation, compute still could be restarted / recreated by k8s without control-plane being aware of that.

Hmm. What's the point of supporting pod restart in the first place? All the connections will be lost, and the caches will be cold. If we simply let the pod die and not restart it, then the control plane will notice that and restart the compute on the next connection.

It would be kind of nice to support pod restarts for the sake of robustness: a compute could be restarted even if the control plane is not working for some reason. However, if the compute needs to make a call to the control plane at restart, it will require the control plane anyway.

It means that compute needs a mechanism of getting a complete up-to-date spec on each start. And that's what get_spec_from_control_plane() provides.

Hmm. How about the control plane do this:

  1. Pre-create a compute node. It starts in the Empty state and waits for the control plane to call the /configure API.

When it's time to "bind" the compute to a particular endpoint:

  2. Update the console db with the chosen pre-created compute-id.
  3. Call /configure with the spec. Once it finished, let the users connect to it.

Then in the background:

  4. Create/update a ConfigMap with the spec file. If the pod is restarted, compute_ctl can read the spec file from the ConfigMap.
  5. (Optional) Update the labels in Kubernetes with the endpoint ID, project ID, etc. and other nice-to-have information. This is not strictly required, but makes an admin's life easier.

If the pod is restarted between steps 3 and 4, it will still not be able to restart. But that's a short window, and if a pod crashes and restarts right after start, the control plane can destroy and start a new pod instead.

Member Author

I thought about config maps initially, but they are mounted into pods as volumes. I recall that we had hacks for VMs to add some files into them, i.e. we attach the initial startup options as a CD-ROM, IIRC. cc @cicdteam

With the pull model, after steps 1-3 we don't need 4, only 5 is needed, as we have connected the compute-id with the endpoint in the control-plane db, so it can get a new spec at any moment. No additional work on the neonvm side to support attaching/updating configmaps, and no potential overhead from supporting N configmaps in k8s. BTW, I think that 5 should actually be 1 (or at least come before the current 2). It's needed to properly route and label logs, so that all logs / metrics after we attach a workload to an empty compute are properly labeled with endpoint_id, tenant, etc.

There could be a valid concern about availability, but the control-plane is an HA service with several instances already; after the split it should be even more robust. The good part is that we are fully in control of latency, i.e. we can add a spec cache and it will be super-fast, which we cannot say about config maps, as the only guarantee is that they should be eventually updated and synchronized.

Member Author

@ololobus ololobus Apr 6, 2023

What's the point of supporting pod restart in the first place?

The same as with empty computes -- to get rid of the container / VM spin-up overhead. Yes, we can take a new one from the pool, but: i) we still need to wait until the old one is completely dead (which can take some time), because we need only one writer at a time, otherwise they will be fighting over safekeepers; and ii) the pool could be exhausted anyway, so we would need a complete startup process.

I don't propose doing it right away, though; it's just a theoretical option.

@ololobus
Member Author

ololobus commented Apr 5, 2023

Consider splitting this in two

I can do that, sure.

You cannot change the tenant_id for example.
In theory, you could change the pageserver and safekeeper connection strings, but that's not implemented here.

That's right. For some reason I thought that the ps / sk connstrings are SIGHUP GUCs, but they are postmaster ones. Good to know.

@ololobus
Member Author

ololobus commented Apr 5, 2023

The /reconfigure endpoint could accept just the "cluster_config" part, then.

About having two separate endpoints: yes, we cannot change some parts of the compute, but internally the logic is pretty much the same: we update the configuration parameters in the shared state and notify the waiting thread, then wait for the Running state.

I was also thinking about adding some sanity checks, e.g. that the tenant/timeline are the same. Also, if the compute is restarted after reconfiguration, it will still fetch the complete spec from the control-plane, so why not post it there initially during reconfiguration.

You cannot change the tenant_id for example.

We can, actually :) I mean we definitely won't do it in our cloud, as it's not safe. But I can easily imagine that if someone is running neon locally and trusts the workload, they could do that. For example, we could add a restart: bool flag to the current /configure payload, and setting it to true would mean that it needs to do a clean start inside the pod / VM with basebackup and safekeepers sync. So one could convert a Running compute to Empty or repurpose it on the fly. Even in our cloud I can imagine that for switching endpoint branches (timelines).

That said, yes, we can have two separate endpoints or we can not, but nothing really changes from the functionality and correctness point of view, so it just adds more complexity to the control-plane and compute_ctl at this moment.

@hlinnaka
Contributor

hlinnaka commented Apr 5, 2023

Perhaps we should remove the roles/databases/etc. from the compute spec altogether, and not try to sync them at compute start. Instead, the 'apply_config' operation would first start up the compute, and then perform the reconfiguration by calling the '/reconfigure' API.

We can do that, but the current flow simplifies some things, i.e. we have only one operation / action, apply_config, which does everything. After this PR it will only need to know whether it should do a compute restart or not. If we do the change you propose, the logic for compute start will be more complex: i) if it's the first start, we need both start_compute, which will set all the needed GUCs, and apply_config, which will create the initial roles and dbs; ii) if it's a subsequent start, then only start_compute should be enough.

This could be a good idea, but it's not directly related to the current target: startup latency and fast reconfiguration. I'd address these two first and then think about splitting this logic. Keeping control-plane logic the same for now will help to iterate faster on it, imo

I'm also a bit afraid of how this could interfere with what Sasha does for SQL roles / dbs management, as with his changes we will have a bi-directional flow of roles / dbs between the control-plane and compute, and I'd prefer to let it settle down before we do similar changes.

Ok, let's leave that part as it is, then.

@hlinnaka
Contributor

hlinnaka commented Apr 5, 2023

When 'compute_ctl' is in the Empty state, I think there should be a heartbeat mechanism that terminates the pod automatically if the control plane doesn't contact it for more than N seconds. We don't want to have pre-created pods hanging around forever, if the control plane loses track of them for some reason. Alternatively, the control plane needs to carefully track the pre-created computes so that it can clean them up if needed. But I think it's better if they just die out automatically if they're not used.

(We can add that as a follow-up PR)
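To illustrate the idea, a rough sketch of such a watchdog (names and the check interval are made up; this is not part of the PR):

```rust
use std::process;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Track when the control plane last contacted us; HTTP handlers would
// call touch() on every control-plane request.
struct Heartbeat {
    last_contact: Mutex<Instant>,
}

impl Heartbeat {
    fn touch(&self) {
        *self.last_contact.lock().unwrap() = Instant::now();
    }
}

// Exit the process (and thus the pod) if the compute stays untouched
// in the Empty state for longer than `max_idle`.
fn spawn_empty_state_watchdog(hb: Arc<Heartbeat>, max_idle: Duration) {
    thread::spawn(move || loop {
        thread::sleep(Duration::from_secs(5));
        let idle = hb.last_contact.lock().unwrap().elapsed();
        if idle > max_idle {
            eprintln!("no control-plane contact for {idle:?}, shutting down");
            process::exit(1);
        }
    });
}
```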

@ololobus
Member Author

ololobus commented Apr 6, 2023

@hlinnaka I've split the first part into a smaller PR here: #3963. It's about the empty state, and reconfiguration could be added on top of it.

@ololobus
Member Author

ololobus commented Apr 6, 2023

When 'compute_ctl' is in the Empty state, I think there should be a heartbeat mechanism that terminates the pod automatically if the control plane doesn't contact it for more than N seconds. We don't want to have pre-created pods hanging around forever, if the control plane loses track of them for some reason. Alternatively, the control plane needs to carefully track the pre-created computes so that it can clean them up if needed. But I think it's better if they just die out automatically if they're not used.

Yeah, I also thought about it. As we briefly mentioned on the call, there are several dimensions to the pool: compute version, postgres version, AZ. The compute version can easily become stale after a storage release, so such computes are not needed anymore. I thought about making a GC process in the control-plane. It should record all spawned computes in the db anyway, where we can easily claim one with SELECT FOR UPDATE in the provisioner code. That way, we can periodically go over them and check whether they are still needed or not; we can also do liveness checks using the /status API on the compute.

@kelvich
Contributor

kelvich commented Apr 6, 2023

if the control plane loses track of them for some reason

I think it would be more straightforward to treat pre-created pods the same way as compute pods from the control-plane side. And if the control plane loses track of a compute node, we are really in a bad situation which needs to be fixed ASAP. So I'm not sure why we would need such protection for pre-created pods.

@ololobus
Member Author

ololobus commented Apr 7, 2023

Closing as the last reconfiguration part was separated into #3980

@ololobus ololobus closed this Apr 7, 2023