Implement live reconfiguration in the compute_ctl
#3923
Conversation
Nice! What console behavior do you imagine if configuration fails (let's say due to some escaping mismatch)? IIUC, after receiving the spec, the actual configuration here happens asynchronously.
This PR doesn't change this flow much, so we still rely on the fact that the console / control-plane does all the validations beforehand. If it manages to push an invalid spec, compute will be stuck. I was thinking about it, and I guess we need to teach compute_ctl to detect unrecoverable errors during Postgres configuration and return them as an error with a code. That way the console can notice them and mark the operation as completely failed. Yet, this brings us the whole new story of operation undo, as it could be applied partially, i.e. we deleted one role but failed to rename another, or failed to update the config, and so on. Not sure how to approach it yet.
Yeah, I put a comment somewhere in the code, but the idea is that the console checks that configuration is finished in some reasonable time. If it cannot, or if we cannot reach the compute API, or the API returns an error, or any other problem occurs, we just fall back to stop / start again.
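The fallback described above could be sketched on the control-plane side roughly like this. This is a hedged sketch: the `ComputeStatus` variants, the polling closure, and the timeouts are assumptions for illustration, not the actual console code.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum ComputeStatus {
    ConfigurationPending,
    Configuration,
    Running,
    Failed,
}

/// Poll the compute status until it reaches Running or the deadline
/// expires. A `None` result means the caller should fall back to a
/// full stop / start cycle, as described above.
fn wait_for_running(
    mut poll: impl FnMut() -> ComputeStatus,
    timeout: Duration,
    interval: Duration,
) -> Option<ComputeStatus> {
    let deadline = Instant::now() + timeout;
    loop {
        match poll() {
            ComputeStatus::Running => return Some(ComputeStatus::Running),
            // Unrecoverable error reported by compute: give up early.
            ComputeStatus::Failed => return None,
            // Still configuring: don't wait past the deadline.
            _ if Instant::now() >= deadline => return None,
            _ => std::thread::sleep(interval),
        }
    }
}

fn main() {
    // A compute that becomes Running on the third poll.
    let mut calls = 0;
    let status = wait_for_running(
        || {
            calls += 1;
            if calls < 3 {
                ComputeStatus::Configuration
            } else {
                ComputeStatus::Running
            }
        },
        Duration::from_secs(1),
        Duration::from_millis(1),
    );
    assert_eq!(status, Some(ComputeStatus::Running));

    // A compute that never finishes: the caller restarts it instead.
    let stuck = wait_for_running(
        || ComputeStatus::ConfigurationPending,
        Duration::from_millis(20),
        Duration::from_millis(1),
    );
    assert_eq!(stuck, None);
}
```

Either outcome (error status or timeout) funnels into the same stop / start backup plan.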
I thought that with a blocking call we can anyway hit some timeout if compute is slow to respond or if some error happened, so we need a backup plan as mentioned above. A sync handler does solve the issue of doing many calls and can potentially improve latency by the polling interval time, though. It will require another communication channel from the configurator thread back to the HTTP handler, I'll try to look into it. Probably it will be possible to do with
```rust
"must specify --control-plane-uri \"{:#?}\" and --compute-id \"{:#?}\"",
    control_plane_uri, compute_id
);
panic!("must specify both --control-plane-uri and --compute-id or none");
```
just curious, what does `compute-id` mean?
Right now it's a polymorphic id, i.e. it could be: an operation id, if you want to bind compute to the spec of some specific operation; an endpoint id, which unblocks live reconfiguration, as it allows fetching the most recent spec for an endpoint; and finally, an id of the new entity in the control-plane, compute, which unblocks pre-creation of a pool of computes that are not yet attached to any endpoint / tenant / timeline.
Made the API blocking as @kelvich proposed. Also switched to Mutex + Condvar as I initially wanted, since now I needed bi-directional communication between the API handler thread and the configurator thread. It looks even better to me now: we update compute state under one lock in the API handler and then just wait for a wakeup with the desired state. Same for the configurator thread, it simply waits for a wake-up.
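The Mutex + Condvar handshake described here can be sketched as follows; the status names are simplified stand-ins for illustration, not the actual compute_ctl types.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Empty,
    ConfigurationPending,
    Running,
}

/// One Mutex-protected status plus a Condvar used for wakeups in both
/// directions between the API handler and the configurator thread.
fn run_handshake() -> Status {
    let state = Arc::new((Mutex::new(Status::Empty), Condvar::new()));

    // Configurator thread: sleeps until the status becomes
    // ConfigurationPending, "applies" the spec, and reports Running.
    let cfg_state = Arc::clone(&state);
    let configurator = thread::spawn(move || {
        let (lock, cvar) = &*cfg_state;
        let mut status = lock.lock().unwrap();
        while *status != Status::ConfigurationPending {
            status = cvar.wait(status).unwrap();
        }
        // ... reconfigure Postgres here ...
        *status = Status::Running;
        cvar.notify_all();
    });

    // API handler: publish the desired state under the lock, wake the
    // configurator, then block until it reports Running.
    let (lock, cvar) = &*state;
    let mut status = lock.lock().unwrap();
    *status = Status::ConfigurationPending;
    cvar.notify_all();
    while *status != Status::Running {
        status = cvar.wait(status).unwrap();
    }
    let result = *status;
    drop(status);
    configurator.join().unwrap();
    result
}

fn main() {
    assert_eq!(run_handshake(), Status::Running);
}
```

Note the `while` loops around `cvar.wait()`: Condvar wakeups can be spurious, so both sides re-check the shared status after every wakeup.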
Accept the spec in JSON format and request compute reconfiguration from the configurator thread. If anything goes wrong after we set the compute state to `ConfigurationPending` and / or sent the spec to the configurator thread, we basically leave compute in a potentially wrong state. That said, it's the control-plane's responsibility to watch compute state after a reconfiguration request and to cleanly restart it in case of errors.

It still lacks the ability to start up without a spec and some validations, i.e. that live reconfiguration should only be available with the `--compute-id` and `--control-plane-uri` options. Otherwise, it works fine and can be tested by running `compute_ctl` locally, then sending it a new spec:

```shell
curl -d "$(cat ./compute-spec-new.json)" http://localhost:3080/spec
```

We have one configurator thread and an async HTTP server, so generally we have a single consumer - multiple producers pattern here. That's why we use an `mpsc` channel, not `tokio::sync::watch`. Actually, concurrency of producers is limited to one due to code logic, but we still need the ability to potentially pass a `Sender` to several threads.

Next, we use an async `hyper` + `tokio` HTTP server, but all the other code is completely synchronous. So we need to send data from async to sync code, and that's why we use `mpsc::unbounded_channel` here, not `mpsc::channel`. It doesn't make much sense to rewrite all the code to async now, but we can consider doing it in the future. I think that a combination of `Mutex` and `Condvar` would work just fine too, but as we already have `tokio`, I decided to try something from it.
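The single-consumer / multiple-producers shape can be illustrated with the standard library's `mpsc` channel. The PR itself uses `tokio::sync::mpsc::unbounded_channel` for the async-to-sync hop (its `send` never blocks, so it can be called from an async handler without awaiting), but the ownership pattern is the same:

```rust
use std::sync::mpsc;
use std::thread;

/// One configurator thread consumes specs; any number of handler
/// threads could hold a clone of the Sender and push new specs.
fn push_specs_through_channel(specs: Vec<&str>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();

    // The sync configurator thread: recv() blocks until a spec
    // arrives, and the loop ends once every Sender is dropped.
    let configurator = thread::spawn(move || {
        let mut applied = Vec::new();
        while let Ok(spec) = rx.recv() {
            applied.push(spec);
        }
        applied
    });

    // Producers: each handler gets its own Sender clone; send()
    // does not block the sending thread.
    for spec in specs {
        let tx = tx.clone();
        tx.send(spec.to_string()).unwrap();
    }
    drop(tx); // close the channel so the configurator loop terminates

    configurator.join().unwrap()
}

fn main() {
    let applied = push_specs_through_channel(vec!["spec-v1", "spec-v2"]);
    assert_eq!(applied, vec!["spec-v1".to_string(), "spec-v2".to_string()]);
}
```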
With this commit one can start compute with something like

```shell
cargo run --bin compute_ctl -- -i no-compute \
    -p http://localhost:9095 \
    -D compute_pgdata \
    -C "postgresql://[email protected]:5434/postgres" \
    -b ./pg_install/v15/bin/postgres
```

and it will hang waiting for the spec. Then send one spec:

```shell
curl -d "$(cat ./compute-spec.json)" http://localhost:3080/spec
```

Postgres will be started and configured. Then reconfigure it with:

```shell
curl -d "$(cat ./compute-spec-new.json)" http://localhost:3080/spec
```

Most safeguards and comments are added. Some polishing, especially around the HTTP API, is still needed.
I've polished the API handler a bit, added proper error responses, and refreshed the OpenAPI spec. Played with it locally and it worked just fine for me. I think it's generally ready to go / to be reviewed. Existing e2e tests were passing all the way through, so existing compute flows shouldn't be affected. Going to switch to the control-plane PoC part.
Test results for 18226ac: debug build: 212 tests run: 202 passed, 0 failed, 10 (full report); release build: 212 tests run: 202 passed, 0 failed, 10 (full report)
After sleeping on it and talking today with @kelvich, I've finally understood the problem with the combination of sync and async code :) I used
What are the other options? I had these two in mind (not 100% sure that each one will work well, though):
Overall, I'd prefer to go with how it's implemented in the current PR state and do experiments on the control-plane side, but I do understand that the current combination of sync and async code isn't really sane. @hlinnaka @LizardWizzard What do you think?
Consider splitting this in two:
Both of those make sense to me, but I think they would be easier to review separately. Let's also have separate APIs for those two operations. They're quite different, so I don't think calling them both "/configure" is a good idea. The control plane keeps track of the state, so it always knows which one to call.
For the live reconfiguration part: You cannot change all parts of the spec file at 'reconfigure', only the roles, databases, etc. You cannot change the tenant_id, for example. In theory, you could change the pageserver and safekeeper connection strings, but that's not implemented here. And I've got a feeling that the compute should get those from the storage broker, rather than the control plane. Let's introduce a new JSON document type for this, which includes only the fields that can be changed. I'm introducing the new spec file format V2 in PR #3923, but in light of this PR, I wonder if we should change it so that there is a separate section for the things that can be reconfigured. So the spec file would look like:
The /reconfigure endpoint could accept just the "cluster_config" part, then. Funnily enough, we almost have that separation in the spec V1 format already. I think that was the original intention, but we got confused at some point and added "cluster_id" and "name" fields to the "cluster" part, where they don't really belong.
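One way to picture the proposed split is as two nested types, where /reconfigure accepts only the mutable section. The field names below are purely illustrative, not the actual V1/V2 spec format:

```rust
/// Hypothetical shape of a split spec: the identity part is fixed at
/// compute creation, and only `cluster_config` may change at
/// reconfigure time.
#[derive(Debug, Clone, PartialEq)]
struct ClusterConfig {
    roles: Vec<String>,
    databases: Vec<String>,
    settings: Vec<(String, String)>, // GUC name -> value
}

#[derive(Debug, Clone, PartialEq)]
struct ComputeSpec {
    tenant_id: String,             // immutable
    timeline_id: String,           // immutable
    pageserver_connstring: String, // immutable (for now)
    cluster_config: ClusterConfig, // the reconfigurable section
}

/// The /reconfigure endpoint would accept only the mutable section.
fn reconfigure(spec: &mut ComputeSpec, new_config: ClusterConfig) {
    spec.cluster_config = new_config;
}

fn example() -> ComputeSpec {
    let mut spec = ComputeSpec {
        tenant_id: "t1".into(),
        timeline_id: "tl1".into(),
        pageserver_connstring: "host=ps1".into(),
        cluster_config: ClusterConfig {
            roles: vec!["admin".into()],
            databases: vec!["db1".into()],
            settings: vec![],
        },
    };
    reconfigure(
        &mut spec,
        ClusterConfig {
            roles: vec!["admin".into(), "reader".into()],
            databases: vec!["db1".into(), "db2".into()],
            settings: vec![("shared_buffers".into(), "128MB".into())],
        },
    );
    spec
}

fn main() {
    let spec = example();
    assert_eq!(spec.tenant_id, "t1"); // identity untouched
    assert_eq!(spec.cluster_config.databases.len(), 2);
}
```

With this shape, the type system itself documents which fields a reconfiguration request may carry.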
I'm glad you're introducing the Empty state where compute_ctl waits for the control plane to push the spec file. That feels like the right approach to me. Can we get rid of the get_spec_from_control_plane() code altogether?
```rust
// We only allow live re- / configuration of the compute node if
// it uses 'pull model', i.e. it can go to control-plane and fetch
// the latest configuration. Otherwise, there could be a case:
// - we start compute with some spec provided as argument
// - we push new spec and it does reconfiguration
// - but then something happens and compute pod / VM is destroyed,
//   so k8s controller starts it again with the **old** spec
```
That's a good point.
Perhaps we should remove the roles/databases/etc. from the compute spec altogether, and not try to sync them at compute start. Instead, the 'apply_config' operation would first start up the compute, and then perform the reconfiguration by calling the '/reconfigure' API.
I think that these two questions are related
> I'm glad you're introducing the Empty state where compute_ctl waits for the control plane to push the spec file. That feels like the right approach to me. Can we get rid of the get_spec_from_control_plane() code altogether?
As mentioned in the code comment, even if we properly start and configure compute node and finish control-plane operation, compute still could be restarted / recreated by k8s without control-plane being aware of that.
It means that compute needs a mechanism of getting a complete, up-to-date spec on each start. And that's what `get_spec_from_control_plane()` provides.
> Perhaps we should remove the roles/databases/etc. from the compute spec altogether, and not try to sync them at compute start. Instead, the 'apply_config' operation would first start up the compute, and then perform the reconfiguration by calling the '/reconfigure' API.
We can do that, but the current flow simplifies some things, i.e. we have only one operation / action, `apply_config`, which does everything. After this PR it will only need to know whether it should do a compute restart or not. If we do the change you propose, the logic for compute start will be more complex: i) if it's the first start, we need both `start_compute`, which will set all needed GUCs, and `apply_config`, which will create the initial roles and dbs; ii) if it's some later start, then only `start_compute` should be enough.
This could be a good idea, but it's not directly related to the current target: startup latency and fast reconfiguration. I'd address these two first and then think about splitting this logic. Keeping control-plane logic the same for now will help to iterate faster on it, imo
I'm also a bit afraid of how this could interfere with what Sasha does for SQL roles / dbs management, as with his changes we will have a bi-directional flow of roles / dbs between control-plane and compute, and I'd prefer to let it settle down before we do similar changes.
> I think that these two questions are related
>
> I'm glad you're introducing the Empty state where compute_ctl waits for the control plane to push the spec file. That feels like the right approach to me. Can we get rid of the get_spec_from_control_plane() code altogether?
>
> As mentioned in the code comment, even if we properly start and configure compute node and finish control-plane operation, compute still could be restarted / recreated by k8s without control-plane being aware of that.
Hmm. What's the point of supporting pod restart in the first place? All the connections will be lost, and the caches will be cold. If we simply let the pod die and not restart it, then the control plane will notice that and restart the compute on the next connection.

It would be kind of nice to support pod restarts for the sake of robustness: a compute could be restarted even if the control plane is not working for some reason. However, if the compute needs to make a call to the control plane at restart, it will require the control plane anyway.
> It means that compute needs a mechanism of getting a complete up-to-date spec on each start. And that's what `get_spec_from_control_plane()` provides.
Hmm. How about the control plane does this:

1. Pre-create a compute node. It starts in the Empty state, and waits for the control plane to call the `/configure` API.

When it's time to "bind" the compute to a particular endpoint:

2. Update the console db with the chosen pre-created compute-id.
3. Call `/configure` with the spec. Once it finishes, let the users connect to it.

Then in the background:

4. Create/update a ConfigMap with the spec file. If the pod is restarted, compute_ctl can read the spec file from the ConfigMap.
5. (Optional) Update the labels in Kubernetes with the endpoint ID, project ID, etc. and other nice-to-have information. This is not strictly required, but makes an admin's life easier.

If the pod is restarted between steps 3 and 4, it will still not be able to restart. But that's a short window, and if a pod crashes and restarts right after start, the control plane can destroy it and start a new pod instead.
I thought about ConfigMaps initially, but they are mounted into pods as volumes. I recall that we had hacks for VMs to add some files into them, i.e. we attach initial startup options as a CD-ROM, IIRC. cc @cicdteam
With the pull model, after steps 1-3 we don't need 4, only 5 is needed, as we have connected compute-id with the endpoint in the control-plane db, so it can get a new spec at any moment. No additional work on the neonvm side to support attaching/updating ConfigMaps, and no potential overhead from supporting N ConfigMaps in k8s. BTW, I think that 5 should actually be 1 (or at least come before the current 2). It's needed to properly route and label logs, so that all logs / metrics after we attach a workload to an empty compute are properly labeled with endpoint_id, tenant, etc.

There could be a valid concern about availability, but control-plane is an HA service with several instances already; after the split it should be even more robust. The good part is that we are fully in control of latency, i.e. we can add a spec cache and it will be super-fast, which we cannot say about ConfigMaps, as the only guarantee is that they will eventually be updated and synchronized.
> What's the point of supporting pod restart in the first place?
The same as with Empty: to get rid of container / VM spin-up overhead. Yes, we can take a new one from the pool, but: i) we still need to wait until the old one is completely dead (which can take some time), because we need only one writer at a time, otherwise they will be fighting for safekeepers; and ii) the pool could be exhausted anyway, so we would need a complete startup process.
I don't propose to do it right away, though, it's just a theoretical option
I can do that, sure.
That's right. For some reason I thought that ps / sk connstrings are
About having two separate endpoints: yes, we cannot change some parts of the compute, but internally the logic is pretty much the same: we update configuration parameters in the shared state and notify the waiting thread, then wait for the Running state. I was also thinking about adding some sanity checks, i.e. that tenant/timeline are the same. Also, if compute is restarted after reconfiguration, it will still fetch the complete spec from the control-plane, so why not post it there initially during reconfiguration.
We can, actually :) I mean we won't do it in our cloud, definitely, as it's not safe. But I can easily imagine that if someone is running

That said, yes, we can have two separate endpoints or not, but nothing really changes from the functionality and correctness point of view, so it just adds more complexity to the control-plane and
Ok, let's leave that part as it is, then.
When 'compute_ctl' is in the Empty state, I think there should be a heartbeat mechanism that terminates the pod automatically if the control plane doesn't contact it for more than N seconds. We don't want to have pre-created pods hanging around forever, if the control plane loses track of them for some reason. Alternatively, the control plane needs to carefully track the pre-created computes so that it can clean them up if needed. But I think it's better if they just die out automatically if they're not used. (We can add that as a follow-up PR)
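A heartbeat like this could be sketched inside compute_ctl roughly as below. This is a hypothetical design, not the actual compute_ctl API: the type name, the touch/expired split, and the TTL handling are all assumptions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical watchdog for a compute in the Empty state: if the
/// control plane has not contacted us for longer than `ttl`, the
/// process exits so k8s can reclaim the pod.
struct EmptyStateWatchdog {
    started: Instant,
    last_contact_ms: AtomicU64, // milliseconds since `started`
    ttl: Duration,
}

impl EmptyStateWatchdog {
    fn new(ttl: Duration) -> Self {
        Self {
            started: Instant::now(),
            last_contact_ms: AtomicU64::new(0),
            ttl,
        }
    }

    /// Called from every control-plane HTTP request handler.
    fn touch(&self) {
        let ms = self.started.elapsed().as_millis() as u64;
        self.last_contact_ms.store(ms, Ordering::Relaxed);
    }

    /// Polled periodically by a background thread; `true` means it is
    /// time to shut down.
    fn expired(&self) -> bool {
        let last = Duration::from_millis(self.last_contact_ms.load(Ordering::Relaxed));
        self.started.elapsed() - last > self.ttl
    }
}

fn main() {
    let wd = EmptyStateWatchdog::new(Duration::from_millis(10));
    wd.touch();
    assert!(!wd.expired()); // just contacted: keep waiting
    std::thread::sleep(Duration::from_millis(40));
    assert!(wd.expired()); // no contact for > ttl: time to exit
}
```

An `AtomicU64` keeps `touch()` cheap enough to call on every request without taking a lock.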
Yeah, I also thought about it. As we briefly mentioned on the call, there are several dimensions to the pool: compute version, Postgres version, AZ. A compute version can easily become stale after a storage release, so it's not needed anymore. I thought about making a GC process in the control-plane. It should anyway record all spawned computes in the db, where we can easily claim one with SELECT FOR UPDATE in the provisioner code. That way, we can periodically go over them and check whether they are still needed or not; we can also do liveness checks using
I think it would be more straightforward to treat pre-created pods the same way as compute pods from the control-plane side. And if the control plane loses track of a compute node, we are really in a bad situation, which needs to be fixed asap. So I'm not sure why we may need such protection for pre-created pods.
Closing as the last reconfiguration part was separated into #3980 |
Describe your changes
Accept spec in JSON format and request compute reconfiguration from the configurator thread. If anything goes wrong after we set the compute state to `ConfigurationPending` and / or sent the spec to the configurator thread, we basically leave compute in a potentially wrong state. That said, it's the control-plane's responsibility to watch compute state after a reconfiguration request and to cleanly restart it in case of errors.

With this patch one can start compute with something like

```shell
cargo run --bin compute_ctl -- -i no-compute \
    -p http://localhost:9095 \
    -D compute_pgdata \
    -C "postgresql://[email protected]:5434/postgres" \
    -b ./pg_install/v15/bin/postgres
```

and it will hang waiting for the spec.

Then send one spec:

```shell
curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure
```

Postgres will be started and configured.

Then reconfigure it with:

```shell
curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
```
Issue ticket number and link
Resolves neondatabase/cloud#4433
Checklist before requesting a review
Checklist before merging