Daemon concurrent rpc operations #682
Conversation
Overall I am happy with this and its direction. I was hoping to avoid locking in VirtualMachine, but for now it is clear it translates most directly and so is less likely to cause breakage. So +1 so far!
Very nice. I just had a quick look to get an overview (would need more study to understand it better). My only concerns at this point are some stuff that still needs locking (e.g. `async_future_watchers`) and the unscoped locking/unlocking approach (a deadlock magnet IMO). https://en.cppreference.com/w/cpp/thread/scoped_lock is available in C++17.
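For reference, `std::scoped_lock` gives RAII-style locking with built-in deadlock avoidance when acquiring multiple mutexes at once, unlike manual lock()/unlock() pairs that can be skipped by an early return or an exception. A minimal sketch (the `Account`/`transfer` names are illustrative, not from this codebase):

```cpp
#include <mutex>

// Hypothetical pair of resources, each guarded by its own mutex.
struct Account
{
    std::mutex m;
    int balance = 100;
};

int transfer(Account& from, Account& to, int amount)
{
    // C++17: locks both mutexes atomically (deadlock-avoiding order)
    // and releases them automatically when the scope ends.
    std::scoped_lock lock{from.m, to.m};
    from.balance -= amount;
    to.balance += amount;
    return to.balance;
}
```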
Hey guys! Thanks for the initial reviews. I agree that the locking is rough, but it will be refined over time, i.e., scoped locking of some sort (e.g. C++17 `std::scoped_lock`). Since it was mentioned, I don't think …
(force-pushed from 0ae6811 to 1d66af9)
Codecov Report

```
@@            Coverage Diff             @@
##           master     #682      +/-   ##
==========================================
- Coverage   66.51%   66.46%    -0.05%
==========================================
  Files         174      174
  Lines        6042     6093      +51
==========================================
+ Hits         4019     4050      +31
- Misses       2023     2043      +20
```

Continue to review full report at Codecov.
Ok, I think this is ready for review for the first iteration of concurrency.
Code-wise, I think it looks great. Proof is in the testing, which I'll start tomorrow!
Submitting a first review bit only. So far it looks really nice and I have only a couple of detail proposals. I am far from wrapping my head around the whole thing though.
Right now I'm having some trouble understanding the meaning of the async methods in the daemon. Could you provide a brief explanation of the intent of lines 151-167 in daemon.h? Also, why was the ServerContext dropped from RPC methods? Perhaps a better question would be why it was there in the first place. Thanks in advance.
Thanks for the review. Yes, it's quite a piece of code :) I'll answer your general questions here.
Sure, will do!
Previously, the …
I meant here or over IRC, just to ease reviewing. I did not mean in the code, although I am fine with that too if you prefer.
OK, so it was just a leftover param from the Rpc signatures that was propagated to all daemon stuff even though it was never used, correct? We should drop -Wno-unused-parameter to avoid this sort of thing...
Ok, I'll explain those lines here. The …
I hope this clears it up a bit. Regarding removing …
OK, thanks for the explanation. So basically the *async* methods correspond to the long operations that should be offloaded, right? On the unused parameter, replying in #697.
Yep, exactly.
This will allow the daemon to spawn off long running operations and then service any new operations requested by the RPC. Fixes #643
This is due to making the affected functions no longer override the grpc Server class functions.
Also makes mounting in start async.
Also commonize the creation of the QFutureWatcher.
Make Daemon command functions virtual for mocking. Add default mock method calls to set the value of the promise.
…back into vm object
(force-pushed from 3ae71ac to e59f178)
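The offloading pattern behind these commits (the commits above mention a `QFutureWatcher` for spawning long-running operations while the RPC keeps servicing new requests) can be sketched with the standard library alone. This is only an analogue, with illustrative names, using `std::async` in place of Qt's machinery:

```cpp
#include <future>
#include <string>

// Sketch: offload a long-running operation and return a handle to it,
// so the caller (the daemon's RPC loop, in the PR's design) can keep
// servicing other requests while the work completes in the background.
// The name and the returned string are invented for illustration.
std::future<std::string> start_async(const std::string& name)
{
    return std::async(std::launch::async, [name] {
        // ...long-running work here, e.g. waiting for SSH to come up...
        return name + ": started";
    });
}
```

The caller holds the `std::future` and checks or waits on it later, much as the daemon's watcher reacts when the background operation finishes.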
I found a misbehaviour. Terminal 1: Run … because the download in Terminal 1 was still in progress. Terminal 1's download then failed with "launch failed: Cannot open image file for computing hash". So maybe we need to check if 2 VMs are trying to launch simultaneously using the same download, and delay things?
I kind of think this is related to an already existing bug, #664. Also, the image-downloading part has not been touched by this PR. I plan on making the image downloads better by making them asynchronous and only trying to download once. I'd really like to defer that work to the next iteration of this, though. This PR is more about making …
@townsend2010 yep, fine, I'm not asking for everything to be perfect in this MP.
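One possible shape for the "only download once" follow-up discussed above: concurrent launches that need the same image could share a single in-flight download through a `std::shared_future`, instead of racing on the same file. This is purely a hypothetical sketch; `ImageFetcher` and its API are invented for illustration:

```cpp
#include <future>
#include <map>
#include <mutex>
#include <string>

// Hypothetical single-flight downloader: the first caller for a given
// URL starts the download; later callers for the same URL get a handle
// to the same in-flight result rather than starting a second download.
class ImageFetcher
{
public:
    std::shared_future<std::string> fetch(const std::string& url)
    {
        std::lock_guard<std::mutex> lock{mutex};
        auto it = in_flight.find(url);
        if (it != in_flight.end())
            return it->second; // join the existing download

        auto fut = std::async(std::launch::async, [url] {
            // ...download the image and compute its hash here...
            return url + " (cached)";
        }).share();
        in_flight.emplace(url, fut);
        return fut;
    }

private:
    std::mutex mutex;
    std::map<std::string, std::shared_future<std::string>> in_flight;
};
```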
I did not finish the review completely, but this is most of it, so I think we can iterate on it already. I have a bunch of questions and a few requests.
```diff
@@ -193,7 +193,7 @@ TEST_F(LibVirtBackend, machine_persists_and_sets_state_on_start)
     NiceMock<mpt::MockVMStatusMonitor> mock_monitor;
     auto machine = backend.create_virtual_machine(default_description, mock_monitor);

-    EXPECT_CALL(mock_monitor, persist_state_for(_));
+    EXPECT_CALL(mock_monitor, persist_state_for(_, _));
```
Might as well check the parameter is right in the updated EXPECT_CALLs.
I suppose we could, but will it gain us anything?
It would confirm that the transition left the machine in the right state.
```cpp
virDomainDestroy(domain.get());
state = State::off;
update_state();
state_wait.wait(lock);
```
I am a bit confused about this synchronization bit, but let me try to explain my concerns:
- Assuming this does need to wait for the notification, it needs guarding against "spurious wakes" (see item 3 on waiting here). IOW, the caller needs to check that the wake was legit, which can be done by passing a predicate to `wait` (2nd form).
- But then we can deadlock, namely when we get into this `if` and the starting thread meanwhile moves the state to running (suppose it was just about to do it when we get here)...
- But, actually, why is the wait needed here in the first place? See question below on `notify_all`.
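The predicate form of `wait` referred to in the first point looks like this. A minimal sketch with illustrative names (`stopped` stands in for the real state check, e.g. the domain no longer running):

```cpp
#include <condition_variable>
#include <mutex>

std::mutex state_mutex;
std::condition_variable state_wait;
bool stopped = false;

void wait_for_stop()
{
    std::unique_lock<std::mutex> lock{state_mutex};
    // The predicate is re-evaluated on every wake-up, so a spurious
    // wake (or a notify meant for another condition) just goes back
    // to sleep instead of proceeding on stale state.
    state_wait.wait(lock, [] { return stopped; });
}

void signal_stop()
{
    {
        std::lock_guard<std::mutex> lock{state_mutex};
        stopped = true; // update state under the lock, before notifying
    }
    state_wait.notify_all();
}
```

Note that if `signal_stop()` runs first, a later `wait_for_stop()` returns immediately because the predicate is already true; a bare `wait()` with no predicate would block forever in that ordering, which is exactly the lost-notification hazard raised in this thread.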
See my answer in #682 (comment).
OK, replying to that part there, but what about the first two points?
To clarify the second one, consider this sequence: the other thread is running `mp::utils::wait_until_ssh_up`, it already hit the `process_vm_events`/`ensure_running` thing and is now doing `mp::utils::try_action_for` (this cycle may have happened a number of times already); this thread gets here and waits; ssh successfully comes up in the other thread; this thread is now left waiting indefinitely... Perhaps a final call to ensure running is required on the other side.
Oh, forgot to address the other 2 parts :)
So, for spurious wakes, yes, I need to add a predicate. I forgot about that, so will do. Thanks!
For deadlocks, I'm unsure of your scenario. There are also locks in the `action` and `on_timeout` lambdas in `mp::utils::wait_until_ssh_up()`, so state won't be set or read unless the lock is held. I'll think about this some more just to make sure I'm not missing something.
```cpp
std::lock_guard<decltype(state_mutex)> lock{state_mutex};
if (domain_state_for(domain.get()) != VirtualMachine::State::running)
{
    state_wait.notify_all();
```
Why do we need the synchronization at this point? Nothing obvious was changed... IOW, what exactly was it that the shutdown had to wait for that is now ready? Or am I seeing this all wrong? What am I missing?
What I'm trying to prevent is the daemon continuing its operation until it's safe to do so. The situation I'm trying to avoid is a `multipass delete -p` on a starting instance and the daemon destroying the corresponding VirtualMachine object while it's starting, since the starting check is in another thread. This is a synchronization between the daemon thread and the `async_wait_until_ssh_up()` thread.
OK, I think I understand the goal better now
OK, submitting replies as another review, to issue them in a single shot. And then actually adding one or two new points.
I've spent some time testing this, and have not managed to break it. It's definitely a step in the right direction.
OK, I just wrote a reply to clarify the sort of thing I had in mind in one of the review comments. As agreed in the meeting today, I am handing the review to Gerry now.
Second-pass review: I'm happy to take this as it is. I can't force any functionality regression, and it seems stable.
Nice work!
bors r+
682: Daemon concurrent rpc operations r=gerboland a=townsend2010

Behavior of this branch (so far):
1. The check for ssh up in `launch`, `start`, and `restart` should be asynchronous.
2. Issuing a `multipass delete -p <instance_name>` on a starting instance should be safe when using the qemu & libvirt backends.
3. Installing `sshfs` during `start` should be asynchronous, like when defining a mount for the first time on a stopped instance.

Left to do:
1. Make installing `sshfs` during the `mount` command asynchronous.
2. Make other operations asynchronous, like preparing an image, downloading an image, and uncompressing an image.

Co-authored-by: Chris Townsend <[email protected]>
Bah, bors wants to use my comments in the, uh, comments section for the merge message. I'm going to fix that up and re-run bors.
bors r-
Canceled
bors r=ricab,gerboland
682: Daemon concurrent rpc operations r=ricab,gerboland a=townsend2010

Initial framework for making the daemon much more concurrent in how it handles rpc calls and the associated operations. Also with this, the check for detecting if ssh is up in the instance is now asynchronous.

Fixes #643

Co-authored-by: Chris Townsend <[email protected]>
Build failed
Initial framework for making the daemon much more concurrent in how it handles rpc calls and the associated operations.
Also with this, the check for detecting if ssh is up in the instance is now asynchronous.
Fixes #643