
Daemon concurrent rpc operations #682

Merged 16 commits into master on Apr 5, 2019

Conversation

townsend2010
Contributor

@townsend2010 townsend2010 commented Mar 14, 2019

Initial framework for making the daemon much more concurrent in how it handles rpc calls and the associated operations.

Also with this, the check for detecting if ssh is up in the instance is now asynchronous.

Fixes #643

@gerboland
Contributor

Overall I am happy with this and its direction.

I was hoping to avoid locking in VirtualMachine but for now it is clear it translates most directly and so is less likely to cause breakage. So +1 so far!

Collaborator

@ricab ricab left a comment

Very nice. I just had a quick look to get an overview (I would need more study to understand it better). My only concerns at this point are some things that still need locking (e.g. async_future_watchers) and the unscoped locking/unlocking approach (a deadlock magnet IMO). https://en.cppreference.com/w/cpp/thread/scoped_lock is available in C++17.
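The std::scoped_lock suggestion above could look like this minimal sketch; the mutex, state variable, and function names are illustrative, not taken from the PR:

```cpp
#include <mutex>
#include <string>

std::mutex state_mutex;
std::string state = "off";

// A hypothetical state transition: std::scoped_lock (C++17) acquires the
// mutex on construction and releases it when the scope ends, so an unlock
// can never be forgotten on early returns or exceptions.
void set_state(const std::string& new_state)
{
    std::scoped_lock lock{state_mutex};
    state = new_state;
}

std::string get_state()
{
    std::scoped_lock lock{state_mutex};
    return state;
}
```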

@townsend2010
Contributor Author

Hey guys!

Thanks for the initial reviews. I agree that the locking is rough, but it will be refined over time, i.e., scoped locking of some sort (either C++17 scoped_lock or lock_guard in its own scope block), who owns the mutex and does the locking, etc.

Since it was mentioned, I don't think async_future_watchers needs any sort of locking. The only thing that modifies and uses it is the daemon thread itself. I do agree that there are probably other areas that need some locking...I just need to find those :)
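The "lock_guard in its own scope block" idea mentioned above can be sketched like this (the container and function names are invented for illustration):

```cpp
#include <mutex>
#include <vector>

std::mutex instances_mutex;
std::vector<int> instances;

int count_after_add(int id)
{
    {
        // Braces narrow the critical section: the lock_guard releases
        // instances_mutex at the closing brace, before any slower,
        // lock-free work that might follow.
        std::lock_guard<std::mutex> lock{instances_mutex};
        instances.push_back(id);
    }
    // ...longer work that does not need the mutex could go here...
    std::lock_guard<std::mutex> lock{instances_mutex};
    return static_cast<int>(instances.size());
}
```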

@townsend2010 townsend2010 force-pushed the daemon-concurrent-rpc-operations branch 2 times, most recently from 0ae6811 to 1d66af9 Compare March 25, 2019 20:22
@codecov

codecov bot commented Mar 25, 2019

Codecov Report

Merging #682 into master will decrease coverage by 0.04%.
The diff coverage is 35.34%.


@@            Coverage Diff             @@
##           master     #682      +/-   ##
==========================================
- Coverage   66.51%   66.46%   -0.05%     
==========================================
  Files         174      174              
  Lines        6042     6093      +51     
==========================================
+ Hits         4019     4050      +31     
- Misses       2023     2043      +20
Impacted Files Coverage Δ
src/daemon/daemon_rpc.h 0% <ø> (ø) ⬆️
include/multipass/vm_status_monitor.h 0% <ø> (ø) ⬆️
include/multipass/virtual_machine.h 0% <ø> (ø) ⬆️
src/client/cmd/start.cpp 70.21% <0%> (ø) ⬆️
src/utils/utils.cpp 77.57% <0%> (-1.48%) ⬇️
src/daemon/daemon_rpc.cpp 87.5% <100%> (+0.93%) ⬆️
...tform/backends/libvirt/libvirt_virtual_machine.cpp 42.66% <21.42%> (-2.27%) ⬇️
src/daemon/daemon.cpp 23.07% <28.22%> (+1.32%) ⬆️
...rc/platform/backends/qemu/qemu_virtual_machine.cpp 61.26% <41.66%> (-0.89%) ⬇️
src/daemon/daemon.h 50% <50%> (+16.66%) ⬆️
... and 1 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b47a7af...25b2261.

@townsend2010
Contributor Author

Ok, I think this is ready for review for the first iteration of concurrency.

@townsend2010 townsend2010 changed the title WIP: Daemon concurrent rpc operations Daemon concurrent rpc operations Mar 26, 2019
@ricab ricab self-requested a review March 26, 2019 19:23
Contributor

@gerboland gerboland left a comment

Code-wise, I think it looks great. Proof is in the testing, which I'll start tomorrow!

Resolved: src/rpc/multipass.proto
Collaborator

@ricab ricab left a comment

Submitting only a first part of the review. So far it looks really nice and I have only a couple of detail proposals. I am far from wrapping my head around the whole thing, though.

Right now I'm having some trouble understanding the meaning of the async methods in the daemon. Could you provide a brief explanation of the intent of lines 151-167 in daemon.h? Also, why was the ServerContext dropped from the RPC methods? Perhaps a better question would be why it was there in the first place. Thanks in advance.

Resolved: src/daemon/daemon_rpc.cpp
Resolved (outdated): src/daemon/daemon_rpc.cpp
Resolved: src/daemon/daemon.h
Resolved (outdated): src/daemon/daemon.h
@townsend2010
Contributor Author

@ricab,

Thanks for the review. Yes, it's quite a piece of code:) I'll answer your general questions here.

It would be helpful if you could provide a brief explanation of the intent of lines 151-167 in daemon.h?

Sure, will do!

Also, why was the ServerContext dropped from RPC methods? Perhaps a better question would be why it was there in the first place?

Previously, the Daemon class inherited from multipass::Rpc::Service, so those methods were overrides of that class. But as you saw, we don't need to inherit from multipass::Rpc::Service, so the methods are no longer overrides and we can drop ServerContext, since it's not necessary for our purposes.
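The refactor described here can be sketched abstractly; grpc::ServerContext and multipass::Rpc::Service are the real types involved, but the classes below are simplified stand-ins, not the project's code:

```cpp
#include <string>

// Simplified stand-in for the generated service base class, whose virtual
// methods force a context parameter into every signature.
struct Context {};
struct RpcServiceBase
{
    virtual ~RpcServiceBase() = default;
    virtual std::string list(Context* context) = 0;
};

// Before: Daemon overrides the base, so it must accept the (unused) context.
struct DaemonBefore : RpcServiceBase
{
    std::string list(Context* /*unused*/) override { return "instances"; }
};

// After: Daemon no longer inherits from the service, so the method keeps
// only the parameters it actually needs.
struct DaemonAfter
{
    std::string list() { return "instances"; }
};
```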

@ricab
Collaborator

ricab commented Mar 27, 2019

Sure, will do!

I meant here or over IRC, just to ease reviewing. I did not mean in the code, although I am fine with that too if you prefer.

[...] as you saw, we don't need to inherit from multipass::Rpc::Service [...]

OK, so it was just a leftover param from the Rpc signatures that was propagated to all the daemon stuff even though it was never used, correct? We should drop -Wno-unused-parameter to avoid this sort of thing...

@townsend2010
Contributor Author

Ok, I'll explain those lines here.

The AsyncOperationStatus struct is used to return the grpc Status along with the correct future that DaemonRpc created so the Daemon object can work on the correct operation when finish_async_operation is called.

async_wait_for_ssh_for is the base asynchronous method for wait_until_ssh_up and works on only one instance.

async_wait_for_ssh_all is the asynchronous method for working on multiple instances and calls async_wait_for_ssh_for for each instance in a loop. The loop itself is synchronous but since it's done in a separate thread, it's asynchronous as far as the daemon is concerned.

async_wait_for_ssh_and_start_mounts is the asynchronous method for starting an instance and does the wait_until_ssh_up and then sets up any mounts including installing sshfs in the instance if it's missing.

I hope this clears it up a bit.
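The pattern described (the daemon offloading a long operation to a thread and getting a future back to act on later) resembles this std::async sketch. The PR itself uses Qt's QFuture/QFutureWatcher, and the struct and function names here are illustrative only:

```cpp
#include <future>
#include <string>

// Hypothetical stand-in for the PR's AsyncOperationStatus: a status the
// caller can inspect once the offloaded operation completes.
struct AsyncOperationStatus
{
    bool ok;
    std::string message;
};

// Simulates something like async_wait_for_ssh_for: the slow check runs on
// another thread while the daemon stays free to service other RPCs.
std::future<AsyncOperationStatus> async_wait_for_ssh(const std::string& instance)
{
    return std::async(std::launch::async, [instance] {
        // ...the real code would poll until SSH is reachable...
        return AsyncOperationStatus{true, instance + ": ssh is up"};
    });
}
```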

Regarding removing -Wno-unused-parameter, I think we still want that there because DaemonRpc still overrides the Rpc class and ServerContext is not used there as well.

@ricab
Collaborator

ricab commented Mar 27, 2019

OK, thanks for the explanation. So basically the *async* methods correspond to the long operations that should be offloaded, right?

On unused parameter, replying in #697

@townsend2010
Contributor Author

So basically the async methods correspond to the long operations that should be offloaded, right?

Yep, exactly.

Chris Townsend added 6 commits April 2, 2019 15:43
This will allow the daemon to spawn off long running operations and then service any
new operations requested by the RPC.

Fixes #643
This is due to making the affected functions no longer override the grpc Server class functions.
Also commonize the creation of the QFutureWatcher.
@townsend2010 townsend2010 force-pushed the daemon-concurrent-rpc-operations branch from 3ae71ac to e59f178 Compare April 2, 2019 19:53
@gerboland
Contributor

I found a misbehaviour:

Terminal 1: Run multipass launch --name x xenial which started downloading the image (slow).
Terminal 2: Run multipass launch --name y xenial. It shows me "Creating y" spinner, then fails with

launch failed: failed to download from 'http://cloud-images.ubuntu.com/releases/server/releases/xenial/release-20190320/ubuntu-16.04-server-cloudimg-amd64-disk1.img': Network timeout

because the download in Terminal 1 was still in progress.

Terminal 1 download then failed with "launch failed: Cannot open image file for computing hash"

So maybe we need to check if 2 VMs are trying to launch simultaneously using the same download, and delay things?
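One way to implement the check suggested above (purely a sketch with invented names, not what the PR or multipass does) is to key in-flight downloads by URL and hand out a shared_future, so a second launch waits on the first download instead of starting a conflicting one:

```cpp
#include <future>
#include <map>
#include <mutex>
#include <string>

std::mutex downloads_mutex;
std::map<std::string, std::shared_future<std::string>> in_flight;

// Returns a future for the image at `url`, starting the download only if
// no other launch already did; concurrent callers share one result.
std::shared_future<std::string> fetch_image(const std::string& url)
{
    std::lock_guard<std::mutex> lock{downloads_mutex};
    auto it = in_flight.find(url);
    if (it != in_flight.end())
        return it->second;

    auto fut = std::async(std::launch::async, [url] {
        // ...the real code would download and return a local cache path...
        return "/cache/" + url;
    }).share();
    in_flight.emplace(url, fut);
    return fut;
}
```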

@townsend2010
Contributor Author

@gerboland,

I kind of think this is related to an already existing bug, #664. Also, the image-downloading part has not been touched by this PR. I plan on making image downloads better by making them asynchronous and only attempting the download once, but I'd really like to defer that work to the next iteration of this.

This PR is more about making list, etc. work while waiting for ssh to be ready during launch, start, and restart.

@gerboland
Contributor

@townsend2010 yep fine, I'm not asking for everything to be perfect in this MP.

Collaborator

@ricab ricab left a comment

I did not finish the review completely, but this is most of it, so I think we can iterate on it already. I have a bunch of questions and a few requests.

@@ -193,7 +193,7 @@ TEST_F(LibVirtBackend, machine_persists_and_sets_state_on_start)
NiceMock<mpt::MockVMStatusMonitor> mock_monitor;
auto machine = backend.create_virtual_machine(default_description, mock_monitor);

EXPECT_CALL(mock_monitor, persist_state_for(_));
EXPECT_CALL(mock_monitor, persist_state_for(_, _));
Collaborator

Might as well check the parameter is right in updated EXPECT_CALLs

Contributor Author

I suppose we could, but will it gain us anything?

Collaborator

It would confirm that the transition left the machine in the right state.

virDomainDestroy(domain.get());
state = State::off;
update_state();
state_wait.wait(lock);
Collaborator

@ricab ricab Apr 3, 2019

I am a bit confused about this synchronization bit, but let me try to explain my concerns:

  • Assuming this does need to wait for the notification, it needs guarding against "spurious wakes" (see item 3 on waiting here). IOW, the caller needs to check that the wake was legit, which can be done by passing a predicate to wait (2nd form).
  • But then we can deadlock, namely when we get into this if and the starting thread meanwhile moves the state to running (suppose it was just about to do it when we get here)...
  • But, actually, why is the wait needed here in the first place? See question below on notify_all

Contributor Author

See my answer in #682 (comment).

Collaborator

OK replying to that part there, but what about the first two points?

To clarify the second one, consider this sequence: the other thread is running mp::utils::wait_until_ssh_up, it already hit the process_vm_events/ensure_running thing and is now doing mp::utils::try_action_for (this cycle may have happened a number of times already); this thread gets here and waits; ssh successfully comes up in the other thread; this thread is now left waiting indefinitely... Perhaps a final call to ensure running is required on the other side

Contributor Author

Oh, forgot to address the other 2 parts:)

So, for spurious wakes, yes, I need to add a predicate. I forgot about that, so will do. Thanks!

For deadlocks, I'm unsure of your scenario. There are also locks in the action and on_timeout lambdas in mp::utils::wait_until_ssh_up() so state won't be set or read unless the lock is being held. I'll think about this some more just to make sure I'm not missing something.

std::lock_guard<decltype(state_mutex)> lock{state_mutex};
if (domain_state_for(domain.get()) != VirtualMachine::State::running)
{
state_wait.notify_all();
Collaborator

Why do we need the synchronization at this point? Nothing obvious was changed... IOW, what exactly was it that the shutdown had to wait for that is now ready? Or am I seeing this all wrong? What am I missing?

Contributor Author

What I'm trying to prevent is the daemon continuing its operation before it's safe to do so. The situation I'm trying to avoid is a multipass delete -p on a starting instance and the daemon destroying the corresponding *VirtualMachine object while it's starting, since the starting check is in another thread. This is a synchronization between the daemon thread and the async_wait_until_ssh_up() thread.

Collaborator

OK, I think I understand the goal better now

Resolved: src/daemon/daemon.cpp
Collaborator

@ricab ricab left a comment

OK, submitting replies as another review, to issue them in a single shot. And then actually adding one or two new points.


@gerboland
Contributor

I've spent some time testing this, and have not managed to break it. It's definitely a step in the right direction.

@ricab
Collaborator

ricab commented Apr 4, 2019

OK, I just wrote a reply to clarify the sort of thing I had in mind in one of the review comments. As agreed in the meeting today, I am handing the review to Gerry now.

Contributor

@gerboland gerboland left a comment

Second pass review, I'm happy to take this as it is. I can't force any functionality regression, and it seems stable.
Nice work!
bors r+

bors bot added a commit that referenced this pull request Apr 5, 2019
682: Daemon concurrent rpc operations r=gerboland a=townsend2010

Behavior of this branch (so far):
1. The check for ssh up in `launch`, `start`, and `restart`, should be asynchronous.
2. Issuing a `multipass delete -p <instance_name>` on a starting instance should be safe when using the qemu & libvirt backends.
3. Installing `sshfs` during `start` should be asynchronous like when defining a mount for the first time on a stopped instance.

Left to do:
1. Make installing `sshfs` during the `mount` command asynchronous.
2. Make other operations asynchronous, like preparing an image, downloading an image, and uncompressing an image.

Co-authored-by: Chris Townsend <[email protected]>
@townsend2010
Contributor Author

Bah, bors wants to use my comments in the, uh, comments section for the merge message. I'm going to fix that up and re-run bors.

bors r-

@bors
Contributor

bors bot commented Apr 5, 2019

Canceled

@townsend2010
Contributor Author

bors r=ricab,gerboland

bors bot added a commit that referenced this pull request Apr 5, 2019
682: Daemon concurrent rpc operations r=ricab,gerboland a=townsend2010

Initial framework for making the daemon much more concurrent in how it handles rpc calls and the associated operations.

Also with this, the check for detecting if ssh is up in the instance is now asynchronous.

Fixes #643 

Co-authored-by: Chris Townsend <[email protected]>
@bors
Contributor

bors bot commented Apr 5, 2019

Build failed

@townsend2010 townsend2010 merged commit 25b2261 into master Apr 5, 2019
@bors bors bot deleted the daemon-concurrent-rpc-operations branch April 5, 2019 17:35