
Daemon concurrent rpc operations #682

Merged 16 commits into master on Apr 5, 2019

Conversation

townsend2010
Contributor

@townsend2010 townsend2010 commented Mar 14, 2019

Initial framework for making the daemon much more concurrent in how it handles rpc calls and the associated operations.

Also with this, the check for detecting if ssh is up in the instance is now asynchronous.

Fixes #643

@gerboland
Contributor

Overall I am happy with this and its direction.

I was hoping to avoid locking in VirtualMachine but for now it is clear it translates most directly and so is less likely to cause breakage. So +1 so far!

Collaborator

@ricab ricab left a comment

Very nice. I just had a quick look to get an overview (I would need more study to understand it better). My only concerns at this point are some things that still need locking (e.g. async_future_watchers) and the unscoped locking/unlocking approach (a deadlock magnet IMO). https://en.cppreference.com/w/cpp/thread/scoped_lock is available in C++17.
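The std::scoped_lock suggestion above could look like this minimal sketch; the mutex, state variable, and function names are illustrative, not taken from the PR:

```cpp
#include <mutex>
#include <string>

std::mutex state_mutex;
std::string state = "off";

// A hypothetical state transition: std::scoped_lock (C++17) acquires the
// mutex on construction and releases it when the scope ends, so an unlock
// can never be forgotten on early returns or exceptions.
void set_state(const std::string& new_state)
{
    std::scoped_lock lock{state_mutex};
    state = new_state;
}

std::string get_state()
{
    std::scoped_lock lock{state_mutex};
    return state;
}
```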

@townsend2010
Contributor Author

Hey guys!

Thanks for the initial reviews. I agree that the locking is rough, but it will be refined over time, i.e., scoped locking of some sort (either C++17 scoped_lock or lock_guard in its own scope block), who owns the mutex and does the locking, etc.

Since it was mentioned, I don't think async_future_watchers needs any sort of locking. The only thing that modifies and uses it is the daemon thread itself. I do agree that there are probably other areas that need some locking...I just need to find those :)
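The "lock_guard in its own scope block" idea mentioned above can be sketched like this (the container and function names are invented for illustration):

```cpp
#include <mutex>
#include <vector>

std::mutex instances_mutex;
std::vector<int> instances;

int count_after_add(int id)
{
    {
        // Braces narrow the critical section: the lock_guard releases
        // instances_mutex at the closing brace, before any slower,
        // lock-free work that might follow.
        std::lock_guard<std::mutex> lock{instances_mutex};
        instances.push_back(id);
    }
    // ...longer work that does not need the mutex could go here...
    std::lock_guard<std::mutex> lock{instances_mutex};
    return static_cast<int>(instances.size());
}
```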

@townsend2010 townsend2010 force-pushed the daemon-concurrent-rpc-operations branch 2 times, most recently from 0ae6811 to 1d66af9 Compare March 25, 2019 20:22
@codecov

codecov bot commented Mar 25, 2019

Codecov Report

Merging #682 into master will decrease coverage by 0.04%.
The diff coverage is 35.34%.


@@            Coverage Diff             @@
##           master     #682      +/-   ##
==========================================
- Coverage   66.51%   66.46%   -0.05%     
==========================================
  Files         174      174              
  Lines        6042     6093      +51     
==========================================
+ Hits         4019     4050      +31     
- Misses       2023     2043      +20
Impacted Files Coverage Δ
src/daemon/daemon_rpc.h 0% <ø> (ø) ⬆️
include/multipass/vm_status_monitor.h 0% <ø> (ø) ⬆️
include/multipass/virtual_machine.h 0% <ø> (ø) ⬆️
src/client/cmd/start.cpp 70.21% <0%> (ø) ⬆️
src/utils/utils.cpp 77.57% <0%> (-1.48%) ⬇️
src/daemon/daemon_rpc.cpp 87.5% <100%> (+0.93%) ⬆️
...tform/backends/libvirt/libvirt_virtual_machine.cpp 42.66% <21.42%> (-2.27%) ⬇️
src/daemon/daemon.cpp 23.07% <28.22%> (+1.32%) ⬆️
...rc/platform/backends/qemu/qemu_virtual_machine.cpp 61.26% <41.66%> (-0.89%) ⬇️
src/daemon/daemon.h 50% <50%> (+16.66%) ⬆️
... and 1 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b47a7af...25b2261.

@townsend2010
Contributor Author

Ok, I think this is ready for review for the first iteration of concurrency.

@townsend2010 townsend2010 changed the title WIP: Daemon concurrent rpc operations Daemon concurrent rpc operations Mar 26, 2019
@ricab ricab self-requested a review March 26, 2019 19:23
Contributor

@gerboland gerboland left a comment

Code-wise, I think it looks great. Proof is in the testing, which I'll start tomorrow!

Resolved: src/rpc/multipass.proto
Collaborator

@ricab ricab left a comment

Submitting only a first part of the review. So far it looks really nice and I have only a couple of detail proposals. I am far from wrapping my head around the whole thing, though.

Right now I'm having some trouble understanding the meaning of the async methods in the daemon. Could you provide a brief explanation of the intent of lines 151-167 in daemon.h? Also, why was the ServerContext dropped from the RPC methods? Perhaps a better question would be why it was there in the first place. Thanks in advance.

Resolved: src/daemon/daemon_rpc.cpp
Resolved (outdated): src/daemon/daemon_rpc.cpp
Resolved: src/daemon/daemon.h
Resolved (outdated): src/daemon/daemon.h
@townsend2010
Contributor Author

@ricab,

Thanks for the review. Yes, it's quite a piece of code:) I'll answer your general questions here.

It would be helpful if you could provide a brief explanation of the intent of lines 151-167 in daemon.h?

Sure, will do!

Also, why was the ServerContext dropped from RPC methods? Perhaps a better question would be why it was there in the first place?

Previously, the Daemon class inherited from multipass::Rpc::Service, so those methods were overrides of that class. But as you saw, we don't need to inherit from multipass::Rpc::Service, so the methods are no longer overrides and we can drop ServerContext, since it's not necessary for our purposes.
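The refactor described here can be sketched abstractly; grpc::ServerContext and multipass::Rpc::Service are the real types involved, but the classes below are simplified stand-ins, not the project's code:

```cpp
#include <string>

// Simplified stand-in for the generated service base class, whose virtual
// methods force a context parameter into every signature.
struct Context {};
struct RpcServiceBase
{
    virtual ~RpcServiceBase() = default;
    virtual std::string list(Context* context) = 0;
};

// Before: Daemon overrides the base, so it must accept the (unused) context.
struct DaemonBefore : RpcServiceBase
{
    std::string list(Context* /*unused*/) override { return "instances"; }
};

// After: Daemon no longer inherits from the service, so the method keeps
// only the parameters it actually needs.
struct DaemonAfter
{
    std::string list() { return "instances"; }
};
```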

@ricab
Collaborator

ricab commented Mar 27, 2019

Sure, will do!

I meant here or over IRC, just to ease reviewing. I did not mean in the code, although I am fine with that too if you prefer.

[...] as you saw, we don't need to inherit from multipass::Rpc::Service [...]

OK, so it was just a leftover param from the Rpc signatures that was propagated to all the daemon stuff even though it was never used, correct? We should drop -Wno-unused-parameter to avoid this sort of thing...

@townsend2010
Contributor Author

Ok, I'll explain those lines here.

The AsyncOperationStatus struct is used to return the grpc Status along with the correct future that DaemonRpc created so the Daemon object can work on the correct operation when finish_async_operation is called.

async_wait_for_ssh_for is the base asynchronous method for wait_until_ssh_up and works on only one instance.

async_wait_for_ssh_all is the asynchronous method for working on multiple instances and calls async_wait_for_ssh_for for each instance in a loop. The loop itself is synchronous but since it's done in a separate thread, it's asynchronous as far as the daemon is concerned.

async_wait_for_ssh_and_start_mounts is the asynchronous method for starting an instance and does the wait_until_ssh_up and then sets up any mounts including installing sshfs in the instance if it's missing.

I hope this clears it up a bit.
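The pattern described (the daemon offloading a long operation to a thread and getting a future back to act on later) resembles this std::async sketch. The PR itself uses Qt's QFuture/QFutureWatcher, and the struct and function names here are illustrative only:

```cpp
#include <future>
#include <string>

// Hypothetical stand-in for the PR's AsyncOperationStatus: a status the
// caller can inspect once the offloaded operation completes.
struct AsyncOperationStatus
{
    bool ok;
    std::string message;
};

// Simulates something like async_wait_for_ssh_for: the slow check runs on
// another thread while the daemon stays free to service other RPCs.
std::future<AsyncOperationStatus> async_wait_for_ssh(const std::string& instance)
{
    return std::async(std::launch::async, [instance] {
        // ...the real code would poll until SSH is reachable...
        return AsyncOperationStatus{true, instance + ": ssh is up"};
    });
}
```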

Regarding removing -Wno-unused-parameter, I think we still want that there because DaemonRpc still overrides the Rpc class and ServerContext is not used there as well.

@ricab
Collaborator

ricab commented Mar 27, 2019

OK, thanks for the explanation. So basically the *async* methods correspond to the long operations that should be offloaded, right?

On unused parameter, replying in #697

@townsend2010
Contributor Author

So basically the async methods correspond to the long operations that should be offloaded, right?

Yep, exactly.

Chris Townsend added 6 commits April 2, 2019 15:43
This will allow the daemon to spawn off long running operations and then service any
new operations requested by the RPC.

Fixes #643
This is due to making the affected functions no longer override the grpc Server class functions.
Also commonize the creation of the QFutureWatcher.
@townsend2010 townsend2010 force-pushed the daemon-concurrent-rpc-operations branch from 3ae71ac to e59f178 Compare April 2, 2019 19:53
@gerboland
Contributor

I found a misbehaviour:

Terminal 1: Run multipass launch --name x xenial which started downloading the image (slow).
Terminal 2: Run multipass launch --name y xenial. It shows me "Creating y" spinner, then fails with

launch failed: failed to download from 'http://cloud-images.ubuntu.com/releases/server/releases/xenial/release-20190320/ubuntu-16.04-server-cloudimg-amd64-disk1.img': Network timeout

because the download in Terminal 1 was still in progress.

Terminal 1 download then failed with "launch failed: Cannot open image file for computing hash"

So maybe we need to check if 2 VMs are trying to launch simultaneously using the same download, and delay things?
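One way to implement the check suggested above (purely a sketch with invented names, not what the PR or multipass does) is to key in-flight downloads by URL and hand out a shared_future, so a second launch waits on the first download instead of starting a conflicting one:

```cpp
#include <future>
#include <map>
#include <mutex>
#include <string>

std::mutex downloads_mutex;
std::map<std::string, std::shared_future<std::string>> in_flight;

// Returns a future for the image at `url`, starting the download only if
// no other launch already did; concurrent callers share one result.
std::shared_future<std::string> fetch_image(const std::string& url)
{
    std::lock_guard<std::mutex> lock{downloads_mutex};
    auto it = in_flight.find(url);
    if (it != in_flight.end())
        return it->second;

    auto fut = std::async(std::launch::async, [url] {
        // ...the real code would download and return a local cache path...
        return "/cache/" + url;
    }).share();
    in_flight.emplace(url, fut);
    return fut;
}
```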

@townsend2010
Contributor Author

@gerboland,

I kind of think this is related to an already existing bug, #664. Also, the image-downloading part has not been touched by this PR. I plan on making image downloads better by making them asynchronous and only attempting the download once, but I'd really like to defer that work to the next iteration of this.

This PR is more about making list, etc. work while waiting for ssh to be ready during launch, start, and restart.

@gerboland
Contributor

@townsend2010 yep fine, I'm not asking for everything to be perfect in this MP.

Collaborator

@ricab ricab left a comment

I did not finish the review completely, but this is most of it, so I think we can iterate on it already. I have a bunch of questions and a few requests.

@@ -193,7 +193,7 @@ TEST_F(LibVirtBackend, machine_persists_and_sets_state_on_start)
NiceMock<mpt::MockVMStatusMonitor> mock_monitor;
auto machine = backend.create_virtual_machine(default_description, mock_monitor);

EXPECT_CALL(mock_monitor, persist_state_for(_));
EXPECT_CALL(mock_monitor, persist_state_for(_, _));
Collaborator

Might as well check the parameter is right in updated EXPECT_CALLs

Contributor Author

I suppose we could, but will it gain us anything?

Collaborator

It would confirm that the transition left the machine in the right state.

virDomainDestroy(domain.get());
state = State::off;
update_state();
state_wait.wait(lock);
Collaborator

@ricab ricab Apr 3, 2019

I am a bit confused about this synchronization bit, but let me try to explain my concerns:

  • Assuming this does need to wait for the notification, it needs guarding against "spurious wakes" (see item 3 on waiting here). IOW, the caller needs to check that the wake was legit, which can be done by passing a predicate to wait (2nd form).
  • But then we can deadlock, namely when we get into this if and the starting thread meanwhile moves the state to running (suppose it was just about to do it when we get here)...
  • But, actually, why is the wait needed here in the first place? See question below on notify_all

Contributor Author

See my answer in #682 (comment).

Collaborator

OK replying to that part there, but what about the first two points?

To clarify the second one, consider this sequence: the other thread is running mp::utils::wait_until_ssh_up, it already hit the process_vm_events/ensure_running thing and is now doing mp::utils::try_action_for (this cycle may have happened a number of times already); this thread gets here and waits; ssh successfully comes up in the other thread; this thread is now left waiting indefinitely... Perhaps a final call to ensure running is required on the other side

Contributor Author

Oh, forgot to address the other 2 parts:)

So, for spurious wakes, yes, I need to add a predicate. I forgot about that, so will do. Thanks!

For deadlocks, I'm unsure of your scenario. There are also locks in the action and on_timeout lambdas in mp::utils::wait_until_ssh_up() so state won't be set or read unless the lock is being held. I'll think about this some more just to make sure I'm not missing something.

std::lock_guard<decltype(state_mutex)> lock{state_mutex};
if (domain_state_for(domain.get()) != VirtualMachine::State::running)
{
state_wait.notify_all();
Collaborator

Why do we need the synchronization at this point? Nothing obvious was changed... IOW, what exactly was it that the shutdown had to wait for that is now ready? Or am I seeing this all wrong? What am I missing?

Contributor Author

What I'm trying to prevent is the daemon continuing its operation before it's safe to do so. The situation I'm trying to avoid is a multipass delete -p on a starting instance and the daemon destroying the corresponding *VirtualMachine object while it's starting, since the starting check is in another thread. This is a synchronization between the daemon thread and the async_wait_until_ssh_up() thread.

Collaborator

OK, I think I understand the goal better now

Resolved: src/daemon/daemon.cpp
Collaborator

@ricab ricab left a comment

OK, submitting replies as another review, to issue them in a single shot. And then actually adding one or two new points.


@gerboland
Contributor

I've spent some time testing this, and have not managed to break it. It's definitely a step in the right direction.

@ricab
Collaborator

ricab commented Apr 4, 2019

OK, I just wrote a reply to clarify the sort of thing I had in mind in one of the review comments. As agreed in the meeting today, I am handing the review to Gerry now.

Contributor

@gerboland gerboland left a comment

Second pass review, I'm happy to take this as it is. I can't force any functionality regression, and it seems stable.
Nice work!
bors r+

bors bot added a commit that referenced this pull request Apr 5, 2019
682: Daemon concurrent rpc operations r=gerboland a=townsend2010

Behavior of this branch (so far):
1. The check for ssh up in `launch`, `start`, and `restart`, should be asynchronous.
2. Issuing a `multipass delete -p <instance_name>` on a starting instance should be safe when using the qemu & libvirt backends.
3. Installing `sshfs` during `start` should be asynchronous like when defining a mount for the first time on a stopped instance.

Left to do:
1. Make installing `sshfs` during the `mount` command asynchronous.
2. Make other operations asynchronous, like preparing an image, downloading an image, and uncompressing an image.

Co-authored-by: Chris Townsend <[email protected]>
@townsend2010
Contributor Author

Bah, bors wants to use my comments in the, uh, comments section for the merge message. I'm going to fix that up and re-run bors.

bors r-

@bors
Contributor

bors bot commented Apr 5, 2019

Canceled

@townsend2010
Contributor Author

bors r=ricab,gerboland

bors bot added a commit that referenced this pull request Apr 5, 2019
682: Daemon concurrent rpc operations r=ricab,gerboland a=townsend2010

Initial framework for making the daemon much more concurrent in how it handles rpc calls and the associated operations.

Also with this, the check for detecting if ssh is up in the instance is now asynchronous.

Fixes #643 

Co-authored-by: Chris Townsend <[email protected]>
@bors
Contributor

bors bot commented Apr 5, 2019

Build failed

@townsend2010 townsend2010 merged commit 25b2261 into master Apr 5, 2019
@bors bors bot deleted the daemon-concurrent-rpc-operations branch April 5, 2019 17:35