Can processes be 'detached' from the runtime? #1507

Closed
fabiand opened this issue Jul 5, 2017 · 7 comments

@fabiand

fabiand commented Jul 5, 2017

We have a use case - libvirt in a container - where we would like to keep the processes (qemu, the VMs) running if the container goes down.

The whole point of containers is to contain processes, I'd still like to understand if it would be possible that processes (not pid1 in the container) could be kept running, even if the container runtime goes away.
I'm explicitly saying container runtime here, because I could imagine that the "container" might be able to outlive it's runtime, because it's based on kernel objects (namespaces, cgroups).

But I'm likely wrong, and thus would like to clarify if something like this would be possible?

@cyphar
Member

cyphar commented Jul 5, 2017

I'm not sure what you mean.

runc is designed so that you can run it in a detached mode, where the runc process will not exist after the container has started -- effectively your container process is the only thing running (and management of containers works as normal). In fact, the non-detached mode of runc is just the detached mode with runc acting as a dumb pipe and signal forwarder.

Does that answer your question? If you're asking about killing pid1 in a container but keeping the container namespaces around that's not possible due to kernel limitations (well specifically the PID namespace).

@fabiand
Author

fabiand commented Jul 6, 2017

Thanks for the quick reply.

I actually saw -d but failed to get it working reliably. After your reply I looked at it again and played with terminal: true, which got me a little further but led me to #1218, which is about what to attach std{in,out,err} to, IIUIC.

I think we can close this issue. One last thing I'd like to understand is whether, in the detached case, the 'container' is now solely kernel-space objects, i.e. just processes with namespaces and cgroups, or if there is still a user-space process supervising the container.
IIUIC there is no such process; I'm rather trying to get this confirmed.

@cyphar
Member

cyphar commented Jul 6, 2017

I actually saw -d but failed to get it working reliably.

Unfortunately the on-boarding for this is much harder than necessary. If you want to have a new pty created for your container (terminal: true), you need to have something hold open the master end (that's just how ptys work) -- this is a restriction of the kernel that is not trivial to get around. We have a sample implementation of a daemon that will keep the pty open for you, but that's just used for testing (and if you needed this you would know).
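The "something holds open the master end" requirement can be sketched in a few lines. This is an editor's minimal illustration of the kernel behavior, not runc's actual helper daemon: whichever process holds the master fd keeps the pty alive, and closing it hangs up the slave side.

```python
# Minimal sketch (not runc's helper): the holder of the pty master
# keeps the terminal alive; once the master fd is closed, the slave
# side gets hung up (EIO/SIGHUP for its users).
import os

master_fd, slave_fd = os.openpty()  # kernel allocates a pty pair

# A container's terminal would be the slave end; data written there
# shows up on the master (with \n translated to \r\n by the tty).
os.write(slave_fd, b"hello from the container side\n")
data = os.read(master_fd, 1024)
print(data)

# Keeping master_fd open is the entire job of the holder process.
os.close(slave_fd)
os.close(master_fd)
```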

However, if you just want to pass it the file descriptors it should use for stdin, stdout and stderr you can do that just by changing the stdio file descriptors for runc when you do runc create or runc run -d (with terminal: false). This is how I would recommend doing this.

% sed -i 's/"terminal": true/"terminal": false/g' config.json
% runc run -d ctr 0</dev/null 1>/dev/null 2>/dev/null
[running in background]

You can create some fifos or pipes and then just do everything that way. #1218 is entirely unrelated to this, it's an issue with checkpointing that I believe might actually have been fixed recently.
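The same redirection idea can be shown without runc at all. This sketch uses a plain pipe and a child process in place of fds 0/1/2 of `runc run -d`; the mechanics are identical, only the child differs:

```python
# Sketch: create a pipe, hand the write end to a child process as its
# stdout, and read the output from the read end we kept. With runc you
# would pass such fds as stdio of `runc run -d` (terminal: false).
import os
import subprocess

r, w = os.pipe()
proc = subprocess.Popen(["echo", "hello"], stdout=w)
os.close(w)                      # drop our copy so EOF is delivered
out = os.fdopen(r, "rb").read()
proc.wait()
print(out)                       # b'hello\n'
```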

I am very sad that this is not documented well at all. In our defense, a lot of these semantics are quite new and took us quite a while to agree on (and I think we still disagree about how users should be forced to use it).

the 'container' is now solely kernel-space objects, i.e. just processes with namespaces and cgroups?

That is correct. When I said "where the runc process will not exist after the container has started" I meant it. We only spawn processes necessary to set up the container, and after the init process for the container has started none of our code is running anymore. When managing containers we do everything using /proc and the pid of the pid1 in the container (which we store in a state directory).
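Managing a container purely from a stored pid can be sketched as follows. The state-file path and JSON key below are illustrative assumptions, not runc's documented format:

```python
# Hedged sketch: runc records the container's init pid in a state file
# under its state directory; the JSON key name here is an assumption.
import json
import os

def init_pid(state_path):
    """Read the container's init pid from a runc-style state file."""
    with open(state_path) as f:
        return json.load(f)["init_process_pid"]  # assumed key name

def pid_alive(pid):
    # runc itself consults /proc/<pid>; signal 0 is a portable
    # liveness probe that never actually delivers a signal.
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # pid exists but belongs to another user
```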

@fabiand
Author

fabiand commented Jul 6, 2017

Yes, I saw that runtime information is kept in some run dir. I like this, just my 2ct on that :)
And no worries about the documentation; I do understand that there are actually a few ways to understand/define the semantics.

So - It seems that we need to keep open the fds. And it actually sounds as if we do need a user-space component for this. But IIUIC, if this user-space component holding the fds dies, the container will die as well.
fifos/pipes have their limitations IIUIC, as they have their (byte) limits.

Which brings me back to say that we do need to keep some fd open somewhere in user-space (fifo, pipe, process) to allow the processes to run. Correct?
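The "(byte) limits" mentioned above are the pipe's kernel buffer. A quick way to see the limit in action is a non-blocking write loop, which fails with EAGAIN once the buffer is full (on Linux the default capacity is 64 KiB; the exact figure may vary):

```python
# Filling a pipe's kernel buffer: in non-blocking mode, writes raise
# BlockingIOError (EAGAIN) once the buffer is full, instead of blocking.
import os

r, w = os.pipe()
os.set_blocking(w, False)

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    pass

print(f"pipe buffer filled after {written} bytes")
os.close(r)
os.close(w)
```

Note that a full buffer only stalls the writer; as soon as a reader drains the pipe, writes proceed again, so the limit is about flow control rather than data loss.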

I wonder, how is it done by ocid - does ocid provide the fds used by runc?

@cyphar
Member

cyphar commented Jul 7, 2017

So - It seems that we need to keep open the fds. And it actually sounds as if we do need a user-space component for this.

Only if you want to have terminal: true.

Which brings me back to say that we do need to keep some fd open somewhere in user-space (fifo, pipe, process) to allow the processes to run. Correct?

[Note: fifos and pipes aren't in userspace]

It is correct that if you want to have terminal: true then you need to have something that holds the master end of the pty. This shouldn't be an issue -- if you didn't care about the output you would just dup /dev/null (in other words, in order to be able to read from a terminal you need to have the terminal fd open). I recognise that this seems inelegant, but it's due to kernel restrictions, and if you think about it for long enough you realise that it's not actually a restriction (with AF_UNIX sockets you can pass around the fd as much as you want).
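The AF_UNIX trick can be sketched concretely. A plain pipe stands in for a pty master below; the point is that the fd itself travels to another process over the socket (via SCM_RIGHTS, wrapped by Python 3.9+'s `socket.send_fds`/`recv_fds`):

```python
# Sketch: passing a file descriptor (stand-in for a pty master) over an
# AF_UNIX socket, so a different process can become the holder.
import os
import socket

parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

r, w = os.pipe()                     # r plays the role of the pty master
socket.send_fds(parent, [b"take this fd"], [r])   # Python 3.9+

msg, fds, _flags, _addr = socket.recv_fds(child, 1024, 1)
os.write(w, b"ping")
print(os.read(fds[0], 4))            # the received fd reads the same pipe
```

In real deployments the two socket ends would live in different processes, which is exactly how a holder daemon can be replaced or restarted without losing the terminal.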

I have some ideas about how you might be able to fix this issue in the future (it requires some not-very-nice hacks and kernel patches) but at the moment it's not possible. There are two main things it boils down to:

  • Mount namespaces are not hierarchical and you can't use NS_GET_PARENT with them.
  • open of a master pty fails with ENXIO.

I'm working on solving both of these problems in the kernel.

I wonder, how is it done by ocid - does ocid provide the fds used by runc?

crio has a manager for each container called conmon which creates and keeps open the fds as necessary (so the code is the same in both the terminal: true and terminal: false cases). I tried, but unfortunately you can't just use some files as the fds for the container process directly (kubelet requires you to implement attaching as well as logging). I am working on making the manager nicer (and Alexander Larsson is doing a great job of cleaning up the horrible C code that it used to be) but it's an unfortunate restriction that I'm not sure is possible to get around.

@fabiand
Author

fabiand commented Jul 11, 2017

Yes, you are obviously right that fifos and pipes are not in user-space.
And nice to hear that you look how to solve this in the kernel land.

So, in crio, is it still the case that if crio crashes, the processes/containers will die, because the fds get closed?

And feel free to close this issue.

@cyphar
Member

cyphar commented Jul 11, 2017

In crio a new conmon is spawned for every container. In order for any container to stay alive you need to keep its conmon alive. The crio daemon itself can be killed at any point (you only need it alive to manage conmon instances).

Closing. Feel free to ask in the crio issue tracker if you have any more questions about crio.
