Skip to content
This repository has been archived by the owner on Apr 3, 2018. It is now read-only.

Use macvlan to connect the VM to the container network #393

Merged
merged 2 commits into from
Oct 30, 2017

Conversation

mcastelino
Copy link
Collaborator

@mcastelino mcastelino commented Oct 3, 2017

Networking: Add support for multi-queue mactap

Provide multiple methods to connect the Virtual machine
to the container network. The current implementation allows
this to be chosen at a node level. In the future we can
enhance this to be dynamic, where the container interface
type is used to determine the optimal interconnection method.

Add support for multi-queue macvtap as an alternate means to
connect the container network interface to the virtual machine.

Depends on: ciao-project/ciao#1510
Depends on: #410

Signed-off-by: Manohar Castelino [email protected]

network.go Outdated
newLink = &netlink.Macvtap{
LinkAttrs: netlink.LinkAttrs{Name: name},
Mode: netlink.MACVLAN_MODE_BRIDGE,
TxQLen: 500,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 500? Would an admin ever need to change this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jodh-intel this is the default value that has to be set for packets to flow. The C netlink library sets this. The go netlink library does not and hence needs to be set. No the admin will not need to change this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jodh-intel changed this to pick the value from the parent. This is the best way to avoid the manifest constant.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great - thanks.

network.go Outdated
case ModelMacVtap:
return tapNetworkPair(netPair)
case ModelEnlightened:
return fmt.Errorf("Unsupported networking model")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add DefaultNetInterworkingModel to the error here and in the error below. Also, due to the similarity of this function and disconnectVMNetwork, I wonder if it would be worth making both these functions call a new function that both validates netPair and either calls the "connect" or "disconnect" function:

func toggleVMNetwork(netPair *NetworkInterfacePair, connect bool) error

In fact, with the above function, we wouldn't really need connectVMNetwork and disconnectVMNetwork as the intent would be clear from the connect param.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jodh-intel addressed.

@amshinde we need to make this even more generic to include SRIOV, VFIO and vhost-user. So having an pair abstraction is not ideal. We need a more generic abstraction that support all models. I will rework this based on VFIO and vhost-user needs

network.go Outdated

vethLink, err := getLinkByName(netHandle, netPair.VirtIface.Name, &netlink.Veth{})
if err != nil {
return fmt.Errorf("Could not get veth interface: %s", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the error include details from netPair or is it worth adding this to the error (same question to the errors returned below)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add more details to both the parent error and here. The message does not provide enough debug information otherwise.

network.go Outdated
// the one inside the VM in order to avoid any firewall issues. The
// bridge created by the network plugin on the host actually expects
// to see traffic from this MAC address and not another one.
netPair.TAPIface.HardAddr = vethLinkAttrs.HardwareAddr.String()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds pretty significant - does it justify a log call?

network.go Outdated
netPair.VirtIface.HardAddr, netPair.VirtIface.Name, err)
}

mcastSnoop := false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume this won't ever need to be true, but can you add a comment explaining why this value is being passed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a workaround for a kernel bug. Let me dig out the bug details and add it here.

@mcastelino mcastelino force-pushed the topic/macvtap branch 3 times, most recently from ee60be3 to b68a499 Compare October 5, 2017 00:44
network.go Outdated

tapDev := fmt.Sprintf("/dev/tap%d", i)

f, err := os.OpenFile(tapDev, os.O_RDWR, 0666)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you define the perms as a const at the top of the file? Also, does it absolutely have to be 0666?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this. What perms do we use for all the sockets etc we create in CC runtime?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proxy uses 0660:

... but for hypervisor sockets we rely on the defaults provided by qemu fwics (there doesn't appear to be a way to request particular perms from qemu for chardev devices). Taking a look at a pod:

$ sudo ls -ld /run/virtcontainers                                                                     
d--------- 3 root root 60 Oct  9 09:09 /run/virtcontainers
$ sudo ls -ld /run/virtcontainers/pods
d--------- 4 root root 80 Oct  9 11:15 /run/virtcontainers/pods
$ sudo ls -ld /run/virtcontainers/pods/
total 0
d--------- 3 root root 220 Oct  9 11:15 2128c38aa804ac73045e32c737914a6e6c36e33daa80ba6669d989a72e81652b
d--------- 3 root root 220 Oct  9 09:09 412db0928eccbe5a180a5e4be7577457d25fa845b3504129fec014ab89ae375e
$ sudo ls -l /run/virtcontainers/pods/2128c38aa804ac73045e32c737914a6e6c36e33daa80ba6669d989a72e81652b
total 8
d--------- 2 root root 100 Oct  9 11:15 2128c38aa804ac73045e32c737914a6e6c36e33daa80ba6669d989a72e81652b
srwxr-x--- 1 root root   0 Oct  9 11:15 console.sock
srwxr-x--- 1 root root   0 Oct  9 11:15 ctrl.sock
srwxr-x--- 1 root root   0 Oct  9 11:15 hyper.sock
-rw-r--r-- 1 root root   0 Oct  9 11:15 lock
srwxr-x--- 1 root root   0 Oct  9 11:15 monitor.sock
-rw-r--r-- 1 root root 514 Oct  9 11:15 network.json
-rw-r--r-- 1 root root 149 Oct  9 11:15 state.json
srwxr-x--- 1 root root   0 Oct  9 11:15 tty.sock
$ 

Aside: I wonder if we want to drop read perms from those state files (and lock).

/cc @grahamwhaley, @sameo.

@mcastelino mcastelino force-pushed the topic/macvtap branch 4 times, most recently from d98bfe8 to b09c912 Compare October 10, 2017 23:19
@mcastelino mcastelino changed the title WIP: Supporting multiple ways to connect the VM to the container network WIP: Use macvlan to connect the VM to the container network Oct 11, 2017
@mcastelino mcastelino force-pushed the topic/macvtap branch 2 times, most recently from 0d026f2 to f38cf2a Compare October 11, 2017 23:46
Copy link
Collaborator

@sboeuf sboeuf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content sounds pretty good :)
I have some structural comments though
Oh and I think you should add some tests in order to prevent the code coverage to decrease.

const (
defaultRouteDest = "0.0.0.0/0"
defaultRouteLabel = "default"
defaultFilePerms = 0600
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that on purpose this file permission will only allow for root ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sboeuf that was intentional. Is there a reason to allows someone besides root. If we do end up setting up a group for our use then group would also make sense. We do not seem to have a consistent perms model at this time.

@jodh-intel this is something we need to decide if we will support a model where we create group for virt containers, give permission to kvm etc as part of de-priv the runtime.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcastelino ok I understand. BTW, a group would be great to de-priv the runtime/qemu but the hard part is about translating this group inside the VM. Some implementation for the agent for sure.

network.go Outdated
@@ -30,10 +32,38 @@ import (
"golang.org/x/sys/unix"
)

// Introduces constants related to network routes.
// NetworkInterworkingModel defines the type of container to
// Virtcontainer network interworking models
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say something like:

// NetInterworkingModel defines the network model connecting
// the container interface to the virtual machine.

// ModelBridged uses a linux bridge to interconnect
// the container interface to the VM. This is the
// safe default that works for most cases except
// macvlan and ipvlan
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we saying macvlan and ipvlan here instead of macvtap and ipvtap ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sboeuf afaik macvtap is a TAP interface for macvlan. So macvlan seems correct here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sboeuf @sameo is right. The CNI/CNM plugins today create macvlan and ipvlan. We need to enhance/extend them the near future to create ipvtap and macvtap. So the comment here is valid. For the ipvtap and macvtap we wil use a different interworking model, where we do not need to create any new interfaces but tap directly into them.

network.go Outdated
// Introduces constants related to network routes.
// NetworkInterworkingModel defines the type of container to
// Virtcontainer network interworking models
type NetInterworkingModel int
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this NetInterConnectModel instead of NetInterworkingModel. Don't you think it would be more appropriate ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sboeuf I would not. The reason is that it is an interworking model. There are cases for macvtap direct and ipvtap and the future veth-tap where we will not be doing any interconnection. We will be directly tapping in. Hence interworking model, instead of interconnection model.

type NetworkInterfacePair struct {
ID string
Name string
VirtIface NetworkInterface
TAPIface NetworkInterface
NetInterworkingModel
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you embedding the type here ? What's the benefits compared to the definition of a new field of type NetInterworkingModel ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sboeuf there is no need to define a field when there is only one element of the type. It makes the code simpler in some cases as you can directly deference the fields of the embedded type.

if err := netHandle.LinkSetUp(vethLink); err != nil {
return fmt.Errorf("Could not enable veth %s: %s", netPair.VirtIface.Name, err)
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you think we could try to factorize most of the code in this function tapNetworkPair() since a lot is identical with bridgeNetworkPair()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sboeuf the ordering is different between the two. If not we get issues like ARP table corruption. I need to see if the bridge+linux code can use the same order or if I can find a consistent order. But for now I would keep them seperate.

// Setup the multiqueue fds to be consumed by QEMU as macvtap cannot
// be directly connected.
// Ideally we want
// netdev.FDs, err = createMacvtapFds(netdev.ID, int(config.SMP.CPUs))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cannot you do that ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sboeuf we need more context. I feel we need to pass in global context to all the APIs all the time. This will be needed in the future.

@@ -307,6 +533,35 @@ func bridgeNetworkPair(netPair *NetworkInterfacePair) error {
return nil
}

func untapNetworkPair(netPair NetworkInterfacePair) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for this function, I think we can factorize a lot with unBridgeNetworkPair().

network.go Outdated
@@ -440,11 +455,17 @@ func tapNetworkPair(netPair *NetworkInterfacePair) error {
return fmt.Errorf("Could not enable TAP %s: %s", netPair.TAPIface.Name, err)
}

// Clear the IP addresses from the veth interface to prevent ARP conflict
err = clearIPs(vethLink)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use instead:

if err := clearIPs(vethLink); err != nil {
    ...

network.go Outdated
}

for _, addr := range addrs {
err = netlink.AddrDel(link, &addr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use instead

if err := netlink.AddrDel(link, &addr); err != nil {
    ...

network.go Outdated
// that matches our minimum vCPU configuration
// Another option is to defer this to ciao qemu library which does have
// global context but cannot handle errors when setting up the network
netPair.VmFds, err = createMacvtapFds(tapLink.Attrs().Index, 8)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you "surface" that magic number into a const?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jodh-intel ideally this should be a constant at all. This should match the number of vCPUs for that POD/container and capped at a some limit. But for now I will move it up to a constant till we get global context.

@mcastelino mcastelino force-pushed the topic/macvtap branch 2 times, most recently from ed2c6e4 to 7e187e2 Compare October 12, 2017 18:40
@sameo
Copy link
Collaborator

sameo commented Oct 12, 2017

@mcastelino will address the code design comments in follow up PRs. So for now:
LGTM

Approved with PullApprove Approved with PullApprove

Copy link
Collaborator

@sboeuf sboeuf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the fact that a follow up PR will refactor some code:
LGTM

@mcastelino mcastelino changed the title WIP: Use macvlan to connect the VM to the container network Use macvlan to connect the VM to the container network Oct 26, 2017
Revendor the netlink library to pick up changes required to
support macvtap creation within a network namespace.

Signed-off-by: Manohar Castelino <[email protected]>
@sboeuf
Copy link
Collaborator

sboeuf commented Oct 26, 2017

@mcastelino you have some go vet issues here

Provide multiple methods to connect the Virtual machine
to the container network. The current implementation allows
this to be chosen at a node level. In the future we can
enhance this to be dynamic, where the container interface
type is used to determine the optimal interconnection method.

Add support for multi-queue macvtap as an alternate means to
connect the container network interface to the virtual machine.

Include a workaround for kernel limitation around macvtap index
generation logic when interfaces are created within a namespace.

The switch from bridge+tap to macvtap brings in support for
  - multiqueue
  - lower latency
  - lower jitter
  - lower cpu consumption

Longer term this need to be configurable and the runtime should
be able to choose the networking model at a per pod level.

Signed-off-by: Manohar Castelino <[email protected]>
@coveralls
Copy link

coveralls commented Oct 26, 2017

Coverage Status

Coverage decreased (-1.4%) to 64.183% when pulling 8910d1b on mcastelino:topic/macvtap into 68f1c9e on containers:master.

@mcastelino
Copy link
Collaborator Author

@sboeuf this PR works as long as it is rebased on top of what cc-runtime is tracking which is
e8b0c3a

So the question is, how was CI passing on tip of virtcontainers prior to this PR?

However this PR applied on master of virtcontainers will break due to existing
breaking API changes. So there are changes in virtcontainers that break cc-runtime and this PR rebased on tip is catching it.

To make matters worse we have ciao rename from https://github.com/ciao-project/ciao to https://github.com/ciao-project/ciao

@chavafg how does our CI handle the case where a PR lands on the tip of virtcontainers which has breaking API changes and needs cc-runtime to be changed at the same time to accept it?

We should rethink our vendoring strategy where we are vendoring in two places and there is so much co dependency.

@sboeuf
Copy link
Collaborator

sboeuf commented Oct 28, 2017

@mcastelino yeah I understand... The thing is that right now our CI for virtcontainers copies the thing from virtcontainers to the runtime in order to test a PR, but it does not work if this same PR revendor ciao from virtcontainers because those changes are not taken into account...
That's the thing about trying to test upward (I mean testing a consumer of our package), instead of the basic testing of the repo itself. I think in cases like your PR, we should simply make an exception and merge it if we agree on it. It is gonna be properly tested when revendoring virtcontainers into the runtime anyway.

About the vendoring, I don't see the problem. If we don't vendor our virtcontainers library, it's gonna get broken all the time and this is gonna be annoying to maintain. From the runtime perspective, dep binary is always gonna be able to make the proper vendoring of virtcontainers even if the runtime is using the same packages than virtcontainers.

BTW, about a case in between, that is when we break compatibility from virtcontainers (and not from a vendored package), I have this PR #454 to solve this.

@mcastelino
Copy link
Collaborator Author

@sboeuf We have few things going wrong

We vendor packages in runtime and virtcontainers and they get vendored at different points in time
And virtcontainers is always ahead of runtime. Which means any significant changes to virtcontainers dependencies will mean our virtcontainers CI which runs runtime tests will not work.

So to fix this we should either
a. Not run runtime CI as part of virtcontainers and wait for the CI of runtime once things are properly vendored
b. Improve our virtcontainers CI so that it does a full virtual revendor prior to run

To be able to do b, it means runtime should not use revision based vendoring for any common dependencies and for virtcontainer dependencies that get pulled into runtime.

However this does not address the case where there are API breaking changes in the virtcontainers (which seems to be the same here). A API breaking change cannot be handled by our PR in any case, as it needs changes to the runtime. To handle API breaking changes we need approach a.

However what I do not get here is, why is the CI failing for my PR, which has not made any API breaking changes and how did the API breaking change get merged into virtcontainers. Maybe I am missing something.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants