
Linux containers, a lighter-weight virtualization alternative to virtual machines, are gaining momentum. The High Performance Computing (HPC) community is eyeing them with interest, hoping that they can provide the isolation and configurability of virtual machines without the performance penalties.

In this article, I will show a simple example of libvirt-based container configuration in which I assign the container one of the ultra-low latency (usNIC) enabled Ethernet interfaces available in the host. This allows bare-metal performance of HPC applications, but within the confines of a Linux container.

Before we jump into the specific libvirt configuration details, let’s first quickly review the following points:

  1. What “container” means in the context of this article.
  2. What limitations make it impossible to rely solely on the available namespaces to assign host devices to containers and guarantee some degree of isolation.
  3. What tools can be used to bridge the above-mentioned gaps.

Introduction to Linux Containers

Fun fact: there is no formal definition of a Linux “container.” Most people identify a Linux container with keywords like LXC, libvirt, Docker, namespaces, cgroups, etc.

Some of those keywords identify user space tools used to configure and manage some form of containers (LXC, libvirt, and Docker). Others identify some of the building blocks used to define a container (namespaces and cgroups).

Even in the Linux kernel, there is no definition of a “container.”

However, the kernel does provide a number of features that can be combined to define what many people call a “container.” None of these features are mandatory, and depending on what level of sharing or isolation you need between containers — or between the host and containers — the definition/configuration of a “container” will (or will not) make use of certain features.

In the context of this article, I will focus on assignment of usNIC enabled devices in libvirt-based LXC containers. For simplicity, I will ignore all security-related aspects.

Network namespaces, PCI, and filesystems

Given the relationship between devices and the filesystem, I will focus on filesystem related aspects and ignore the other commonly configured parts of a container, such as CPU, generic devices, etc.

Assigning containers their own view of the filesystem, with different degrees of sharing between host filesystem and container filesystem, is already possible and easy to achieve (see mount documentation for namespaces). However, what is still not possible is to partition or virtualize (i.e., make namespace-aware) certain parts of the filesystem.

Filesystem elements such as the virtual filesystems commonly mounted in /proc, /sys, and /dev are examples that fall into that category. These special filesystems provide a lot of information and configuration knobs that you may not want to share between the host and all containers, or between containers.

Also, a number of device drivers place special files in /dev that user space can use to interact with the devices via the device driver.

Even though network interfaces do not normally need to add anything to /dev/ (i.e., there is no /dev/enp7s0f0), usNIC enabled Ethernet interfaces have entries in /dev because the Libfabric and Verbs libraries require access to those entries.

Sidenote: For more information on why modern Linux distributions no longer use interface names like ethX, and how names like enp7s0f0 are derived, see this document.

The tools you use to manage containers may assign a new network namespace to each container you create by default, or may require you to ask for it explicitly. Libvirt, as explained here, does it automatically when you assign a host network interface to the container. Specifically, when you create a new network namespace, you have the option of moving any of the network interfaces available in the host (e.g., enp7s0f0) into the container.

You can do this by hand using the ip link command, or you can have that assignment taken care of for you by one of the container management tools. Later we will see how libvirt does it for us.

Once you have moved a network interface into a container, that network device will be visible and usable only inside that container.
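
For illustration, here is roughly what the manual approach looks like with the ip tool; the namespace name demo_ns is just a placeholder for this sketch:

[host]# ip netns add demo_ns                 # create a new, named network namespace
[host]# ip link set enp7s0f0 netns demo_ns   # move the host interface into it
[host]# ip netns exec demo_ns ip link        # enp7s0f0 now shows up here...
[host]# ip link                              # ...and is no longer listed in the host namespace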

Figure 1: (a) host with no containers; (b) container that has been assigned a new network namespace which shares all network interfaces with the host; (c) container that has been assigned a new network namespace and one of the host network interfaces (no longer visible in the host).

However, the Ethernet adapter also has an identity as a PCI device. As such, it appears in /sys and can be seen via commands like lspci from any network namespace — not only from the one where the associated network device (enp7s0f0) lives.

This gap derives from the fact that the Ethernet device is hooked to both the PCI layer and the networking layer, but only the latter has been assigned a namespace.
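
You can see this asymmetry for yourself from inside any container that has its own network namespace but still shares /sys with the host (a hypothetical session; it assumes the pciutils package is installed in the container filesystem):

[container]# ip link show enp7s0f0       # fails: that netdev lives in a different network namespace
[container]# lspci | grep -i ethernet    # ...yet the adapter is still listed, because lspci reads /sys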

Figure 2: (a) host with no containers; (b) container that has been assigned a new network namespace which cannot access any of the host network interfaces; (c) container that has been assigned a new network namespace and one of the host network interfaces.

Tools you can use to assign devices to containers

You can classify containers according to different criteria, such as what they will be used to run. At the two extremes, you have these options:

  • Application container
  • Distribution container

In the first case, you only need to populate the container filesystem with what is strictly needed to run a given application. Most likely, not much more than a set of libraries. Other parts of the filesystem may be shared with the host (including the virtual filesystems), or may not be needed at all.

In the second case, you want to assign the container a full filesystem and have less (if any) sharing with the host filesystem, including the special entries like /proc, /sys, /dev, etc.

Even though full distribution container support is still not considered “ready for prime time,” due to the limitations imposed by a few special filesystems as discussed above, there are a number of generic tools available that can be used to provide some degree of device/resource assignment and isolation between containers:

  • Security infrastructures like SELinux and AppArmor
  • Bind mounts
  • The cgroup device controller (via device whitelists)
  • Etc.

You can check LXD for an example of a project whose goal is to add whatever is missing in order to make containers as isolated as virtual machines in terms of resource usage/access.

In the section “Example of libvirt LXC container configuration” we will see a simple example of how you can tell libvirt to use bind mounts and the cgroup device controller to assign a usNIC enabled Ethernet interface to a container.

Support for bind mounts has been available for a long time (see man mount for the details).
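
As a quick refresher, a bind mount simply makes an existing file or directory visible at a second location. Here is a minimal manual sketch (for illustration only; in the example later on we let libvirt set this up for us):

[host]# mkdir -p /usr/local/var/lib/lxc/container_1/rootfs/dev/infiniband
[host]# mount --bind /dev/infiniband /usr/local/var/lib/lxc/container_1/rootfs/dev/infiniband
# the host entries are now also visible under the container rootfs
[host]# umount /usr/local/var/lib/lxc/container_1/rootfs/dev/infiniband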

The cgroup device controller may already be enabled by default on your distro. If not, you can enable it with this kernel configuration option:

  • General setup
    • Control Group support
      • Device controller for cgroups

You can find some documentation about this feature in the kernel file Documentation/cgroups/devices.txt. We will not configure it manually as described in that document; instead, we will tell libvirt to do that for us.
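
For the curious, here is a hedged sketch of both the kernel configuration check and the manual interface described in devices.txt; we will not need any of this, since libvirt drives the device controller for us, and the cgroup name demo is purely hypothetical (231:193 is the uverbs device we will meet later):

[host]# grep CONFIG_CGROUP_DEVICE /boot/config-$(uname -r)   # expect CONFIG_CGROUP_DEVICE=y
[host]# mkdir /sys/fs/cgroup/devices/demo                    # create a new device cgroup
[host]# echo a > /sys/fs/cgroup/devices/demo/devices.deny    # start from an empty whitelist
[host]# echo 'c 231:193 rwm' > /sys/fs/cgroup/devices/demo/devices.allow   # allow one char device
# tasks added to /sys/fs/cgroup/devices/demo/tasks can then access only the whitelisted devices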

Loading the required kernel modules and understanding the role of key filesystem entries

For a detailed description of how to deploy usNIC you can refer to the usNIC deployment guide (available at cisco.com). Keep in mind that:

  1. The installation of the kernel modules is only needed in the host (not the container).
  2. In the container filesystem, you only need to install the user space libraries and packages.

The only missing piece, which is the focus of this article, is to make sure that certain files created by step 1 are visible and usable inside the container’s filesystem.

Normally, users do not need to have a detailed knowledge of what files are created by the kernel modules and used by the user space libraries. In our case, however, we do need to have some knowledge about these files in order to properly populate the container filesystem.

Before I show you the libvirt XML configuration, let’s first discuss the role of three key files/directories we will need to tell libvirt about.

Once you have created a “Virtual NIC” (vNIC) on the Cisco UCS Virtual Interface Card (VIC) and enabled the usNIC feature in it (per the Cisco documentation cited above), you will see the following three filesystem entries in the host:

  1. /dev/infiniband/uverbsX
    This is a character device used by the user space library to configure a usNIC enabled network interface.
  2. /sys/class/infiniband/usnic_X/
    This is a directory used by the usNIC kernel driver to export a number of configuration parameters. For example, the iface file in this directory tells you which network interface (visible with ifconfig) this usNIC entry is associated with.
  3. /sys/class/infiniband_verbs/uverbsX/
    Among the data exported here by the Linux Verbs API, you may find these two files useful:

    • dev
      This is the major:minor device ID, which matches what you will see in /dev/infiniband/uverbsX. You can refer back to this information if you want to check whether libvirt configures the cgroup device whitelist properly (see the example below).
    • ibdev
      This is the associated usnic_X entry in /sys/class/infiniband/usnic_X/

Note that:

  • The /sys/class/infiniband/usnic_X/ directory will be populated when you load the usNIC kernel driver module (i.e., usnic_verbs.ko).
  • The /dev/infiniband/ and /sys/class/infiniband_verbs/ directories will also be populated when you load the usNIC kernel driver module.

In order to find the mapping between one of the network interfaces visible with ifconfig and the associated uverbsX entry in /dev/infiniband, you can either use the files in /sys described above, or use the usd_devinfo command that comes with the usnic-utils package.
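
For example, using the /sys files just described, two cat commands are enough to walk from a uverbsX entry back to the network interface name (the values below are the ones from the setup used later in this article):

[host]# cat /sys/class/infiniband_verbs/uverbs1/ibdev    # which usnic_X is behind uverbs1?
usnic_1
[host]# cat /sys/class/infiniband/usnic_1/iface          # which netdev is usnic_1 bound to?
enp7s0f0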

Example of libvirt LXC container configuration

Libvirt describes the configuration of containers (as well as virtual machines) with an XML file. Here is a link to detailed documentation of all libvirt’s XML options. In the context of this article, I recommend reading the following sections of that documentation:

  • Filesystem mounts
  • Device nodes
  • Filesystem isolation
  • Device access

Let’s start with a simple container configuration and add the delta needed to assign one usNIC enabled host Ethernet interface to the container. This example shows how to create a container on a Cisco UCS C240-M3 rack server running CentOS 7.

Here is a stripped-down version of the container XML; I have removed the details that are not relevant for this discussion:


<domain type='lxc'>
  <name>container_1</name>
  <memory unit='GiB'>8</memory>
  <currentMemory unit='GiB'>0</currentMemory>
  <os>
    <type arch='x86_64'>exe</type>
    <init>/sbin/init</init>
  </os>
  <devices>
    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/usr/local/var/lib/lxc/container_1/rootfs'/>
      <target dir='/'/>
    </filesystem>
  <console type='pty'/>
  </devices>
</domain>

 

The only detail worth noting is that the container root filesystem is located at /usr/local/var/lib/lxc/container_1/rootfs in the host.
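
If you want to try it yourself, the usual libvirt workflow applies. Here is a hedged sketch, assuming the XML above has been saved as container_1.xml (a hypothetical filename) and the rootfs directory already contains a working root filesystem:

[host]# virsh -c lxc:/// define container_1.xml   # register the container with libvirt
[host]# virsh -c lxc:/// start container_1        # boot it (runs /sbin/init from the rootfs)
[host]# virsh -c lxc:/// console container_1      # attach to its console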

Note that with this basic configuration, and according to the section “Device Nodes” mentioned above, the container’s /dev tree will not contain any of the special entries from the host’s /dev tree, including the /dev/infiniband directory that we need for usNIC:


[container_1]# ls /dev/infiniband
 ls: cannot access /dev/infiniband: No such file or directory

 

However, since /sys is shared with the host, you can see the entries associated with the usNIC enabled Ethernet interfaces:


[container_1]# find /sys/class -name uverbs*
 /sys/class/infiniband_verbs/uverbs0
 /sys/class/infiniband_verbs/uverbs1
 /sys/class/infiniband_verbs/uverbs2
 /sys/class/infiniband_verbs/uverbs3
[container_1]# find /sys/class -name usnic*
 /sys/class/infiniband/usnic_0
 /sys/class/infiniband/usnic_1
 /sys/class/infiniband/usnic_2
 /sys/class/infiniband/usnic_3

 

But notice that none of the /dev/infiniband/uverbsX devices are present (yet) in the container. Running a simple usNIC diagnostic program in the container shows warnings (one for each device I have configured on my server):


[container_1]# /opt/cisco/usnic/bin/usd_devinfo
 usd_open_for_attrs: No such device
 usd_open_for_attrs: No such device
 usd_open_for_attrs: No such device
 usd_open_for_attrs: No such device

 

Since we did not assign any host network interface to the container, by default, libvirt allowed the container to see all Ethernet interfaces (i.e., it did not create a new network namespace):


[container_1]# ip link
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 8: enp6s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
 link/ether 00:25:b5:00:00:04 brd ff:ff:ff:ff:ff:ff
 9: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
 link/ether 00:25:b5:00:00:14 brd ff:ff:ff:ff:ff:ff
 10: enp8s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
 link/ether 00:25:b5:00:00:24 brd ff:ff:ff:ff:ff:ff
 11: enp9s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
 link/ether 00:25:b5:01:01:0f brd ff:ff:ff:ff:ff:ff

 

Now we edit the libvirt configuration to assign one usNIC enabled interface to the container. This means that inside the container:

  1. /dev/infiniband/ will show an entry for the assigned usNIC enabled interface
  2. ifconfig will also show the usNIC enabled Ethernet interface.

Let’s assign enp7s0f0 (i.e., usnic_1) to the container. Here is the new libvirt LXC container configuration (the main additions compared to container_1 are the two <hostdev> elements):


<domain type='lxc'>
  <name>container_2</name>
  <memory unit='GiB'>8</memory>
  <currentMemory unit='GiB'>0</currentMemory>
  <os>
    <type arch='x86_64'>exe</type>
    <init>/sbin/init</init>
  </os>
  <devices>
    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/usr/local/var/lib/lxc/centos_container/rootfs'/>
      <target dir='/'/>
    </filesystem>
    <hostdev mode='capabilities' type='misc'>
      <source>
        <char>/dev/infiniband/uverbs1</char>
      </source>
    </hostdev>
    <hostdev mode='capabilities' type='net'>
      <source>
        <interface>enp7s0f0</interface>
      </source>
    </hostdev>
  <console type='pty'/>
  </devices>
</domain>

 

You can find more details about the above two new pieces of configuration here.

If I start the container with the new “container_2” configuration, this is what I can see now from within it:

  1. Only one network interface (enp7s0f0)
  2. The device node /dev/infiniband/uverbs1
  3. The same four entries in /sys (as with the previous configuration container_1)

Specifically:


[container_2]# ip link
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 9: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
 link/ether 00:25:b5:00:00:14 brd ff:ff:ff:ff:ff:ff
[container_2]# ls -ls /dev/infiniband/
total 0
 0 crwx------. 1 root root 231, 193 Apr 1 20:44 uverbs1
[container_2]# find /sys/class -name uverbs*
 /sys/class/infiniband_verbs/uverbs0
 /sys/class/infiniband_verbs/uverbs1
 /sys/class/infiniband_verbs/uverbs2
 /sys/class/infiniband_verbs/uverbs3
[container_2]# find /sys/class -name usnic*
 /sys/class/infiniband/usnic_0
 /sys/class/infiniband/usnic_1
 /sys/class/infiniband/usnic_2
 /sys/class/infiniband/usnic_3

 

Here is how the usNIC diagnostic command usd_devinfo reports the visible usNIC enabled network interfaces (there are still some warnings because of the uverbsX entries that are present in /sys but not in /dev/infiniband):


[container_2]# /opt/cisco/usnic/bin/usd_devinfo           
usd_open_for_attrs: No such device
usnic_1:
        Interface:               enp7s0f0
        MAC Address:             00:25:b5:00:00:14
        IP Address:              10.0.7.1
        Netmask:                 255.255.255.0
        Prefix len:              24
        MTU:                     9000
        Link State:              UP
        Bandwidth:               10 Gb/s
        Device ID:               UCSB-PCIE-CSC-02 [VIC 1225] [0x0085]
        Firmware:                2.2(2.5)
        VFs:                     64
        CQ per VF:               6
        QP per VF:               6
        Max CQ:                  256
        Max CQ Entries:          65535
        Max QP:                  384
        Max Send Credits:        4095
        Max Recv Credits:        4095
        Capabilities:
          CQ sharing: yes
          PIO Sends:  no
usd_open_for_attrs: No such device
usd_open_for_attrs: No such device

 

Let’s compare the content of /dev/infiniband in the host and in the container:


[container_2]# ls -ls /dev/infiniband/
total 0
 0 crwx------. 1 root root 231, 193 Apr 1 20:44 uverbs1

[host]# ls -ls /dev/infiniband/
total 0
 0 crw-rw-rw-. 1 root root 231, 192 Mar 31 17:30 uverbs0
 0 crw-rw-rw-. 1 root root 231, 193 Mar 31 17:30 uverbs1
 0 crw-rw-rw-. 1 root root 231, 194 Mar 31 17:30 uverbs2
 0 crw-rw-rw-. 1 root root 231, 195 Mar 31 17:30 uverbs3

 

As you can see, uverbs1 — and only uverbs1 — is visible in the container. The device major number for all uverbsX entries is 231, while the device minors are 192/193/194/195.
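
As anticipated earlier, the dev file under /sys/class/infiniband_verbs lets you confirm the major:minor pair of the device node we just assigned; we will use that value in a moment to read the cgroup device whitelist:

[container_2]# cat /sys/class/infiniband_verbs/uverbs1/dev
231:193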

Let’s now compare the devices.list device whitelist for the container and for the host:


[container_2]# cat /sys/fs/cgroup/devices/devices.list
 c 1:3 rwm
 c 1:5 rwm
 c 1:7 rwm
 c 1:8 rwm
 c 1:9 rwm
 c 5:0 rwm
 c 5:2 rwm
 c 10:229 rwm
 c 231:193 rwm
 c 136:* rwm

[host]# cat /sys/fs/cgroup/devices/devices.list
 a *:* rwm

 

As you can see from the two commands above:

  • The hostdev/misc entry in the libvirt XML config added the 231:193 rule to the container device whitelist
  • The rest of the devices are the default ones added by libvirt
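
If you want to convince yourself that the whitelist is actually enforced, one quick experiment (a hedged sketch, assuming the container root still has CAP_MKNOD) is to try creating device nodes from inside the container:

[container_2]# mknod /dev/infiniband/uverbs0 c 231 192   # not whitelisted: expect "Operation not permitted"
[container_2]# mknod /tmp/uverbs1_copy c 231 193         # 231:193 is whitelisted, so this should succeed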

We can see that “ping” works just fine from inside the container (using the enp7s0f0 interface):


[container_2]# ip addr show dev enp7s0f0
9: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 00:25:b5:00:00:14 brd ff:ff:ff:ff:ff:ff
    inet 10.0.7.1/24 brd 10.0.7.255 scope global enp7s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::225:b5ff:fe00:14/64 scope link 
       valid_lft forever preferred_lft forever
[container_2]# ping -c 1 10.0.7.2
 PING 10.0.7.2 (10.0.7.2) 56(84) bytes of data.
 64 bytes from 10.0.7.2: icmp_seq=1 ttl=64 time=0.279 ms
--- 10.0.7.2 ping statistics ---
 1 packets transmitted, 1 received, 0% packet loss, time 0ms
 rtt min/avg/max/mdev = 0.279/0.279/0.279/0.000 ms

 

We can test the usnic_1 interface by running the usd_pingpong command against another container, similarly configured with a usNIC enabled interface, on another Cisco UCS C240-M3 rack server connected over a regular IP/Ethernet network:


[container_2]# /opt/cisco/usnic/bin/usd_pingpong -d usnic_1 -h 10.0.7.2
open usnic_1 OK, IP=10.0.7.1
QP create OK, addr -h 10.0.7.1 -p 3333
sending params...
payload_size=4, pkt_size=46
posted 63 RX buffers, size=64 (4)
100000 pkts, 1.790 us / HRT

 

The 1.79 microsecond half-round-trip ping-pong time shown above demonstrates that we are getting bare-metal performance inside the container.

Wrapup

As Linux containers become more mainstream, potentially even in HPC, it will become more important to understand how to expose native hardware functionality properly. Documentation and “best practice” knowledge are still somewhat scarce in the rapidly evolving Linux containers ecosystem; this blog entry explains some of the underlying concepts and shows how adding just a few lines of XML allows bare-metal performance with the isolation and configurability of Linux containers.



Author

Christian Benvenuti

Technical Leader

CSPG UCS System Engineering