SELinux changes for KVM-separated (Kata) containers

Understanding SELinux types that improve security in container engines such as Podman and CRI-O.

Posted: August 17, 2020 by Dan Walsh (Red Hat)

SELinux Policy type for KVM containers — Image by Alexas_Fotos from Pixabay

It has been a while since I have written about SELinux, but I continue to work with it in containers.

SELinux process types: cat and dog — Illustration by Máirín Duffy from The SELinux Coloring Book. CC BY-SA 4.0.

Many years ago, I wrote the first SELinux policy for containers, before Docker existed. I was working on libvirt-lxc at the time, and containers launched out of libvirt. Later, when the Docker project hit the scene, I adapted the container policy to the Docker engine, and the container-selinux policy and package were born. Almost everyone who uses containers with SELinux is using this policy.

The way the policy is designed allows the container processes to do their thing inside the container. I often call this "what happens in Vegas stays in Vegas." All of the container processes run as the container_t type, and all of the container content is labeled as container_file_t. The allow rules basically say that container_t can read/write/execute all content that is labeled container_file_t. If a container breaks out and tries to write content in /usr or /root or /var, it is blocked unless the files are labeled container_file_t. The policy does allow read/execute on content in /usr, so it is easy to volume mount in executables from this directory into the container.
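The allow rules described above can be pictured as a lookup table. The following is only a toy sketch in Python, not the real policy: the rule table is hand-written, and the types usr_t and admin_home_t are assumed here to stand for the labels on /usr and /root.

```python
# Toy model of SELinux type enforcement: an access is denied unless an
# allow rule names the (process type, file type) pair and the permission.
# The table below is a simplification for illustration, not the real
# container-selinux policy. usr_t and admin_home_t are assumed labels
# for content under /usr and /root.
ALLOW_RULES = {
    ("container_t", "container_file_t"): {"read", "write", "execute"},
    ("container_t", "usr_t"): {"read", "execute"},
}


def is_allowed(process_type, file_type, permission):
    """Return True only if an allow rule grants the permission."""
    granted = ALLOW_RULES.get((process_type, file_type), set())
    return permission in granted


if __name__ == "__main__":
    for target, perm in [
        ("container_file_t", "write"),
        ("usr_t", "execute"),
        ("usr_t", "write"),
        ("admin_home_t", "read"),
    ]:
        verdict = "allowed" if is_allowed("container_t", target, perm) else "denied"
        print(f"container_t -> {target} ({perm}): {verdict}")
```

The point of the sketch is the default: anything not listed is denied, which is why a process that escapes the container still cannot write to host content.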

Since all container processes run with the same type, and all content is labeled container_file_t, SELinux type enforcement does not prevent container processes from attacking other containers. Luckily, SELinux has another label mechanism called Multi-Category Security (MCS).

SELinux policies are written with 1024 different categories: 0-1023. In the MLS world, these categories are translated into higher-level names. Note that in MCS policies, the s0 sensitivity level has no real meaning. In MCS, we don't attach any meaning to the categories but take advantage of them to guarantee container uniqueness. Each container is assigned a combination of two categories, which allows approximately 500,000 unique containers on a system. You can read more about this in the link above. We assign an MCS level (like s0:c1,c2) to each container process and file, and the policy requires the process's MCS level to match (or dominate) the file's level; otherwise, access is denied. For example, a container_t process running with an MCS level of s0:c1,c2 is allowed to read/write all content with the container_file_t type and an MCS level of s0:c1,c2, s0:c1, s0:c2, or s0. We usually stick to labeling content with s0:c1,c2 and s0. If a file is labeled s0:c3,c2, a container running with s0:c1,c2 would not be able to read or write it, since MCS says that a process's categories have to dominate all categories on the file system object. Since each container runs with a unique pair of categories, no container can read or write another container's data.
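To make the dominance rule and the "approximately 500,000" figure concrete, here is a small Python sketch. It is a simplified model, not kernel code: the function names are made up for illustration, and it assumes levels are written as a sensitivity plus a comma-separated category list (no ranges such as c0.c1023).

```python
from math import comb


def parse_level(level):
    """Split an MCS level such as 's0:c1,c2' into (sensitivity, categories).

    Simplification: category ranges like 'c0.c1023' are not handled.
    """
    sensitivity, _, cats = level.partition(":")
    categories = frozenset(c for c in cats.split(",") if c)
    return sensitivity, categories


def dominates(process_level, file_level):
    """True if the process holds every category that is set on the file."""
    p_sens, p_cats = parse_level(process_level)
    f_sens, f_cats = parse_level(file_level)
    return p_sens == f_sens and f_cats <= p_cats


def unique_pairs(total_categories=1024):
    """Number of distinct two-category combinations."""
    return comb(total_categories, 2)


if __name__ == "__main__":
    for file_level in ["s0:c1,c2", "s0:c1", "s0:c2", "s0", "s0:c3,c2"]:
        print(file_level, dominates("s0:c1,c2", file_level))
    print(unique_pairs())
```

With 1024 categories, choosing two gives 1024 × 1023 / 2 = 523,776 combinations, which is where the rough figure of 500,000 comes from.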

SELinux dog process restrained by the kernel — Illustration by Máirín Duffy from The SELinux Coloring Book. CC BY-SA 4.0.

These rules have proven to be incredibly useful in blocking file system container escapes. Here is a list of some of the container escapes that have been blocked by SELinux:

CVE-2015-3629 Symlink traversal on container respawn allows local privilege escalation
CVE-2015-3627 Insecure opening of file-descriptor 1 leading to privilege escalation
CVE-2015-3630 Read/write proc paths allow host modification & information disclosure
CVE-2015-3631 Volume mounts allow LSM profile escalation
CVE-2016-9962 RunC Exec Vulnerability

Container engines

Container engines, like Podman and CRI-O, use the SELinux Go library to pick the types that containers will run with. Originally, the library read the /etc/selinux/targeted/contexts/lxc_contexts file.

$ cat /etc/selinux/targeted/contexts/lxc_contexts
process = "system_u:system_r:container_t:s0"
content = "system_u:object_r:virt_var_lib_t:s0"
file = "system_u:object_r:container_file_t:s0"
ro_file="system_u:object_r:container_ro_file_t:s0"
container_kvm_process = "system_u:system_r:container_kvm_t:s0"
sandbox_lxc_process = "system_u:system_r:container_t:s0"

The library reads the process and file fields and sets the labels appropriately.
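For readers curious what reading such a file involves, here is a rough Python sketch of a parser for the key = "value" format shown above. It is not the Go library's actual code; the function name is hypothetical, and skipping lines that start with # is an assumption about how comments would be written.

```python
SAMPLE = '''process = "system_u:system_r:container_t:s0"
content = "system_u:object_r:virt_var_lib_t:s0"
file = "system_u:object_r:container_file_t:s0"
ro_file="system_u:object_r:container_ro_file_t:s0"
container_kvm_process = "system_u:system_r:container_kvm_t:s0"
sandbox_lxc_process = "system_u:system_r:container_t:s0"
'''


def parse_contexts(text):
    """Parse lines of the form  key = "value"  into a dictionary.

    Whitespace around '=' is optional, as the ro_file line shows.
    Blank lines and lines starting with '#' are skipped (an assumption).
    """
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line
        labels[key.strip()] = value.strip().strip('"')
    return labels


if __name__ == "__main__":
    labels = parse_contexts(SAMPLE)
    print(labels["process"])
    print(labels["file"])
```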

The basic container_t type has served us well in the last few years, but I wanted to add some flexibility and some additional policy types.

Additional types

The container_t type works pretty well for standard Linux containers, but as we have been working with Kata containers, we realized that we need a new type. A Kata container is different from a standard Linux container in that it runs inside a virtual machine. Whereas a standard container communicates directly with the host kernel, a Kata container runs on a guest kernel, and the host kernel only sees the virtual machine process, usually qemu. Kata also uses the new virtiofs daemon (virtiofsd) to access host files through volumes. We want the same SELinux type to apply to this daemon. We use SELinux to prevent a rogue process inside a Kata container from taking advantage of a vulnerability in virtiofsd or qemu and using it to attack host content.

Both qemu and virtiofsd require different access from what standard Linux containers are currently allowed. For example, qemu needs access to network tunneling devices and needs to create content in the host's /run directory. The virtiofsd daemon needs to be able to mount some file systems. We could have extended the container_t type to add these additional accesses, but adding this type of access to container_t means all containers would get it. I decided to go with the better security of generating a new container_kvm_t type, which could be created with only the access necessary for running qemu and virtiofsd.

I wrote a new policy type, container_kvm_t, which should be able to support KVM-separated containers without forcing us to extend additional permissions to container_t. I decided to stick with the container_file_t for content on the host, which allows us to continue to share content between different types of containers. Both container_kvm_t and container_t are still assigned unique MCS labels, guaranteeing separation between the containers.
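As a sketch of how a per-container label might be put together, the Python below combines a base context with a freshly picked category pair. The class and function names are hypothetical, and the random selection strategy is an assumption made for illustration, not a description of what the container engines actually do.

```python
import random


class McsAllocator:
    """Hands out unused two-category MCS levels (illustrative only)."""

    def __init__(self, total_categories=1024, rng=None):
        self._total = total_categories
        self._rng = rng or random.Random()
        self._used = set()
        self._max_pairs = total_categories * (total_categories - 1) // 2

    def allocate(self):
        """Return a level such as 's0:c12,c507' that was not handed out before."""
        if len(self._used) >= self._max_pairs:
            raise RuntimeError("no unused category pairs left")
        while True:
            low, high = sorted(self._rng.sample(range(self._total), 2))
            if (low, high) not in self._used:
                self._used.add((low, high))
                return f"s0:c{low},c{high}"


def with_level(base_context, level):
    """Swap the trailing level of 'user:role:type:s0' for a container level."""
    user, role, type_, _ = base_context.split(":", 3)
    return f"{user}:{role}:{type_}:{level}"


if __name__ == "__main__":
    allocator = McsAllocator()
    # A KVM-separated container and a traditional one draw from the same pool,
    # so their levels differ even though their content type is shared.
    print(with_level("system_u:system_r:container_kvm_t:s0", allocator.allocate()))
    print(with_level("system_u:system_r:container_t:s0", allocator.allocate()))
```

Because both process types draw levels from one pool, two containers never share a category pair, whichever type they run as.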

Hopefully, the container_kvm_t type will work not only with qemu-launched containers, but also with VMs started by Firecracker, and maybe even gVisor.

To get Kata to work with this new policy, I had to get some changes into the upstream Kata project. Basically, I wanted Kata to launch qemu with the SELinux label passed in through the OCI Runtime Specification, which now carries the new container type. Previously, Kata attempted to use this label inside of the VM. Since I believe we should be controlling the containers from the outside, and most Kata containers don't have SELinux enabled inside of the VM, it made sense to move control outside. We have had some conversations about potentially supporting SELinux on the inside, but for now, the label applies only to the VM and not to the container processes inside of the VM.

The next step was to modify the Go bindings to allow container engines to pick the correct SELinux label for the type of container that they will run. I updated the container-selinux package to add a new file, /usr/share/containers/selinux/contexts.

$ cat /usr/share/containers/selinux/contexts
process = "system_u:system_r:container_t:s0"
file = "system_u:object_r:container_file_t:s0"
ro_file="system_u:object_r:container_ro_file_t:s0"
kvm_process = "system_u:system_r:container_kvm_t:s0"
init_process = "system_u:system_r:container_init_t:s0"

Now the Go bindings look for this file first and then fall back to the lxc_contexts file if it does not exist. Container engines, like Podman and CRI-O, can get a kvm_process container type or a process type depending on whether they are running a KVM-separated container or a traditional container.
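The lookup order and the choice between the two process labels can be sketched in a few lines of Python. This is a rough model of the behavior described above, not the Go bindings themselves; the function names and the kvm flag are hypothetical.

```python
import os

# Paths as given in the article, in the order they are tried.
CONTEXT_FILES = (
    "/usr/share/containers/selinux/contexts",
    "/etc/selinux/targeted/contexts/lxc_contexts",
)


def find_contexts_file(candidates=CONTEXT_FILES, exists=os.path.exists):
    """Return the first candidate path that exists, or None if none do.

    The 'exists' parameter is injectable so the logic can be tried on a
    machine that has neither file.
    """
    for path in candidates:
        if exists(path):
            return path
    return None


def pick_process_label(labels, kvm=False):
    """Pick kvm_process for KVM-separated containers, process otherwise.

    Returns None when the chosen key is missing (an assumption; the real
    library may behave differently).
    """
    key = "kvm_process" if kvm else "process"
    return labels.get(key)


if __name__ == "__main__":
    labels = {
        "process": "system_u:system_r:container_t:s0",
        "kvm_process": "system_u:system_r:container_kvm_t:s0",
    }
    print(find_contexts_file())
    print(pick_process_label(labels))
    print(pick_process_label(labels, kvm=True))
```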

The container_init_t type

Notice that there is also an init_process type, container_init_t. This type is for traditional Linux containers that run systemd as PID 1. Systemd-based containers expect to be able to modify the cgroup file system for their processes. This is something I don't want traditional containers to be allowed to do. Years ago, we added an SELinux boolean, container_manage_cgroup, which, when enabled, allows all containers to manipulate the cgroup file system if they gain access. With the new infrastructure, we can generate a new type, container_init_t, to be run with systemd-based containers. This allows other containers to run with tighter security and spares container users from having to manipulate the SELinux policy by turning on a boolean. Systemd-based containers running in Podman and CRI-O will work out of the box because the engines can run them with the container_init_t type.

Generating your own SELinux types

If you want to customize and generate your own SELinux policy types for running containers, I advise you to look at the Udica project. This tool allows users to create their own types for specific containers. Please refer to this blog for more information.

Conclusion

We have added a couple more SELinux types into the container engine world, but I still believe that the number of different types should be minimal. With a significant increase in SELinux types comes confusion about how to use them. Making the container engines smart enough to understand which SELinux type to choose also keeps the complexity away from users while increasing the security of the system.
