How to debug issues with volumes mounted on rootless containers

Get a fundamental understanding of user namespaces to better understand rootless Podman containers.

Posted: June 10, 2021 by Matthew Heon (Red Hat)

Volume control — Image by Thomas Breher from Pixabay

One of the most frequent questions I am asked about rootless Podman is how to debug issues with volumes mounted into the container. This question is deceptively hard. In many ways, running Podman without root is almost identical to running it as root. Unfortunately, this is not always true, and volumes are one of the areas with the most significant differences. Here, I’ll explain in detail what these differences are, what kinds of errors they can cause, and how you can work around them. To start, we need some background information on how rootless containers work, beginning with one of the most fundamental features of rootless Podman: user namespaces.

The user namespace

One of the fundamental security features of containers is Linux kernel namespaces. A namespace is a way of isolating a process (or group of processes) from the rest of the system by limiting what it can see. There are many different namespaces, each with a different effect. For example, the PID namespace limits what PIDs a process can view—only PIDs within the namespace are visible and are numbered independently from the host. The process still has a PID on the host as well, so PID 2000 on the host could be PID 1 in the namespace. As of this writing, the namespaces the kernel provides are Mount, PID, Network, IPC, UTS, User, cgroup, and time, each isolating a different aspect of the system; the one we care most about for this blog is the user namespace.

User namespaces isolate the users and groups available in the container from those available to the host system. A user namespace works by mapping users in the container to users on the host. For example, we could map users 0 to 1000 in the container to users 100000 to 101000 on the host (groups are also mapped in an identical manner, but we’ll focus on users for simplicity). This mapping acts very similarly to the PID namespace we described above, but with users.

From the host, all accesses from root in the container (UID 0) will appear to be from UID 100000. Inside the container, any file owned by user 100000 on the host will appear as owned by UID 0 (root). An interesting question is what happens to users not mapped into the container: what if I mount a volume owned by user 1001 on the host into a container using the user namespace I described? In this case, the kernel will map any UID or GID not valid in the namespace to UID/GID 65534, a user and group called nobody. This is not a normal user or group.
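
The mapping arithmetic described above can be sketched in a few lines of Python. This is only an illustrative model of the example mapping (container IDs 0 to 1000 onto host IDs 100000 to 101000), not how the kernel implements it; the function names and the OVERFLOW_ID constant are hypothetical names chosen for this sketch.

```python
# Illustrative model of a single user namespace mapping; all names are hypothetical.
OVERFLOW_ID = 65534   # ID shown for unmapped users and groups ("nobody")

CONTAINER_START = 0   # first ID inside the namespace
HOST_START = 100000   # host ID that the first container ID maps to
SIZE = 1001           # container IDs 0..1000 inclusive


def to_host(container_id):
    """Return the host ID behind a container ID, or None if it is not mapped."""
    offset = container_id - CONTAINER_START
    if 0 <= offset < SIZE:
        return HOST_START + offset
    return None


def to_container(host_id):
    """Return the ID that a host owner appears as inside the namespace."""
    offset = host_id - HOST_START
    if 0 <= offset < SIZE:
        return CONTAINER_START + offset
    return OVERFLOW_ID  # unmapped host IDs show up as nobody


print(to_host(0))          # 100000: root in the container
print(to_container(1001))  # 65534: host user 1001 is outside the mapped range
```

The second call is the volume case from the question above: a file owned by host user 1001 falls outside the mapped range, so it appears inside the container as owned by nobody.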

It is still possible to interact with files owned by nobody if permissions allow (e.g., you can read a file owned by nobody that is world-readable and write to that file if it is world-writable), but you cannot change their ownership. User namespaces also grant limited versions of specific capabilities that are normally only available to root. The typical example is that processes in a user namespace can mount certain types of filesystems, like tmpfs.

User namespaces are extremely useful because they allow us to act as root within the container without actually being root on the system. We can use user namespaces to separate containers from different users on multi-tenant systems—one user’s containers would be run as UIDs 10000 to 10999, another’s as 11000 to 11999, and so on. Within each container, it appears as though the application is root, and through the limited capabilities granted, it can do most common operations (installing packages, for example).

Suppose, however, that an application manages to break out of the container. In that case, it is neither running as root on the system nor running as the same UID and GID as the containers of any other user, so its ability to attack other parts of the system is extremely limited.

Despite these benefits, adoption of user namespaces was held back by technical limitations, most notably the fact that, until very recently, there was no way to remap UIDs and GIDs at the filesystem level. To have a container using UIDs 10000 to 10999, we had to make a copy of its image and then chown every file in said image, adding 10000 to its existing UID. This chown can be very slow and (in many filesystems) dramatically increases the amount of space required.


Rootless containers

Where user namespaces have become extremely useful and popular is rootless containers. A non-root user in Linux has access to only one UID and GID (their own). However, containers expect to access more than one user and group: many files in container images are not owned by root, and applications will often be run as a non-root user in the container to enforce privilege separation within the container.

For some environments (high-performance computing, HPC, being a notable one), having only a single user and group in the container is acceptable (even desirable). For most other environments, one user and group is a severe limitation to the usefulness of the container. We can use user namespaces to gain these additional users and groups that we need to act like a typical container.

However, we need elevated privileges to obtain these extra users and groups. This is what the newuidmap and newgidmap executables (and the /etc/subuid and /etc/subgid config files they read) are for: they grant us access to a block of users and groups, which are then mapped into a user namespace for rootless containers to use. The limitations of user namespaces regarding filesystem support still apply to a degree but are mitigated by only having a single set of UIDs and GIDs for each user to use.
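
For reference, each line of /etc/subuid and /etc/subgid has the form name:start:count. A minimal sketch of reading such entries follows. The sample line mirrors the allocation used later in this article, and parse_subid is a hypothetical helper written for illustration, not part of Podman or shadow-utils.

```python
def parse_subid(text):
    """Parse /etc/subuid-style content into (name, start, count) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, start, count = line.split(":")
        entries.append((name, int(start), int(count)))
    return entries


# Sample content (an assumption for illustration): 65536 IDs starting at 100000.
sample = "mheon:100000:65536\n"
for name, start, count in parse_subid(sample):
    print(f"{name}: host IDs {start} to {start + count - 1}")
```

On a real system, you could pass in the contents of /etc/subuid instead of the sample string to see which host ID block each user has been allocated.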

Also of note is the fact that the kernel will automatically handle the chown operation for us if we unpack the image inside of a user namespace. The user shifting of the namespace ensures that the correct UID is assigned on creation instead of requiring the container runtime to set it manually.

The added capabilities of a user namespace are also essential for some of the things rootless containers need—without the ability to mount FUSE and tmpfs filesystems, rootless containers would be much more limited (to the point of being almost unusable).

User namespaces in Podman

Now that we understand how user namespaces in general work, let’s discuss how they are implemented in rootless Podman.

All rootless Podman containers are run in a user namespace, even if the user does not have more than one UID and GID available. All of a user’s containers share a single user namespace, held open by the rootless pause process. Namespaces are usually pruned by the kernel when no more processes are in them, so keeping a process around that does nothing but sleep and never exits will keep the user namespace alive.

The first thing that a rootless Podman process does is join the rootless user namespace (or create a new namespace and pause process if they do not yet exist). As part of creating the user namespace, Podman will run the newuidmap and newgidmap executables to grant any additional UIDs and GIDs the user has been allocated in /etc/subuid and /etc/subgid (the default amount granted at user creation being 65536 of each). You can view the available user mappings in the output of the podman info command, in the idMappings field:

mheon@podman-rhel8-test $ podman info
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536

Please note that the rootless user namespace is not re-created by default if the /etc/subuid and /etc/subgid files are changed; this is done by running the podman system migrate command. If you have edited these files and Podman does not seem to recognize your changes, run this command.

All rootless Podman commands are run within the rootless user namespace created to ensure that we have the correct user mappings and privileges. Even simple informational commands, like podman info, require the rootless user namespace. As such, a nonfunctional user namespace is a showstopper for rootless Podman. Fortunately, this does not happen often. When it does, we usually find that it’s caused by insufficient file permissions on the newuidmap and/or newgidmap binaries on the system (missing a file capability, usually). Reinstalling the package containing them (called shadow-utils on RHEL, CentOS, and Fedora) will usually resolve this.

Once the rootless user namespace has been created, we can begin running containers. Using volumes with these containers is the first point at which most users will encounter the practical differences between root and rootless Podman. The user will run a container with a volume mounted in and promptly discover that the container cannot access files in the volume, despite everything seemingly set correctly.

For example, let’s examine a simple Podman command:

podman run --user 1000:1000 -v /home/mheon/data:/data:Z ubi8 sh

This command is being run by my user, mheon, with UID and GID set to 1000 (the same user the container was instructed to use). My user has been allocated 65536 UIDs and GIDs starting at UID/GID 100000 via /etc/subuid and /etc/subgid. The SELinux context has been set so the container can access the directory via the :Z option on the volume mount.

However, access to the /data folder in the container will be denied, and the only error message that the system will return is a generic permission denied from the kernel. Any user that did not know what a user namespace is would have no idea what is wrong.

However, now that we do know, the cause should be obvious: The user mapping means that UID 1000 in the container is not actually UID 1000 on the host. You can view the user in the container and the actual user on the host via the podman top command:

mheon@podman-rhel8-test $ podman top -l user,huser
USER   HUSER
1000   100999

Here, USER is the user in the container, while HUSER is the user on the host.
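
The HUSER value can be reproduced from the uidmap entries in the podman info output above. The following is a small sketch of that lookup, not Podman's own code; host_uid and UIDMAP are hypothetical names.

```python
# uidmap entries copied from the podman info output shown earlier:
# (container_id, host_id, size)
UIDMAP = [
    (0, 1000, 1),
    (1, 100000, 65536),
]


def host_uid(container_uid, uidmap=UIDMAP):
    """Translate a UID in the rootless user namespace to a host UID."""
    for container_id, host_id, size in uidmap:
        if container_id <= container_uid < container_id + size:
            return host_id + (container_uid - container_id)
    return None  # not mapped into the namespace


print(host_uid(0))     # 1000: root in the container is mheon on the host
print(host_uid(1000))  # 100999: matches the HUSER column
```

Container UID 1000 falls in the second range, which starts at container ID 1, so it lands 999 places into the block beginning at host UID 100000.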

Still, knowing why something is happening does not mean we know how to fix it. We still want to run our rootless Podman container with a specific volume mounted into it. How do we do that? Fortunately, there are many ways to fix this, which I will cover below.

The first solution

The first is simple: omit the --user option so that the container command runs as root inside the container. As noted above, by default, Podman maps the user running the container to root in the container, so now we’ll be accessing the volume as UID/GID 1000 on the host, despite being root in the container. Running as root in a rootful container is a potential security issue because you are running as the system’s root user: if an attacker broke out of the container, they would be able to act as root on the system. The entire point of a rootless container is that this is never true, so these security concerns are largely a non-factor.

Unfortunately, the solution of running the container as root falls flat when the image is specifically written to use a non-root user. Some container images feature complex entry point scripts to drop permissions that cannot easily be modified. These will require an alternative solution.

The second solution

The second option is to grant the user running in the container permission to read and write the folder mounted in from the host. As of Podman v3.1.0, this can be done automatically via the :U volume option to the -v flag (e.g. -v /home/mheon/data:/data:Z,U).

This volume option will automatically adjust the ownership of the directory so that the user running in the container (whatever user UID 1000 in the container is mapped to on the host) will own the directory. In versions without this option, the podman unshare command can be used to enter the rootless user namespace and then chown the directory to be owned by the user running the container.

In this case, podman unshare chown 1000:1000 /home/mheon/data would change ownership of the directory on the host to the user and group that map to UID/GID 1000 in the user namespace. Please note that, if ownership is changed, all parent directories on the host will also require the execute permission for all users (chmod a+x…) to ensure the directory in question can be accessed.
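
To find parent directories that would block access, a short check like the following may help. It is a sketch that only inspects the execute bit for "others", on the assumption that the mapped user is neither the owner nor in the group of those directories; parents_missing_execute is a hypothetical helper name.

```python
import os
import stat
from pathlib import Path


def parents_missing_execute(path):
    """List parent directories of path that lack execute permission for 'others'."""
    missing = []
    for parent in Path(path).resolve().parents:
        mode = os.stat(parent).st_mode
        if not mode & stat.S_IXOTH:
            missing.append(parent)
    return missing


# Example path taken from this article; adjust it for your own system.
target = "/home/mheon/data"
if os.path.exists(target):
    for directory in parents_missing_execute(target):
        print(f"{directory} needs chmod a+x")
```

Home directories are a common culprit here, since they are often created without any permissions for other users.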

Unfortunately, the chown approach does come with its own set of disadvantages. The /home/mheon/data directory is in my user’s home directory, but it is no longer owned by my user (in this case, it’s owned by user and group 100999). In the rootless user namespace, the mheon user can act as root and read, write, and modify files owned by that user; but it cannot do any of these things outside it. Podman does provide a command to enter a shell inside the rootless user namespace (podman unshare) that can be used to modify or remove such files, but the inability to manage these files otherwise is inconvenient.

The third solution

The third option is to use the --userns=keep-id option to podman run. This flag tells Podman to do two things: first, to set the user the container runs as to the UID and GID of the user that ran Podman (unless explicitly overridden by the --user flag), and second, to reorder the users mapped into the container such that the user that ran Podman is mapped to their own UID and GID, instead of root (this is done via a second user namespace, nested inside the rootless user namespace, created just for this container). The user that the container runs as is not root in the user namespace but still maps to the user running Podman (mheon) on the host and can thus access the directory mounted in the example, /home/mheon/data.
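
One way to picture the reordering is as two mappings applied in sequence. The sketch below assumes the nested namespace uses the layout that rootless Podman typically reports in /proc/self/uid_map for UID 1000 with keep-id (container 0 to 999 onto 1 to 1000, container 1000 onto 0, and the rest unchanged). Treat that layout as an assumption for illustration rather than documented behavior; the helper names are hypothetical.

```python
def translate(uid, ranges):
    """Map uid through (inside_start, outside_start, size) ranges; None if unmapped."""
    for inside, outside, size in ranges:
        if inside <= uid < inside + size:
            return outside + (uid - inside)
    return None


# Rootless user namespace -> host, from the podman info output earlier.
ROOTLESS_TO_HOST = [(0, 1000, 1), (1, 100000, 65536)]

# Container -> rootless user namespace with --userns=keep-id for UID 1000.
# Assumed layout, for illustration only.
KEEP_ID_TO_ROOTLESS = [(0, 1, 1000), (1000, 0, 1), (1001, 1001, 64536)]


def keep_id_host_uid(container_uid):
    """Resolve a container UID to a host UID through both namespaces."""
    middle = translate(container_uid, KEEP_ID_TO_ROOTLESS)
    if middle is None:
        return None
    return translate(middle, ROOTLESS_TO_HOST)


print(keep_id_host_uid(1000))  # 1000: the container user is mheon on the host
print(keep_id_host_uid(0))     # 100000: container root is now a subordinate UID
```

Under this assumed layout, UID 1000 in the container resolves to UID 1000 on the host, which is why the mounted directory becomes accessible without changing its ownership.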

This will not solve all access errors, however. Another common issue is attempting to mount in a file or device that the user running Podman has access to, but only through membership in a supplemental group. For example, let’s say that my user, mheon, is part of the kvm group that owns /dev/kvm, and I choose to mount /dev/kvm into a container using podman run -t -i -v /dev/kvm:/dev/kvm fedora bash. The container will be unable to access /dev/kvm, despite the fact that it is running as my user on the host (which should have access).

The reason is that, for security purposes, containers will (by default) drop all additional groups as part of their creation. This behavior can be disabled, but only using the crun OCI runtime (this should be default as of Podman 3.0 on all distributions except RHEL) by passing a special annotation (--annotation run.oci.keep_original_groups=1).

In the upcoming Podman v3.2.0, this will be available via a special argument to the group-add flag (--group-add keep-groups). Please note that, while we can retain access to these groups, we do not have permission to add them to the rootless user namespace—we can only do that to the users and groups allocated to us in /etc/subuid and /etc/subgid. An ls -al on /dev/kvm in the container will find it is owned by nobody:nobody (as its actual owner and group, root:kvm, are not mapped into the container).

However, it can be accessed because, despite the fact that the kvm group is not part of the user namespace, the container process is still part of the group in the eyes of the kernel. This is somewhat limiting in that it is not possible to explicitly create files as one of these supplemental groups (as nobody is not a real group we can interact with), but it does suffice to give the container access to content on the host it would otherwise be unable to reach, and new files in directories that are owned by a supplemental group and have the setgid (SGID) bit set will still get the correct group.

Another type of common error is encountered while pulling images with files or folders owned by high UID users. Any file or folder owned by a UID or GID too large to be included in the user namespace will produce an error. I previously wrote a blog about this and potential solutions, titled “Why can’t rootless Podman pull my image?”

Conclusion

One of Podman’s strongest features is our strong support for rootless containers, and it’s not hard to see why people are excited. Rootless containers are easy to set up, more secure than root containers, and can do almost anything a container run as root can do. Of course, the key word is almost: because the overall experience with root and rootless is so similar, the differences can be confusing and often are not easy to explain. After reading this blog, you should have a strong grasp of one of the largest of these differences and how to work with Podman to get your containers running the way you want.
