How does OCI/runc system path constraining work to prevent remounting such paths?

The background of my question is a set of test cases for lxkns, my Go package for discovering Linux kernel namespaces. In these tests I create a new child user namespace as well as a new child PID namespace inside a test container. I then need to remount /proc; otherwise I would see the wrong process information and could not look up the correct process-related information, such as the namespaces of the test process inside the new child user+PID namespaces (without resorting to guerrilla tactics).

The test harness/test setup is essentially this and fails without --privileged (I’m simplifying by granting all capabilities and switching off seccomp and AppArmor in order to cut through to the real meat):

JavaScript

Of course, the path of least resistance, as well as of least beauty, is to use --privileged, which gets the job done; after all, this is a throw-away test container (maybe there is beauty in the sheer lack of it).

Recently, I became aware of Docker’s --security-opt systempaths=unconfined, which (afaik) translates into empty maskedPaths and readonlyPaths in the resulting OCI/runc container spec. The following Docker run command succeeds as needed; in this example it simply returns silently, so the mount was carried out correctly:

JavaScript

In the failing setup, that is, when running without --privileged and without --security-opt systempaths=unconfined, the mounts inside the child user and PID namespaces inside the container look as follows:

JavaScript
  1. what mechanism exactly is blocking the fresh mount of procfs on /proc?
  2. what is preventing me from unmounting /proc/kcore, etc.?

Answer

Some more digging turned up this answer to “About mounting and unmounting inherited mounts inside a newly-created mount namespace”, which points in the correct direction but needs additional explanation (not least because it is based on a misleading man page paragraph about mount namespaces being hierarchical, which Michael Kerrisk fixed some time ago).

Our starting point is the moment runc sets up the (test) container. To mask system paths, especially in the container’s future /proc tree, runc creates a set of new mounts that mask out individual files using /dev/null and subdirectories using tmpfs. The result is procfs mounted on /proc, plus further sub-mounts on top of it.
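To make the masking step more concrete, here is a minimal Go sketch of the decision described above: files get /dev/null bind-mounted over them, directories get a tmpfs. The `maskMount` type and `planMask` function are hypothetical names invented for illustration and are not runc's actual code; /proc/acpi is merely an assumed example of a masked directory.

```go
package main

import "fmt"

// maskMount is a hypothetical description of one masking mount.
type maskMount struct {
	Target string // path inside the container that gets hidden
	Source string // what is mounted on top of Target
	Kind   string // "bind" for files, "tmpfs" for directories
}

// planMask sketches the choice described in the text: a file is masked
// with a bind mount of /dev/null, a directory with a tmpfs.
func planMask(path string, isDir bool) maskMount {
	if isDir {
		return maskMount{Target: path, Source: "tmpfs", Kind: "tmpfs"}
	}
	return maskMount{Target: path, Source: "/dev/null", Kind: "bind"}
}

func main() {
	// Illustrative inputs only; the real list comes from the container spec.
	fmt.Printf("%+v\n", planMask("/proc/kcore", false))
	fmt.Printf("%+v\n", planMask("/proc/acpi", true))
}
```

The point to take away is that every masked path becomes a separate mount stacked on top of the /proc mount, which matters for the locking rules discussed next.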

Now the test container starts and at some point a process unshares into a new user namespace. Please keep in mind that this new user namespace (again) belongs to the (real) root user with UID 0, as a default Docker installation won’t enable running containers in new user namespaces.

Next, the test process also unshares into a new mount namespace, so this new mount namespace belongs to the newly created user namespace, but not to the initial user namespace. According to section “restrictions on mount namespaces” in mount_namespaces(7):

If the new namespace and the namespace from which the mount point list was copied are owned by different user namespaces, then the new mount namespace is considered less privileged.

Please note that the criterion here is: the “donor” mount namespace and the new mount namespace have different user namespaces; it doesn’t matter whether they have the same owner user (UID), or not.

The important clue now is:

Mounts that come as a single unit from a more privileged mount namespace are locked together and may not be separated in a less privileged mount namespace. (The unshare(2) CLONE_NEWNS operation brings across all of the mounts from the original mount namespace as a single unit, and recursive mounts that propagate between mount namespaces propagate as a single unit.)

Because the /proc mount and its masking submounts can no longer be separated, it is not possible to (re)mount /proc (question 1): a fresh procfs mount would expose everything the locked submounts hide, so the kernel refuses it from the less privileged mount namespace. In the same sense, it is impossible to unmount /proc/kcore, because that would allow unmasking (question 2).
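To see which masking submounts are in play, one can list everything mounted below /proc from inside the affected namespaces. The following is a minimal Go sketch that parses /proc/self/mountinfo for this purpose; the `mount` type and the `parseMountinfo` and `submountsUnder` helpers are hypothetical names for illustration and are not part of lxkns.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// mount holds the two mountinfo fields this sketch cares about.
type mount struct {
	Point  string // mount point (5th field of a mountinfo line)
	FSType string // filesystem type (first field after the "-" separator)
}

// parseMountinfo extracts mount point and filesystem type from text in
// /proc/[pid]/mountinfo format, skipping lines it cannot make sense of.
func parseMountinfo(text string) []mount {
	var mounts []mount
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Optional fields start at index 6 and end at a lone "-".
		sep := -1
		for i := 6; i < len(fields); i++ {
			if fields[i] == "-" {
				sep = i
				break
			}
		}
		if sep < 0 || sep+1 >= len(fields) {
			continue
		}
		mounts = append(mounts, mount{Point: fields[4], FSType: fields[sep+1]})
	}
	return mounts
}

// submountsUnder returns the mounts located strictly below dir.
func submountsUnder(mounts []mount, dir string) []mount {
	prefix := strings.TrimSuffix(dir, "/") + "/"
	var subs []mount
	for _, m := range mounts {
		if strings.HasPrefix(m.Point, prefix) {
			subs = append(subs, m)
		}
	}
	return subs
}

func main() {
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range submountsUnder(parseMountinfo(string(data)), "/proc") {
		fmt.Printf("%s (%s)\n", m.Point, m.FSType)
	}
}
```

If this prints any entries when run in the child user and mount namespaces, those are the submounts that travel together with /proc as one locked unit.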

Now, when deploying the test container using --security-opt systempaths=unconfined, this results in a single /proc mount only, without any of the masking submounts. Consequently, and according to the man page rules cited above, there is only a single mount, which we are allowed to (re)mount, provided we hold CAP_SYS_ADMIN (the capability that covers mounting, besides tons of other interesting functionality).

Please note that it is possible to unmount masked /proc/ paths inside the container while still in the original (=initial) user namespace and when possessing (not surprisingly) CAP_SYS_ADMIN. The (b)lock only kicks in with a separate user namespace, which is why some projects strive to deploy containers in their own new user namespaces (which unfortunately has side effects, not least on container networking).

User contributions licensed under: CC BY-SA