How does OCI/runc system path constraining work to prevent remounting such paths?

The background to my question is a set of test cases for my Linux-kernel namespaces discovery Go package lxkns, where I create a new child user namespace as well as a new child PID namespace inside a test container. I then need to remount /proc, as otherwise I would see the wrong process information and could not look up the correct process-related information, such as the namespaces of the test process inside the new child user+PID namespaces (without resorting to guerilla tactics).

The test harness/test setup is essentially this and fails without --privileged (to cut through to the real meat, I'm simplifying by granting all capabilities and switching off seccomp and AppArmor):

docker run -it --rm --name closedboxx --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox unshare -Umpfr mount -t proc /proc proc
mount: permission denied (are you root?)

Of course, the path of least resistance as well as least beauty is to use --privileged, which gets the job done, and as this is a throw-away test container, maybe there is beauty in the sheer lack of it.
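
For reference, the --privileged variant looks like this; a sketch only, since --privileged already grants all capabilities and disables the seccomp and AppArmor profiles, so the extra --security-opt flags become redundant. As with the unconfined example below, success manifests itself as silence:

docker run -it --rm --name closedboxx --privileged busybox unshare -Umpfr mount -t proc /proc proc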

Recently, I became aware of Docker's --security-opt systempaths=unconfined, which (afaik) translates into empty maskedPaths and readonlyPaths lists in the resulting OCI/runc container spec. The following docker run command succeeds as needed; it just returns silently, indicating that the mount was carried out correctly:

docker run -it --rm --name closedboxx --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined busybox unshare -Umpfr mount -t proc /proc proc
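
To verify what the flag translates into, one option (a sketch, assuming a container kept alive just long enough to inspect it) is to query the container's HostConfig, where Docker records the masked and read-only paths it passes on to runc; both lists should come back empty:

docker run -d --rm --name closedboxx --security-opt systempaths=unconfined busybox sleep 60
docker inspect --format '{{ .HostConfig.MaskedPaths }} {{ .HostConfig.ReadonlyPaths }}' closedboxx
[] []

Without systempaths=unconfined, the same query returns the usual lists of /proc and /sys paths to be masked and made read-only.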

In case of the failing setup, when running without --privileged and without --security-opt systempaths=unconfined, the mounts inside the child user and PID namespaces inside the container look as follows:

docker run -it --rm --name closedboxx --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox unshare -Umpfr cat /proc/1/mountinfo
693 678 0:46 / / rw,relatime - overlay overlay rw,lowerdir=/var/lib/docker/overlay2/l/AOY3ZSL2FQEO77CCDBKDOPEK7M:/var/lib/docker/overlay2/l/VNX7PING7ZLTIPXRDFSBMIOKKU,upperdir=/var/lib/docker/overlay2/60e8ad10362e49b621d2f3d603845ee24bda62d6d77de96a37ea0001c8454546/diff,workdir=/var/lib/docker/overlay2/60e8ad10362e49b621d2f3d603845ee24bda62d6d77de96a37ea0001c8454546/work,xino=off
694 693 0:50 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
695 694 0:50 /bus /proc/bus ro,relatime - proc proc rw
696 694 0:50 /fs /proc/fs ro,relatime - proc proc rw
697 694 0:50 /irq /proc/irq ro,relatime - proc proc rw
698 694 0:50 /sys /proc/sys ro,relatime - proc proc rw
699 694 0:50 /sysrq-trigger /proc/sysrq-trigger ro,relatime - proc proc rw
700 694 0:51 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
701 694 0:51 /null /proc/keys rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
702 694 0:51 /null /proc/latency_stats rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
703 694 0:51 /null /proc/timer_list rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
704 694 0:51 /null /proc/sched_debug rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
705 694 0:56 / /proc/scsi ro,relatime - tmpfs tmpfs ro
706 693 0:51 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
707 706 0:52 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
708 706 0:49 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
709 706 0:55 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k
710 706 0:52 /0 /dev/console rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
711 693 0:53 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro
712 711 0:54 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755
713 712 0:28 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/systemd ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd
714 712 0:31 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/cpuset ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset
715 712 0:32 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/net_cls,net_prio ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
716 712 0:33 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
717 712 0:34 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/perf_event ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
718 712 0:35 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/devices ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
719 712 0:36 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/blkio ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
720 712 0:37 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/pids ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
721 712 0:38 / /sys/fs/cgroup/rdma ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
722 712 0:39 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/freezer ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
723 712 0:40 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
724 711 0:57 / /sys/firmware ro,relatime - tmpfs tmpfs ro
725 693 8:2 /var/lib/docker/containers/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda2 rw,stripe=256
944 693 8:2 /var/lib/docker/containers/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0/hostname /etc/hostname rw,relatime - ext4 /dev/sda2 rw,stripe=256
1352 693 8:2 /var/lib/docker/containers/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0/hosts /etc/hosts rw,relatime - ext4 /dev/sda2 rw,stripe=256
This raises two questions:

  1. What mechanism exactly is blocking the fresh mount of procfs on /proc?
  2. What is preventing me from unmounting /proc/kcore, etc.?


Answer

Quite some more digging turned up this answer to “About mounting and unmounting inherited mounts inside a newly-created mount namespace”, which points in the correct direction but needs additional explanation (not least because it builds on a misleading paragraph about mount namespaces being hierarchical in the man pages, which Michael Kerrisk fixed some time ago).

Our starting point is when runc sets up the (test) container: in order to mask system paths, especially in the container's future /proc tree, it creates a set of new mounts, masking individual files by bind-mounting /dev/null over them and masking subdirectories by mounting tmpfs over them. This results in procfs being mounted on /proc, together with the additional masking sub-mounts.

Now the test container starts, and at some point a process unshares into a new user namespace. Please keep in mind that this new user namespace (again) belongs to the (real) root user with UID 0, as a default Docker installation doesn't run containers in new user namespaces.
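
This can be checked from inside the container by printing the UID map of a freshly unshared user namespace; in this sketch (flags as in the question's commands), root of the new user namespace maps onto the container's root, UID 0:

docker run -it --rm --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox unshare -Umr cat /proc/self/uid_map
         0          0          1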

Next, the test process also unshares into a new mount namespace, so this new mount namespace belongs to the newly created user namespace, not to the initial user namespace. According to the section “Restrictions on mount namespaces” in mount_namespaces(7):

If the new namespace and the namespace from which the mount point list was copied are owned by different user namespaces, then the new mount namespace is considered less privileged.

Please note the criterion here: the “donor” mount namespace and the new mount namespace are owned by different user namespaces; it does not matter whether both user namespaces have the same owner UID or not.
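
The ownership relations can be made visible through the namespace links in /proc; in the following sketch (the namespace inode numbers are illustrative), both the user and mount namespace IDs change after unsharing, while the UID stays 0 throughout:

docker run -it --rm --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox sh -c 'readlink /proc/self/ns/user; readlink /proc/self/ns/mnt; unshare -Umr sh -c "readlink /proc/self/ns/user; readlink /proc/self/ns/mnt"'
user:[4026531837]
mnt:[4026532766]
user:[4026532768]
mnt:[4026532769]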

The important clue now is:

Mounts that come as a single unit from a more privileged mount namespace are locked together and may not be separated in a less privileged mount namespace. (The unshare(2) CLONE_NEWNS operation brings across all of the mounts from the original mount namespace as a single unit, and recursive mounts that propagate between mount namespaces propagate as a single unit.)

As it is now impossible to separate the /proc mount point from its masking sub-mounts, it is not possible to (re)mount /proc (question 1). In the same sense, it is impossible to unmount /proc/kcore, etc., because that would allow unmasking (question 2).
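
Question 2 is equally easy to reproduce: attempting to unmount one of the masking sub-mounts from inside the new user+mount namespaces fails with EPERM (the exact error wording depends on the busybox version):

docker run -it --rm --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox unshare -Umpfr umount /proc/kcore
umount: can't unmount /proc/kcore: Operation not permitted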

Now, deploying the test container with --security-opt systempaths=unconfined results in a single /proc mount only, without any of the masking sub-mounts. In consequence, and according to the man page rules cited above, there is only a single mount, which we are allowed to (re)mount, subject to possessing the CAP_SYS_ADMIN capability (which covers mounting, besides tons of other interesting functionality).
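
This can be double-checked by filtering the mountinfo for procfs entries with systempaths=unconfined in effect; only the single /proc mount should show up (a sketch; mount and device IDs will differ):

docker run -it --rm --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined busybox sh -c 'grep " /proc" /proc/self/mountinfo'
705 693 0:50 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw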

Please note that it is possible to unmount masked /proc/ paths inside the container while still in the original (= initial) user namespace and while possessing (not surprisingly) CAP_SYS_ADMIN. The (b)lock only kicks in with a separate user namespace, hence some projects strive to deploy containers in their own new user namespaces (which unfortunately has effects, not least on container networking).
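
A sketch of such unmasking in the container's initial user namespace, where CAP_SYS_ADMIN (courtesy of --cap-add ALL) is all that is needed, no unshare involved:

docker run -it --rm --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox sh -c 'umount /proc/kcore && echo unmasked'
unmasked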
