How does OCI/runc system path constraining work to prevent remounting such paths?

The background of my question is a set of test cases for my Linux-kernel Namespaces discovery Go package lxkns where I create a new child user namespace as well as a new child PID namespace inside a test container. I then need to remount /proc, otherwise I would see the wrong process information and cannot lookup the correct process-related information, such as the namespaces of the test process inside the new child user+PID namespaces (without resorting to guerilla tactics).

The test harness/test setup is essentially this and fails without --privileged (I’m simplifying to all caps and switching off seccomp and apparmor in order to cut through to the real meat):

docker run -it --rm --name closedboxx --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox unshare -Umpfr mount -t proc /proc proc
mount: permission denied (are you root?)

JavaScript
​x
 
docker run -it --rm --name closedboxx --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox unshare -Umpfr mount -t proc /proc procmount: permission denied (are you root?)​

Of course, the path of least of least resistance as well as least beauty is to use --privileged, which will get the job done and as this is a throw-away test container (maybe there is beauty in the sheer lack of it).

Recently, I became aware of Docker’s --security-opt systempaths=unconfined, which (afaik) translates into an empty readonlyPaths in the resulting OCI/runc container spec. The following Docker run command succeeds as needed, it just returns silently in the example, so it was carried out correctly:

docker run -it --rm --name closedboxx --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined busybox unshare -Umpfr mount -t proc /proc proc

JavaScript
 
docker run -it --rm --name closedboxx --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined busybox unshare -Umpfr mount -t proc /proc proc​

In case of the failing setup, when running without --privilege and without --security-opt systempaths=unconfined, the mounts inside the child user and PID namespaces inside the container look as follows:

docker run -it --rm --name closedboxx --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox unshare -Umpfr cat /proc/1/mountinfo
693 678 0:46 / / rw,relatime - overlay overlay rw,lowerdir=/var/lib/docker/overlay2/l/AOY3ZSL2FQEO77CCDBKDOPEK7M:/var/lib/docker/overlay2/l/VNX7PING7ZLTIPXRDFSBMIOKKU,upperdir=/var/lib/docker/overlay2/60e8ad10362e49b621d2f3d603845ee24bda62d6d77de96a37ea0001c8454546/diff,workdir=/var/lib/docker/overlay2/60e8ad10362e49b621d2f3d603845ee24bda62d6d77de96a37ea0001c8454546/work,xino=off
694 693 0:50 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
695 694 0:50 /bus /proc/bus ro,relatime - proc proc rw
696 694 0:50 /fs /proc/fs ro,relatime - proc proc rw
697 694 0:50 /irq /proc/irq ro,relatime - proc proc rw
698 694 0:50 /sys /proc/sys ro,relatime - proc proc rw
699 694 0:50 /sysrq-trigger /proc/sysrq-trigger ro,relatime - proc proc rw
700 694 0:51 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
701 694 0:51 /null /proc/keys rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
702 694 0:51 /null /proc/latency_stats rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
703 694 0:51 /null /proc/timer_list rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
704 694 0:51 /null /proc/sched_debug rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
705 694 0:56 / /proc/scsi ro,relatime - tmpfs tmpfs ro
706 693 0:51 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
707 706 0:52 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
708 706 0:49 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
709 706 0:55 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k
710 706 0:52 /0 /dev/console rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
711 693 0:53 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro
712 711 0:54 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755
713 712 0:28 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/systemd ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd
714 712 0:31 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/cpuset ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset
715 712 0:32 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/net_cls,net_prio ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
716 712 0:33 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
717 712 0:34 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/perf_event ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
718 712 0:35 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/devices ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
719 712 0:36 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/blkio ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
720 712 0:37 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/pids ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
721 712 0:38 / /sys/fs/cgroup/rdma ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
722 712 0:39 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/freezer ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
723 712 0:40 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
724 711 0:57 / /sys/firmware ro,relatime - tmpfs tmpfs ro
725 693 8:2 /var/lib/docker/containers/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda2 rw,stripe=256
944 693 8:2 /var/lib/docker/containers/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0/hostname /etc/hostname rw,relatime - ext4 /dev/sda2 rw,stripe=256
1352 693 8:2 /var/lib/docker/containers/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0/hosts /etc/hosts rw,relatime - ext4 /dev/sda2 rw,stripe=256

JavaScript
 
docker run -it --rm --name closedboxx --cap-add ALL --security-opt seccomp=unconfined --security-opt apparmor=unconfined busybox unshare -Umpfr cat /proc/1/mountinfo693 678 0:46 / / rw,relatime - overlay overlay rw,lowerdir=/var/lib/docker/overlay2/l/AOY3ZSL2FQEO77CCDBKDOPEK7M:/var/lib/docker/overlay2/l/VNX7PING7ZLTIPXRDFSBMIOKKU,upperdir=/var/lib/docker/overlay2/60e8ad10362e49b621d2f3d603845ee24bda62d6d77de96a37ea0001c8454546/diff,workdir=/var/lib/docker/overlay2/60e8ad10362e49b621d2f3d603845ee24bda62d6d77de96a37ea0001c8454546/work,xino=off694 693 0:50 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw695 694 0:50 /bus /proc/bus ro,relatime - proc proc rw696 694 0:50 /fs /proc/fs ro,relatime - proc proc rw697 694 0:50 /irq /proc/irq ro,relatime - proc proc rw698 694 0:50 /sys /proc/sys ro,relatime - proc proc rw699 694 0:50 /sysrq-trigger /proc/sysrq-trigger ro,relatime - proc proc rw700 694 0:51 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755701 694 0:51 /null /proc/keys rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755702 694 0:51 /null /proc/latency_stats rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755703 694 0:51 /null /proc/timer_list rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755704 694 0:51 /null /proc/sched_debug rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755705 694 0:56 / /proc/scsi ro,relatime - tmpfs tmpfs ro706 693 0:51 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755707 706 0:52 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666708 706 0:49 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw709 706 0:55 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k710 706 0:52 /0 /dev/console rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666711 693 0:53 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro712 711 0:54 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755713 712 0:28 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/systemd ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd714 712 0:31 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/cpuset ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset715 712 0:32 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/net_cls,net_prio ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio716 712 0:33 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory717 712 0:34 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/perf_event ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event718 712 0:35 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/devices ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices719 712 0:36 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/blkio ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio720 712 0:37 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/pids ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids721 712 0:38 / /sys/fs/cgroup/rdma ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma722 712 0:39 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/freezer ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer723 712 0:40 /docker/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0 /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct724 711 0:57 / /sys/firmware ro,relatime - tmpfs tmpfs ro725 693 8:2 /var/lib/docker/containers/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda2 rw,stripe=256944 693 8:2 /var/lib/docker/containers/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0/hostname /etc/hostname rw,relatime - ext4 /dev/sda2 rw,stripe=2561352 693 8:2 /var/lib/docker/containers/eebfacfdc6e0e34c4e62d9f162bdd7c04b232ba2d1f5327eaf7e00011d0235c0/hosts /etc/hosts rw,relatime - ext4 /dev/sda2 rw,stripe=256​

what mechanism exactly is blocking the fresh mount of procfs on /proc?
what is preventing me from unmounting /proc/kcore, etc.?

Answer

Quite some more digging turned up this answer to “About mounting and unmounting inherited mounts inside a newly-created mount namespace” which points in the correct direction, but needs additional explanations (not least due to basing on a misleading paragraph about mount namespaces being hierarchical from man pages which Michael Kerrisk fixed some time ago).

Our starting point is when runc sets up the (test) container, for masking system paths especially in the container’s future /proc tree, it creates a set of new mounts to either mask out individual files using /dev/null or subdirectories using tmpfs. This results in procfs being mounted on /proc, as well as further sub-mounts.

Now the test container starts and at some point a process unshares into a new user namespace. Please keep in mind that this new user namespace (again) belongs to the (real) root user with UID 0, as a default Docker installation won’t enable running containers in new user namespaces.

Next, the test process also unshares into a new mount namespace, so this new mount namespace belongs to the newly created user namespace, but not to the initial user namespace. According to section “restrictions on mount namespaces” in mount_namespaces(7):

If the new namespace and the namespace from which the mount point list was copied are owned by different user namespaces, then the new mount namespace is considered less privileged.

Please note that the criterion here is: the “donor” mount namespace and the new mount namespace have different user namespaces; it doesn’t matter whether they have the same owner user (UID), or not.

The important clue now is:

Mounts that come as a single unit from a more privileged mount namespace are locked together and may not be separated in a less privileged mount namespace. (The unshare(2) CLONE_NEWNS operation brings across all of the mounts from the original mount namespace as a single unit, and recursive mounts that propagate between mount namespaces propagate as a single unit.)

As it now is not possible anymore to separate the /proc mountpoint as well as the masking submounts, it’s not possible to (re)mount /proc (question 1). In the same sense, it is impossible to unmount /proc/kcore, because that would allow unmasking (question 2).

Now, when deploying the test container using --security-opt systempaths=unconfined this results in a single /proc mount only, without any of the masking submounts. In consequence and according to the man page rules cited above, there is only a single mount which we are allowed to (re)mount, subject to the CAP_SYS_ADMIN capability including also mounting (besides tons of other interesting functionality).

Please note that it is possible to unmount masked /proc/ paths inside the container while still in the original (=initial) user namespace and when possessing (not surprisingly) CAP_SYS_ADMIN. The (b)lock only kicks in with a separate user namespace, hence some projects striving for deploying containers in their own new user namespaces (which unfortunately has effects not least on container networking).

Advertisement

Answer