While attempting to test "Is it allowed to access memory that spans the zero boundary in x86?" in user-space on Linux, I wrote a 32-bit test program that tries to map the low and high pages of the 32-bit virtual address space.
After echo 0 | sudo tee /proc/sys/vm/mmap_min_addr, I can map the zero page, but I don't know why I can't map -4096, i.e. (void*)0xfffff000, the highest page. Why does mmap2((void*)-4096) return -ENOMEM?
strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffe08827c10 /* 65 vars */) = 0
strace: [ Process PID=1407 runs in 32 bit mode. ]
....
mmap2(0xfffff000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0
Also, what check is rejecting it in linux/mm/mmap.c, and why is it designed that way? Is this part of making sure that creating a pointer to one-past-an-object can't wrap around and break pointer comparisons? (ISO C and C++ allow creating a pointer to one past the end of an object, but otherwise not outside of objects.)
I’m running under a 64-bit kernel (4.12.8-2-ARCH on Arch Linux), so 32-bit user space has the entire 4GiB available. (Unlike 64-bit code on a 64-bit kernel, or with a 32-bit kernel where the 2:2 or 3:1 user/kernel split would make the high page a kernel address.)
I haven’t tried from a minimal static executable (no CRT startup or libc, just asm) because I don’t think that would make a difference. None of the CRT startup system calls look suspicious.
While stopped at a breakpoint, I checked /proc/PID/maps. The top page isn't already in use. The stack includes the 2nd-highest page, but the top page is unmapped.
00000000-00001000 rw-p 00000000 00:00 0          ### the mmap(0) result
08048000-08049000 r-xp 00000000 00:15 3120510    /home/peter/src/SO/a.out
08049000-0804a000 r--p 00000000 00:15 3120510    /home/peter/src/SO/a.out
0804a000-0804b000 rw-p 00001000 00:15 3120510    /home/peter/src/SO/a.out
f7d81000-f7f3a000 r-xp 00000000 00:15 1511498    /usr/lib32/libc-2.25.so
f7f3a000-f7f3c000 r--p 001b8000 00:15 1511498    /usr/lib32/libc-2.25.so
f7f3c000-f7f3d000 rw-p 001ba000 00:15 1511498    /usr/lib32/libc-2.25.so
f7f3d000-f7f40000 rw-p 00000000 00:00 0
f7f7c000-f7f7e000 rw-p 00000000 00:00 0
f7f7e000-f7f81000 r--p 00000000 00:00 0          [vvar]
f7f81000-f7f83000 r-xp 00000000 00:00 0          [vdso]
f7f83000-f7fa6000 r-xp 00000000 00:15 1511499    /usr/lib32/ld-2.25.so
f7fa6000-f7fa7000 r--p 00022000 00:15 1511499    /usr/lib32/ld-2.25.so
f7fa7000-f7fa8000 rw-p 00023000 00:15 1511499    /usr/lib32/ld-2.25.so
fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]
Are there VMA regions that don't show up in maps but still convince the kernel to reject the address? I looked at the occurrences of ENOMEM in linux/mm/mmap.c, but it's a lot of code to read, so maybe I missed something. Is there something that reserves a range of high addresses, or rejects this one because it's next to the stack?
Making the system calls in the other order doesn’t help (but PAGE_ALIGN and similar macros are written carefully to avoid wrapping around before masking, so that wasn’t likely anyway.)
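To illustrate what I mean by wrapping before masking, here's a toy user-space sketch (my own macros, not the kernel's): an "add then mask" round-up of an address inside the top page wraps to 0 in 32-bit arithmetic.

#include <stdio.h>
#include <stdint.h>

/* Not the kernel's macros -- just a demonstration of the hazard. */
#define PAGE_SIZE 4096u
#define PAGE_MASK (~(PAGE_SIZE - 1))

int main(void)
{
    uint32_t addr  = 0xfffff001u;                         /* inside the top page */
    uint32_t naive = (addr + PAGE_SIZE - 1) & PAGE_MASK;  /* addition wraps past 2^32 */
    printf("naive round-up of %#x = %#x\n", addr, naive); /* prints 0 */
    return 0;
}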
Full source, compiled with gcc -O3 -fno-pie -no-pie -m32 address-wrap.c:
#include <sys/mman.h>
//void *mmap(void *addr, size_t len, int prot, int flags,
//           int fildes, off_t off);

int main(void) {
    volatile unsigned *high = mmap((void*)-4096L, 4096, PROT_READ | PROT_WRITE,
                                   MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

    volatile unsigned *zeropage = mmap((void*)0, 4096, PROT_READ | PROT_WRITE,
                                       MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

    return (high == MAP_FAILED) ? 2 : *high;
}
(I left out the part that tried to deref (int*)-2 because it just segfaults when mmap fails.)
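For reference, that omitted part would look something like the sketch below; it only does anything interesting if both mmap calls succeed, which is exactly what doesn't happen here.

#include <sys/mman.h>

/* Sketch of the omitted test: an int at (int*)-2 straddles the top page and
 * the zero page, so reading it only works if both pages are mapped. */
int main(void)
{
    volatile unsigned *high = mmap((void*)-4096L, 4096, PROT_READ | PROT_WRITE,
                                   MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    volatile unsigned *zeropage = mmap((void*)0, 4096, PROT_READ | PROT_WRITE,
                                       MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (high == MAP_FAILED || zeropage == MAP_FAILED)
        return 2;

    volatile int *wrap = (volatile int *)-2;   /* 2 bytes in the top page, 2 in the zero page */
    return *wrap;                              /* segfaults unless both pages are mapped */
}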
Answer
The mmap function eventually calls either do_mmap or do_brk_flags, which do the actual work of satisfying the memory allocation request. These functions in turn call get_unmapped_area. It is in that function that the checks are made to ensure that memory cannot be allocated beyond the user address space limit, which is defined by TASK_SIZE. I quote from the code:
/*
 * There are a few constraints that determine this:
 *
 * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
 * address, then that syscall will enter the kernel with a
 * non-canonical return address, and SYSRET will explode dangerously.
 * We avoid this particular problem by preventing anything executable
 * from being mapped at the maximum canonical address.
 *
 * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
 * CPUs malfunction if they execute code from the highest canonical page.
 * They'll speculate right off the end of the canonical space, and
 * bad things happen.  This is worked around in the same way as the
 * Intel problem.
 */
#define TASK_SIZE_MAX       ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)

#define IA32_PAGE_OFFSET    ((current->personality & ADDR_LIMIT_3GB) ? 0xc0000000 : 0xFFFFe000)

#define TASK_SIZE           (test_thread_flag(TIF_ADDR32) ? IA32_PAGE_OFFSET : TASK_SIZE_MAX)
On processors with 48-bit virtual address spaces, __VIRTUAL_MASK_SHIFT is 47.
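As a quick sanity check of the arithmetic, the sketch below plugs the constants in by hand (values hard-coded here, not read from the kernel):

#include <stdio.h>

int main(void)
{
    /* __VIRTUAL_MASK_SHIFT = 47, PAGE_SIZE = 4096 */
    unsigned long long task_size_max = (1ULL << 47) - 4096;  /* 0x7ffffffff000 */
    unsigned long long ia32_limit    = 0xFFFFe000;           /* without ADDR_LIMIT_3GB */
    printf("64-bit TASK_SIZE_MAX = %#llx\n", task_size_max);
    printf("32-bit TASK_SIZE     = %#llx\n", ia32_limit);
    return 0;
}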
Note that TASK_SIZE is defined differently depending on whether the current process is 32-bit on a 32-bit kernel, 32-bit on a 64-bit kernel, or 64-bit on a 64-bit kernel. For 32-bit processes, two pages are reserved: one for the vsyscall page and the other used as a guard page. Essentially, the vsyscall page cannot be unmapped, so the highest address of the user address space is effectively 0xFFFFe000. For 64-bit processes, one guard page is reserved. These pages are only reserved on 64-bit Intel and AMD processors, because the SYSCALL mechanism is used only on those processors.
Here is the check that is performed in get_unmapped_area:

if (addr > TASK_SIZE - len)
    return -ENOMEM;
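To see how this rejects the call from the question, plug in the numbers for a 32-bit process (a standalone sketch with the values hard-coded; TASK_SIZE is taken as 0xFFFFe000 as explained above):

#include <stdio.h>

int main(void)
{
    unsigned long task_size = 0xFFFFe000UL;   /* 32-bit TASK_SIZE from above */
    unsigned long addr      = 0xfffff000UL;   /* the (void*)-4096 hint */
    unsigned long len       = 4096;

    /* TASK_SIZE - len = 0xffffd000, and 0xfffff000 > 0xffffd000, so -ENOMEM.
     * The highest hint that can pass this check is 0xffffd000, i.e. the page
     * just below the reserved region that ends at TASK_SIZE. */
    if (addr > task_size - len)
        puts("rejected: ENOMEM");
    else
        puts("allowed");
    return 0;
}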