
Explain Linux commit message that patches/secures POP SS followed by a #BP interrupt (INT3)

This is in reference to CVE-2018-8897 (which appears related to CVE-2018-1087), described as follows:

A statement in the System Programming Guide of the Intel 64 and IA-32 Architectures Software Developer’s Manual (SDM) was mishandled in the development of some or all operating-system kernels, resulting in unexpected behavior for #DB exceptions that are deferred by MOV SS or POP SS, as demonstrated by (for example) privilege escalation in Windows, macOS, some Xen configurations, or FreeBSD, or a Linux kernel crash. The MOV to SS and POP SS instructions inhibit interrupts (including NMIs), data breakpoints, and single step trap exceptions until the instruction boundary following the next instruction (SDM Vol. 3A; section 6.8.3). (The inhibited data breakpoints are those on memory accessed by the MOV to SS or POP to SS instruction itself.) Note that debug exceptions are not inhibited by the interrupt enable (EFLAGS.IF) system flag (SDM Vol. 3A; section 2.3). If the instruction following the MOV to SS or POP to SS instruction is an instruction like SYSCALL, SYSENTER, INT 3, etc. that transfers control to the operating system at CPL < 3, the debug exception is delivered after the transfer to CPL < 3 is complete. OS kernels may not expect this order of events and may therefore experience unexpected behavior when it occurs.

When reading this related git commit to the Linux kernel, I noted that the commit message states:

x86/entry/64: Don’t use IST entry for #BP stack

There’s nothing IST-worthy about #BP/int3. We don’t allow kprobes in the small handful of places in the kernel that run at CPL0 with an invalid stack, and 32-bit kernels have used normal interrupt gates for #BP forever.

Furthermore, we don’t allow kprobes in places that have usergs while in kernel mode, so “paranoid” is also unnecessary.

In light of the vulnerability, I’m trying to understand the last sentence/paragraph in the commit message. I understand that an IST entry refers to one of the (allegedly) “known good” stack pointers in the Interrupt Stack Table that can be used to handle interrupts. I also understand that #BP refers to a breakpoint exception (equivalent to INT3), and that kprobes is the debugging mechanism that is claimed to only run in a few places in the kernel at ring 0 (CPL0) privilege level.

But I’m completely lost in the next part, which may be because “usergs” is a typo and I’m simply missing what was intended:

Furthermore, we don’t allow kprobes in places that have usergs while in kernel mode, so “paranoid” is also unnecessary.

What does this statement mean?


Answer

usergs is not a typo: it means the GS base still holds the user-space value. It relates to the x86-64 swapgs instruction, which exchanges the current GS base with a value saved in an MSR (IA32_KERNEL_GS_BASE), so that the kernel can find its own data, including the kernel stack, from a syscall entry point. swapgs exchanges only the cached base address; it does not change the gs selector or reload anything from the GDT based on the gs value. (wrgsbase can also change the GS base independently of the GDT/LDT.)
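As a rough mental model (a toy sketch, not how any kernel represents CPU state), swapgs can be pictured as exchanging two values. The class name `ToyCpu` and the two addresses are made up for illustration; only the swap itself reflects the instruction's behavior.

```python
from dataclasses import dataclass


@dataclass
class ToyCpu:
    """Toy model holding only the two values that swapgs touches."""
    gs_base: int         # active GS base, used by gs-relative memory accesses
    kernel_gs_base: int  # the IA32_KERNEL_GS_BASE MSR

    def swapgs(self) -> None:
        # Exchange the two values; the gs selector is not involved.
        self.gs_base, self.kernel_gs_base = self.kernel_gs_base, self.gs_base


USER_GS = 0x00007F0000001000    # hypothetical user-space address
KERNEL_GS = 0xFFFF888000000000  # hypothetical kernel address

cpu = ToyCpu(gs_base=USER_GS, kernel_gs_base=KERNEL_GS)
cpu.swapgs()             # kernel entry: the kernel value becomes active
after_one = cpu.gs_base
cpu.swapgs()             # a second swap undoes the first
after_two = cpu.gs_base
```

The point to notice is that running swapgs twice lands you back on the user value, which is why every entry path has to know exactly how many swaps have already happened.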

AMD’s design is that syscall doesn’t change RSP to point to the kernel stack, and doesn’t read/write any memory, so syscall itself can be fast. But then you enter the kernel with all registers holding their user-space values. See Why does Windows64 use a different calling convention from all other OSes on x86-64? for some links to mailing list discussions between kernel devs and AMD architects in ~2000, tweaking the design of syscall and swapgs to make it usable before any AMD64 CPUs were sold.


Apparently keeping track of whether GS base currently holds the kernel or the user value is tricky for error handling: there’s no way to say “I want kernelgs now”; every error-handling path has to know whether or not to run swapgs. The only instruction available is a swap, not one that sets GS base to a specific one of the two values.
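To make the "do I need to swap?" problem concrete, here is a small sketch of two ways an entry path could decide. The function names and the selector values are illustrative assumptions, not kernel identifiers. The first rule trusts the privilege level saved in CS; the second, which is my understanding of what a "paranoid" entry did in kernels of that era, inspects the GS base itself and treats a value with the top bit set as a kernel address (this assumes user space cannot install a kernel-looking GS base).

```python
KERNEL_CS = 0x10  # illustrative ring-0 code selector (low two bits = 0)
USER_CS = 0x33    # illustrative ring-3 code selector (low two bits = 3)

USER_GS = 0x00007F0000001000    # hypothetical user-space GS base
KERNEL_GS = 0xFFFF888000000000  # hypothetical kernel GS base


def needs_swapgs_by_cs(saved_cs: int) -> bool:
    """Normal rule: swap only if the interrupted code ran at CPL 3."""
    return (saved_cs & 3) == 3


def needs_swapgs_paranoid(gs_base: int) -> bool:
    """Paranoid rule: swap if the active GS base is not a kernel address."""
    return ((gs_base >> 63) & 1) == 0


# Interrupted in user mode: both rules agree that a swap is needed.
from_user = (needs_swapgs_by_cs(USER_CS), needs_swapgs_paranoid(USER_GS))

# Interrupted in kernel mode after swapgs has run: both rules agree on no swap.
from_kernel = (needs_swapgs_by_cs(KERNEL_CS), needs_swapgs_paranoid(KERNEL_GS))

# Interrupted in kernel mode before swapgs has run ("usergs in kernel mode"):
# the CS rule wrongly skips the swap, while the paranoid rule catches it.
usergs_in_kernel = (needs_swapgs_by_cs(KERNEL_CS), needs_swapgs_paranoid(USER_GS))
```

Read this way, the commit message is saying that a kprobe (which works via int3) can never sit in the third situation, so #BP can use the cheaper CS-based rule.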

See the comments in arch/x86/entry/entry_64.S, e.g. https://github.com/torvalds/linux/blob/9fb71c2f230df44bdd237e9a4457849a3909017d/arch/x86/entry/entry_64.S#L1267 (from current Linux), which mention usergs; the next block of comments describes doing a swapgs before jumping to some error-handling code with kernel gsbase.

In the Linux kernel on x86-64, the kernel GS base points at the current CPU’s per-CPU data area rather than at a per-thread info block. The syscall entry code loads the kernel stack pointer from a per-CPU variable there (stored as an absolute address, not relative to gs).

I wouldn’t be surprised if this bug basically tricks the kernel into loading the kernel rsp from a user-controlled gsbase, or otherwise screws up the dead-reckoning of swapgs so that it has the wrong gs at some point.
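A toy illustration of that guess (not a description of the actual exploit): if the kernel fetches its stack pointer with a gs-relative load while the GS base is still the user's, it reads whatever the user placed at that address. The slot offset, addresses, and the `load_kernel_rsp` helper are all hypothetical.

```python
STACK_SLOT_OFFSET = 0x10  # hypothetical offset of the saved kernel stack pointer

USER_GS = 0x00007F0000001000     # hypothetical user-controlled GS base
KERNEL_GS = 0xFFFF888000000000   # hypothetical kernel GS base
REAL_STACK = 0xFFFFC90000004000  # hypothetical genuine kernel stack top
FAKE_STACK = 0x00007F0000009000  # hypothetical attacker-chosen value


def load_kernel_rsp(memory: dict, gs_base: int) -> int:
    """Model of a gs-relative load: read the slot at gs_base + offset."""
    return memory[gs_base + STACK_SLOT_OFFSET]


# Toy address space: one slot in kernel data, one in a user-mapped page.
memory = {
    KERNEL_GS + STACK_SLOT_OFFSET: REAL_STACK,
    USER_GS + STACK_SLOT_OFFSET: FAKE_STACK,
}

rsp_with_kernel_gs = load_kernel_rsp(memory, KERNEL_GS)  # swapgs ran as expected
rsp_with_user_gs = load_kernel_rsp(memory, USER_GS)      # swapgs was skipped
```

With the wrong GS base active, the same load instruction hands back a user-chosen stack pointer, which is the kind of outcome the CVE text describes as privilege escalation on some systems.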

User contributions licensed under: CC BY-SA