Here is a question about what happens during a system call.
One thing surprises me: the TSS maintains different stacks for different privilege levels. That is, code running in user mode and code running in kernel mode use different stacks.
Since a system call is essentially a function call, why can't we just reuse the user stack and create a new stack frame for it?
Answer
System calls and function calls are two completely different things. Kernel space and user space are two separate worlds, and you want to keep them as separate as possible, both for simplicity and for security. Kernel operations must be transparent to user programs, and kernel data must remain visible to the kernel only. A syscall is not just a function call: the work done by the kernel needs to stay invisible to the user program.
The simplest reason for using separate stacks is that they actually belong to two different programs: one is the user space program, the other is the operating system. The two stacks have different sizes and locations, so they are much simpler to manage separately. Imagine what would happen if the user program made a syscall from a leaf function that had already exhausted all the available stack: the kernel would either fail to execute the syscall and crash, or it would have to detect the situation and avoid crashing by growing the current stack. This is much more complicated than keeping the two stacks separate, and it would be harder to keep transparent to the user program (unless the kernel then reverted those changes, which would complicate things even more).
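To make the exhausted-stack scenario concrete, here is a toy model in C. It is only a sketch and not how any real kernel accounts for stack space: `struct stack_budget`, `KERNEL_FRAME_BYTES`, and both `syscall_on_*` functions are hypothetical names invented for this illustration, and the 512-byte figure is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: a stack is reduced to a capacity and a number of bytes in use. */
struct stack_budget {
    size_t capacity;
    size_t used;
};

/* Hypothetical amount of stack the kernel needs to service one syscall. */
#define KERNEL_FRAME_BYTES 512

/* Returns true if `bytes` more can be pushed without overflowing. */
bool stack_has_room(const struct stack_budget *s, size_t bytes)
{
    assert(s->used <= s->capacity);
    return s->capacity - s->used >= bytes;
}

/* Shared design: the kernel runs on whatever the caller has left over. */
bool syscall_on_shared_stack(const struct stack_budget *user)
{
    return stack_has_room(user, KERNEL_FRAME_BYTES);
}

/* Separate design: the kernel stack starts empty on every entry from
 * user mode, so the caller's stack usage is irrelevant. */
bool syscall_on_separate_stack(const struct stack_budget *user,
                               size_t kernel_capacity)
{
    (void)user; /* deliberately ignored */
    struct stack_budget kernel = { .capacity = kernel_capacity, .used = 0 };
    return stack_has_room(&kernel, KERNEL_FRAME_BYTES);
}
```

With a user stack of 8192 bytes of which 8000 are in use, `syscall_on_shared_stack` reports failure, while `syscall_on_separate_stack` succeeds for any kernel stack of at least `KERNEL_FRAME_BYTES`, regardless of how deep the user program has recursed.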
In addition to the above, treating a syscall like a simple function call, and thus making the kernel use the same stack as the calling program, would leave a lot of kernel data on the program's stack (basically every local variable used by kernel functions while processing the syscall). This side effect should not happen: it would expose kernel data and addresses to userspace, which is a security problem. If the kernel used the same stack as the user program that issued the syscall, every kernel function would need to "clean out" all the local variables it used before returning (and even the saved return addresses on the stack). That would make the entire operating system a lot slower, and it would still not be transparent to the user program.
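The leftover-data problem can be illustrated with a small user-space simulation in C. This is a sketch under stated assumptions, not real kernel code: the stack is modelled as a plain byte array, and `struct toy_stack`, `toy_kernel_work`, and `toy_stack_contains` are hypothetical helpers made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_BYTES 64

/* Toy downward-growing stack: sp is the index of the lowest byte in use. */
struct toy_stack {
    unsigned char mem[STACK_BYTES];
    size_t sp;
};

void toy_stack_init(struct toy_stack *s)
{
    memset(s->mem, 0, sizeof s->mem);
    s->sp = STACK_BYTES; /* empty stack */
}

/* Simulated kernel routine: pushes a secret "local variable", then pops
 * its frame without wiping it, just as an ordinary function return does. */
void toy_kernel_work(struct toy_stack *s, const char *secret)
{
    size_t n = strlen(secret);
    assert(n <= s->sp);
    s->sp -= n;
    memcpy(&s->mem[s->sp], secret, n);
    s->sp += n; /* frame popped, bytes left behind */
}

/* Returns 1 if the secret's bytes appear anywhere in the stack memory. */
int toy_stack_contains(const struct toy_stack *s, const char *secret)
{
    size_t n = strlen(secret);
    if (n == 0 || n > STACK_BYTES)
        return 0;
    for (size_t i = 0; i + n <= STACK_BYTES; i++)
        if (memcmp(&s->mem[i], secret, n) == 0)
            return 1;
    return 0;
}
```

If `toy_kernel_work` is run on the user's `toy_stack`, `toy_stack_contains` finds the secret there afterwards even though the stack pointer is back where it started. If it is run on a separate kernel `toy_stack`, the user's stack memory is never written, so there is nothing to scrub.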