How is a context switch made in the Linux kernel when a process exits before the timer interrupt?
I know that if a process is running and a timer interrupt occurs, the schedule function is called automatically if the reschedule flag is set, and schedule then selects the next process to run. In this case, schedule runs in the context of the current process. But what happens when the process exits before any timer interrupt occurs? Who calls schedule in that case, and in what context does it run?
Answer
It’s important to understand that the timer interrupt is just one of several hundred different reasons why schedule might get called. Only programs whose runtime is dominated by computation, which is rarer than you’d think, ever exhaust their time slice. It’s much more common for programs to run for only a few microseconds (yes, microseconds) at a time, in between “blocking” in system calls, waiting for user input or whatever.
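One way to see this from user space is to compare voluntary and involuntary context switches, which the kernel reports through getrusage. The following is a minimal sketch, not a benchmark: the two workloads and the helper names are made up for illustration, and the actual counts depend on the machine, the kernel configuration, and whatever else is running.

```python
import resource
import time


def context_switch_counts():
    """Return (voluntary, involuntary) context-switch totals for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def measure(workload):
    """Run workload() and return how many switches of each kind happened meanwhile."""
    vol_before, invol_before = context_switch_counts()
    workload()
    vol_after, invol_after = context_switch_counts()
    return vol_after - vol_before, invol_after - invol_before


def blocking_workload():
    # Illustrative workload: each short sleep blocks in a system call, so the
    # process gives up the CPU without waiting for a timer tick.
    for _ in range(50):
        time.sleep(0.001)


def cpu_bound_workload():
    # Illustrative workload: pure computation, which only leaves the CPU
    # when the scheduler preempts it.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    print("blocking  (voluntary, involuntary):", measure(blocking_workload))
    print("cpu-bound (voluntary, involuntary):", measure(cpu_bound_workload))
```

On a typical Linux system the blocking workload should show mostly voluntary switches, while the CPU-bound one shows few voluntary switches and possibly some involuntary ones, but treat the numbers as indicative only.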
When a process exits in any way, a call to do_exit always happens eventually, in the (kernel) context of that process. do_exit calls schedule as its last action, and schedule never returns to that context. Note how, at the very end of do_exit, there is a call to schedule, followed immediately by BUG(); and an infinite loop. (In newer kernel versions this tail has been moved into a helper, do_task_dead, but the structure is the same.)
Just prior to this, do_exit calls exit_notify, which is responsible for sending SIGCHLD to the parent process and/or releasing it from a call to wait. So, a lot of the time, the parent process will be ready to run when schedule gets called, and will be selected.
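The parent's side of this can be sketched from user space. In the example below (the function name is made up for illustration), the child exits immediately, long before any time slice could expire, and the parent, which is blocked in wait, is released by the child's exit and collects the exit code.

```python
import os


def run_child_and_wait(exit_code):
    """Fork a child that exits at once; return the exit code the parent sees."""
    pid = os.fork()
    if pid == 0:
        # Child: os._exit calls _exit directly, skipping Python-level
        # cleanup, so the process enters the kernel's exit path right away.
        os._exit(exit_code)
    # Parent: blocks in wait until the child's exit releases it.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print("child exited with", run_child_and_wait(7))
```

Nothing in this program depends on the timer interrupt: the switch away from the child happens because the child exited, and the parent becomes runnable because it was waiting for exactly that.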
do_exit also deallocates all of the user-space state and much of the kernel state associated with the process: it frees memory, closes file descriptors, and so on. The task_struct itself must survive until someone calls wait, and I can’t figure out exactly how the kernel decides that it can now be deallocated; this code is too convoluted.
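The user-visible side of "the task_struct survives until someone calls wait" is the zombie state. The sketch below assumes Linux with /proc mounted; the helper names are invented for this example. It uses waitid with WNOWAIT to block until the child has exited without reaping it, checks the state the kernel reports, and only then reaps the child.

```python
import os


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return None
    # The command name is parenthesised and may contain spaces, so take the
    # first field after the last ')'.
    return stat.rsplit(")", 1)[1].split()[0]


def observe_zombie(exit_code=3):
    """Return (state before reaping, exit code, whether the child is gone after wait)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    # WNOWAIT: block until the child has exited, but leave it unreaped.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before_reap = proc_state(pid)
    _, status = os.waitpid(pid, 0)  # the real wait, which reaps the child
    try:
        os.waitpid(pid, 0)
        reaped = False
    except ChildProcessError:
        reaped = True  # no such child any more
    return state_before_reap, os.WEXITSTATUS(status), reaped


if __name__ == "__main__":
    print(observe_zombie())
```

Between the child's exit and the parent's wait, the state is expected to read as Z (zombie): the process has no user-space state left, but the kernel still keeps enough of a record to report the exit status.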
If the process called _exit, the kernel call chain is simply sys_exit_group to do_group_exit to do_exit.
If it took a fatal synchronous signal (e.g. SIGSEGV), the call chain is a lot longer and has a tricky diversion in it. The hardware trap is fielded by architecture-specific code (e.g. x86 do_trap), which goes through force_sig_info and send_signal to complete_signal, which adjusts the task state and then tells the scheduler to wake up the offending process. The offending process wakes up, and a maze of architecture-specific signal handling logic eventually delivers it to get_signal, which calls do_group_exit, which calls do_exit.
Fatal asynchronous signals (e.g. from typing kill 12345 at a shell prompt) start at sys_kill and go through kill_something_info, group_send_sig_info, and do_send_sig_info to send_signal, after which everything proceeds as above. In both cases, all of the steps up to complete_signal may happen in any process context, but everything after “The offending process wakes up” happens in that process’s context.
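Both signal cases can be triggered from user space, although the kernel call chains themselves are not visible from there. The sketch below uses invented helper names and assumes default signal dispositions. The synchronous case deliberately reads address 0 through ctypes to provoke a real fault, and the asynchronous case has the parent send SIGTERM, which is what kill sends by default.

```python
import ctypes
import os
import signal
import time


def fatal_signal_of(child_action, parent_action=None):
    """Fork, run child_action in the child, and return the signal that killed it.

    Returns None if the child exited normally instead.
    """
    pid = os.fork()
    if pid == 0:
        try:
            child_action()
        finally:
            os._exit(0)  # only reached if child_action returns or raises
    if parent_action is not None:
        parent_action(pid)
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None


def dereference_null():
    # Synchronous case: reading address 0 causes a hardware fault, which the
    # kernel turns into SIGSEGV for this process.
    ctypes.string_at(0)


def sleep_briefly():
    # Bounded, so the child cannot linger if the signal is somehow ignored.
    time.sleep(30)


def send_sigterm(pid):
    # Asynchronous case: the signal originates in another process's context.
    os.kill(pid, signal.SIGTERM)


if __name__ == "__main__":
    print("synchronous: ", fatal_signal_of(dereference_null))
    print("asynchronous:", fatal_signal_of(sleep_briefly, send_sigterm))
```

In both cases the parent learns of the death the same way as for a plain _exit, through wait, which is consistent with all three paths ending in do_exit.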
The only parts of this description that are Linux-specific are the names of functions in the kernel’s code. Any implementation of Unix will have kernel functions that do more or less what Linux’s do_exit and schedule do, and the sequences of operations involved in fielding _exit, fatal synchronous signals, and fatal asynchronous signals will be recognizably similar.