How to ensure a signal handler never yields to a thread within the same process group?

This is a bit of a meta question since I think I have a solution that works for me, but it has its own upsides and downsides. I need to do a fairly common thing: catch SIGSEGV on the faulting thread (there is no dedicated crash-handling thread), dump some debug information, and exit.

The catch here is that upon a crash, my application runs llvm-symbolizer, which takes a while (relatively speaking) and causes a yield (either because of clone + execve or because the thread exceeds its time quantum; I’ve seen the latter happen when doing symbolication myself in-process using libLLVM). The reason for doing all this is to get a stack trace with demangled symbols and with line/file information (stored in a separate DWP file). For obvious reasons I do not want a yield to happen during my SIGSEGV handler, since I intend to terminate the application (thread group) after it has executed and never return from the signal handler.

I’m not that familiar with Linux signal handling or with the magic glibc’s wrappers do around it. I know the basic gotchas, but there isn’t much information on the specifics, such as whether synchronous signal handlers get any kind of special priority in terms of scheduling.

Brainstorming, I had a few ideas and downsides to them:

pthread_kill(<every other thread>, SIGSTOP) – Cumbersome with more threads, and it interacts with signal handlers, which seems like it could have unintended side effects. It also requires intercepting thread creation from other libraries to keep track of the thread list, and the chance of pre-emption increases with every system call. I could possibly even change the threads’ contexts once they’re stopped to point to a syscall exit stub, or flat out use SIGKILL.
Global flag to serve as a cancellation point for all threads (kinda like pthread_cancel/pthread_testcancel). Safer, but it requires a lot of maintenance and across a large codebase it can be hellish, in addition to a mild performance overhead. A global flag could also cause the error to cascade: the program is already in an unpredictable state, so letting any other thread run at that point is already not great.
“Abusing” the scheduler, which is my current pick, with my implementation as one of the answers. Switching to the FIFO scheduling policy and raising the priority, therefore becoming the only runnable thread in that group.
Core dumps are not an option since the goal here was to avoid them in the first place. I would also prefer not to require a helper program aside from the symbolizer.
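For reference, the global-flag idea could look roughly like the sketch below. The names (g_crashing, crash_begin, crash_pending, crash_checkpoint) are made up for illustration and are not from any library. It assumes worker threads call crash_checkpoint() regularly, which is exactly the maintenance burden mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <unistd.h>

// Hypothetical names: the flag is set by the crash handler and polled by workers.
static std::atomic<bool> g_crashing{false};
static_assert(ATOMIC_BOOL_LOCK_FREE == 2,
              "the flag must be lock-free to be usable from a signal handler");

// Called first thing in the SIGSEGV handler.
inline void crash_begin() noexcept {
    g_crashing.store(true, std::memory_order_release);
}

// True once some thread has entered its crash handler.
inline bool crash_pending() noexcept {
    return g_crashing.load(std::memory_order_acquire);
}

// Sprinkled through long-running loops in worker threads.
// Returns immediately in normal operation; parks the caller once a crash is pending.
inline void crash_checkpoint() noexcept {
    while (crash_pending())
        pause();  // sleep until a signal arrives, then re-check the flag
}
```

A thread that never reaches a checkpoint (for example, one blocked inside a third-party library) keeps running, so this only narrows the window rather than closing it.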

Environment is a typical glibc based Linux (4.4) distribution with NPTL.

I know that crash handlers are fairly common now, so I suspect none of the ways I picked are that great, especially considering I’ve never seen the scheduler “hack” used that way. So with that, does anyone have a better alternative that is cleaner and less risky than the scheduler “hack”, and am I missing any important points in my general ideas about signals?

Edit: It seems that I hadn’t really considered MP in this equation (as per the comments): in an MP situation the other threads are still runnable and can happily continue running alongside the FIFO thread on a different processor. I can, however, change the affinity of the process so that it only executes on the same core as the crashing thread, which will effectively freeze all other threads at schedule boundaries. However, that still leaves the “FIFO thread yielding due to blocking IO” scenario open.
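A minimal sketch of the affinity idea, assuming Linux with /proc mounted. The function name pin_all_threads_to_current_cpu is hypothetical. Since sched_setaffinity acts on a single thread, the sketch walks /proc/self/task to reach every thread. Note that opendir/readdir are not async-signal-safe, so a real handler would need a safer way to enumerate threads, and threads created while the loop is running may be missed.

```cpp
#include <cassert>
#include <cstdlib>
#include <dirent.h>
#include <sched.h>
#include <sys/types.h>

// Pin every thread of the calling process to the CPU the caller is running on.
// Returns the number of threads pinned, or -1 if the setup failed.
int pin_all_threads_to_current_cpu() {
    int cpu = sched_getcpu();
    if (cpu < 0) return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    DIR* dir = opendir("/proc/self/task");  // one entry per thread id
    if (!dir) return -1;

    int pinned = 0;
    while (dirent* e = readdir(dir)) {
        if (e->d_name[0] == '.') continue;  // skip "." and ".."
        pid_t tid = static_cast<pid_t>(std::strtol(e->d_name, nullptr, 10));
        if (sched_setaffinity(tid, sizeof(set), &set) == 0) ++pinned;
    }
    closedir(dir);
    return pinned;
}
```

Combined with SCHED_FIFO on the crashing thread, the other threads then only get the CPU when the FIFO thread blocks, which is the remaining gap described above.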

It seems like the FIFO + SIGSTOP option is the best one, though I do wonder if there are any other tricks that can make a thread unschedulable short of using SIGSTOP. From the documentation it seems that it’s not possible to set a thread’s CPU affinity to zero (leaving it in a limbo state where it’s technically runnable but no processors are available for it to run on).

Answer

This is the best solution I could come up with (parts omitted for brevity, but it shows the principle), my basic assumption being that in this situation the process runs as root. This approach can lead to resource starvation if things go really bad, and it requires privileges (if I understand the sched(7) man page correctly). I run the part of the signal handler that causes preemptions under the OSSplHigh guard and exit the scope as soon as I can. This is not strictly C++ related, since the same could be done in C or any other native language.

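namespace os {

// Read the calling thread's current scheduling attributes into O.
void spl_get(spl_t& O)
{
    os_assert(syscall(__NR_sched_getattr,
        0, &O, sizeof(spl_t), 0) == 0);
}

// Apply the scheduling attributes in N to the calling thread.
void spl_set(spl_t& N)
{
    os_assert(syscall(__NR_sched_setattr,
        0, &N, 0) == 0);
}

// Save the current attributes in O, then switch to SCHED_FIFO at priority PRI.
void splx(uint32_t PRI, spl_t& O) {
    spl_t PL = {0};

    PL.size = sizeof(PL);
    PL.sched_policy = SCHED_FIFO;
    PL.sched_priority = PRI;

    spl_get(O);
    spl_set(PL);
}

} // namespace os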

class OSSplHigh {
    os::spl_t OldPrioLevel;

public:
    OSSplHigh() {
        os::splx(2, OldPrioLevel);
    }

    ~OSSplHigh() {
        os::spl_set(OldPrioLevel);
    }
};
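The snippets above use spl_t without defining it. A plausible definition, assuming it simply mirrors the kernel’s struct sched_attr in its original 48-byte layout, is sketched below. The wrapper names raw_sched_getattr and raw_sched_setattr are hypothetical and chosen so they do not collide with anything libc may declare.

```cpp
#include <cassert>
#include <cstdint>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Assumed layout: a copy of the kernel's `struct sched_attr` (first ABI version).
struct spl_t {
    uint32_t size;            // size of this structure, filled in by the caller
    uint32_t sched_policy;    // SCHED_OTHER, SCHED_FIFO, SCHED_RR, ...
    uint64_t sched_flags;
    int32_t  sched_nice;      // used by SCHED_OTHER / SCHED_BATCH
    uint32_t sched_priority;  // used by SCHED_FIFO / SCHED_RR
    uint64_t sched_runtime;   // the remaining three are used by SCHED_DEADLINE
    uint64_t sched_deadline;
    uint64_t sched_period;
};
static_assert(sizeof(spl_t) == 48, "must match the kernel's first sched_attr layout");

// Thin wrappers over the raw syscalls; pid 0 means the calling thread.
inline long raw_sched_getattr(pid_t pid, spl_t* attr) {
    return syscall(SYS_sched_getattr, pid, attr, sizeof(*attr), 0u);
}

inline long raw_sched_setattr(pid_t pid, spl_t* attr) {
    return syscall(SYS_sched_setattr, pid, attr, 0u);
}
```

Reading the attributes needs no privileges, whereas switching to SCHED_FIFO through raw_sched_setattr normally requires CAP_SYS_NICE or a suitable RLIMIT_RTPRIO.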

The handler itself is quite trivial, using sigaltstack and sigaction, though I do not block SIGSEGV on any thread. Also, oddly enough, neither the sched_setattr and sched_getattr syscalls nor the struct definition were exposed through glibc, contrary to the documentation.
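For completeness, installing such a handler might look like the sketch below. The names on_segv and install_segv_handler and the 64 KiB stack size are assumptions for illustration. The alternate stack is what lets the handler run even when the fault was a stack overflow.

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <unistd.h>

// Runs on the alternate stack; only async-signal-safe calls belong here.
static void on_segv(int, siginfo_t*, void*) {
    static const char msg[] = "fatal: SIGSEGV\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    // ... raise priority, dump the stack trace, run the symbolizer ...
    _exit(EXIT_FAILURE);  // never return to the faulting instruction
}

// Registers an alternate stack for the calling thread and installs the handler.
bool install_segv_handler() {
    static char alt_stack[64 * 1024];  // fixed size; SIGSTKSZ is not constant on newer glibc

    stack_t ss{};
    ss.ss_sp = alt_stack;
    ss.ss_size = sizeof(alt_stack);
    ss.ss_flags = 0;
    if (sigaltstack(&ss, nullptr) != 0) return false;

    struct sigaction sa{};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  // three-argument handler, run on alt stack
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, nullptr) == 0;
}
```

The signal disposition is process-wide, but sigaltstack is per-thread, so every thread that should survive a stack overflow needs its own alternate stack rather than the single static buffer used here.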

Late Edit: The best solution involved sending SIGSTOP to all threads, with pthread_create intercepted via the linker’s --wrap option to keep a ledger of all running threads. Thanks to the suggestion in the comments.

Answer