This is a bit of a meta question since I think I have a solution that works for me, but it has its own upsides and downsides. I need to do a fairly common thing: catch SIGSEGV on a thread (no dedicated crash handling thread), dump some debug information and exit.
The catch here is that upon a crash, my application runs llvm-symbolizer, which takes a while (relatively speaking) and causes a yield (either because of clone + execve or because the thread exceeds its time quantum; I’ve seen the latter happen when doing symbolication myself in-process using libLLVM). The reason for doing all this is to get a stack trace with demangled symbols and with line/file information (stored in a separate DWP file). For obvious reasons I do not want a yield happening across my SIGSEGV handler, since I intend to terminate the application (thread group) after it has executed and never return from the signal handler.
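To make the yield concrete, here is a minimal sketch of launching an external helper and waiting for it. The helper name run_and_wait is hypothetical and not from my actual code; the point is only that the crashing thread sleeps in waitpid, which gives the scheduler a chance to run other threads.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (illustration only): run argv[0] with the given
// arguments and block until it exits. Returns the exit code, or -1 on failure.
int run_and_wait(const char* const argv[]) {
    assert(argv != nullptr && argv[0] != nullptr);
    pid_t pid = fork();  // glibc implements this on top of clone()
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: only async-signal-safe calls between fork and exec.
        execv(argv[0], const_cast<char* const*>(argv));
        _exit(127);  // only reached if execv failed
    }
    int status = 0;
    // The calling thread blocks here, so other threads are free to run.
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For the symbolizer, argv would look something like {"/usr/bin/llvm-symbolizer", "--obj=/path/to/binary", "0x401000", nullptr}, where the install path, binary path and address are all placeholders.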
I’m not that familiar with Linux signal handling or with the magic glibc’s wrappers do around it. I know the basic gotchas, but there isn’t much information on specifics such as whether synchronous signal handlers get any kind of special scheduling priority.
Brainstorming, I had a few ideas, each with downsides:
- pthread_kill(<every other thread>, SIGSTOP): Cumbersome with more threads, and it interacts with signal handlers, which seems like it could have unintended side effects. It also requires intercepting thread creation from other libraries to keep track of the thread list, and the chance of pre-emption increases with every system call. Possibly even change their contexts once they’re stopped to point to a syscall exit stub, or flat out use SIGKILL.
- Global flag to serve as a cancellation point for all threads (kinda like pthread_cancel/pthread_testcancel): Safer, but it requires a lot of maintenance and across a large codebase it can be hellish, in addition to a mild performance overhead. A global flag could also cause the error to cascade: the program is already in an unpredictable state, so letting any other thread run is already not great.
- “Abusing” the scheduler, which is my current pick, with my implementation as one of the answers: switching to the FIFO scheduling policy and raising priority, therefore becoming the only runnable thread in that group.
- Core dumps are not an option since the goal here was to avoid them in the first place. I would also prefer not to require a helper program aside from the symbolizer.
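As a rough illustration of the global flag idea, here is a minimal sketch. The names g_crashing, begin_crash and crash_checkpoint are made up for this example; every long-running loop in the codebase would have to call the checkpoint, which is exactly the maintenance burden mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <unistd.h>

// Hypothetical global flag, set by the crashing thread's signal handler.
std::atomic<bool> g_crashing{false};

bool crash_in_progress() {
    return g_crashing.load(std::memory_order_acquire);
}

// Called from the SIGSEGV handler before any slow work starts.
void begin_crash() {
    // A lock-free atomic can be touched safely from a signal handler.
    assert(g_crashing.is_lock_free());
    g_crashing.store(true, std::memory_order_release);
}

// Called by worker threads at convenient points. Once a crash has been
// flagged, the thread parks here and never goes back to its work.
void crash_checkpoint() {
    while (crash_in_progress()) {
        pause();  // sleeps until a signal arrives, then the loop re-checks
    }
}
```

Threads only stop when they reach a checkpoint, so anything between two checkpoints still runs against the already corrupted state.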
Environment is a typical glibc-based Linux (4.4) distribution with NPTL.
I know that crash handlers are fairly common now, so I suspect none of the approaches I picked are that great, especially considering I’ve never seen the scheduler “hack” used that way. So with that, does anyone have a better alternative that is cleaner and less risky than the scheduler “hack”, and am I missing any important points in my general ideas about signals?
Edit: It seems that I haven’t really considered MP in this equation (as per the comments): other threads are still runnable in an MP situation and can happily continue running alongside the FIFO thread on a different processor. I can, however, change the affinity of the process to only execute on the same core as the crashing thread, which will effectively freeze all other threads at schedule boundaries. However, that still leaves the “FIFO thread yielding due to blocking IO” scenario open.
It seems like the FIFO + SIGSTOP option is the best one, though I do wonder if there are any other tricks that can make a thread unschedulable short of using SIGSTOP. From the documentation it seems like it’s not possible to set a thread’s CPU affinity to zero (leaving it in a limbo state where it’s technically runnable except no processors are available for it to run on).
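The affinity idea could look roughly like the sketch below. The function name pin_process_to_current_cpu is hypothetical. It assumes /proc is mounted, and since sched_setaffinity only affects a single thread per call, it walks /proc/self/task to reach the others. Note that opendir and readdir are not async-signal-safe, so this is only an illustration of the principle and not something I would call production ready inside a handler.

```cpp
#include <cassert>
#include <cstdlib>
#include <dirent.h>
#include <sched.h>
#include <sys/types.h>

// Hypothetical helper: restrict every thread of this process to the CPU the
// caller is currently on. Returns the number of threads pinned, or -1.
int pin_process_to_current_cpu() {
    int cpu = sched_getcpu();
    if (cpu < 0) return -1;
    assert(cpu < CPU_SETSIZE);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    // Pin the caller first so it cannot migrate while the others are handled.
    if (sched_setaffinity(0, sizeof(set), &set) != 0) return -1;

    DIR* dir = opendir("/proc/self/task");
    if (dir == nullptr) return -1;
    int pinned = 0;
    while (dirent* entry = readdir(dir)) {
        char* end = nullptr;
        long tid = std::strtol(entry->d_name, &end, 10);
        if (end == entry->d_name || *end != '\0') continue;  // skips "." and ".."
        // A thread may exit between readdir and this call; ignore failures.
        if (sched_setaffinity(static_cast<pid_t>(tid), sizeof(set), &set) == 0) {
            ++pinned;
        }
    }
    closedir(dir);
    return pinned;
}
```

Threads created after the directory walk inherit the creator's mask, so they would end up on the same CPU as well.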
Answer
This is the best solution I could come up with (parts omitted for brevity, but it shows the principle), my basic assumption being that in this situation the process runs as root. This approach can lead to resource starvation in case things go really bad, and it requires privileges (if I understand the sched(7) man page correctly). I run the part of the signal handler that causes preemptions under the OSSplHigh guard and exit the scope as soon as I can. This is not strictly C++ related since the same could be done in C or any other native language.
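#include <cstdint>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

// spl_t and os_assert are among the parts omitted for brevity.
namespace os {
// Read the calling thread's current scheduling attributes into O.
void spl_get(spl_t& O) {
    os_assert(syscall(__NR_sched_getattr, 0, &O, sizeof(spl_t), 0) == 0);
}
// Apply the scheduling attributes in N to the calling thread.
void spl_set(spl_t& N) {
    os_assert(syscall(__NR_sched_setattr, 0, &N, 0) == 0);
}
// Save the current attributes in O, then switch to SCHED_FIFO at priority PRI.
void splx(uint32_t PRI, spl_t& O) {
    spl_get(O);
    spl_t PL = {0};
    PL.size = sizeof(PL);
    PL.sched_policy = SCHED_FIFO;
    PL.sched_priority = PRI;
    spl_set(PL);
}
} // namespace os

// RAII guard: raises priority on construction, restores it on destruction.
class OSSplHigh {
    os::spl_t OldPrioLevel;
public:
    OSSplHigh() { os::splx(2, OldPrioLevel); }
    ~OSSplHigh() { os::spl_set(OldPrioLevel); }
};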
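For comparison, roughly the same save/raise/restore pattern can be sketched with pthread_getschedparam and pthread_setschedparam, which glibc does expose. This is an alternative illustration, not what I actually run; SavedSched, raise_to_fifo and restore_sched are made-up names. One caveat: these pthread calls are not on POSIX's list of async-signal-safe functions, which may be a reason to stay with raw syscalls inside a handler.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <sched.h>

// Hypothetical holder for the scheduling state in effect before the switch.
struct SavedSched {
    int policy;
    sched_param param;
};

// Save the calling thread's policy and priority in `old`, then switch it to
// SCHED_FIFO at `priority`. Returns 0 on success or an errno-style code
// (typically EPERM when the process lacks the privilege).
int raise_to_fifo(int priority, SavedSched& old) {
    assert(priority >= sched_get_priority_min(SCHED_FIFO));
    assert(priority <= sched_get_priority_max(SCHED_FIFO));
    int rc = pthread_getschedparam(pthread_self(), &old.policy, &old.param);
    if (rc != 0) return rc;
    sched_param fifo{};
    fifo.sched_priority = priority;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &fifo);
}

// Put the calling thread back on its previous policy and priority.
int restore_sched(const SavedSched& old) {
    return pthread_setschedparam(pthread_self(), old.policy, &old.param);
}
```

Checking the return value instead of asserting makes the unprivileged case visible: without root (or CAP_SYS_NICE) the switch fails with EPERM and the handler can decide to carry on at normal priority.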
The handler itself is quite trivial using sigaltstack and sigaction, though I do not block SIGSEGV on any thread. Also, oddly enough, the sched_setattr and sched_getattr syscalls and the struct definition weren’t exposed through glibc, contrary to the documentation.
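For completeness, a minimal sketch of what installing such a handler can look like. The names on_segv and install_segv_handler, the 64 KiB stack size and the exit code are arbitrary choices for this example. The alternate stack set up by sigaltstack is per thread, so a real program would do this on every thread that should survive a stack overflow.

```cpp
#include <cassert>
#include <csignal>
#include <unistd.h>

constexpr int kCrashExitCode = 42;  // arbitrary value for this sketch

// Handler: only async-signal-safe calls, and it never returns.
extern "C" void on_segv(int, siginfo_t*, void*) {
    static const char msg[] = "fatal: SIGSEGV\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    // The debug dump / symbolization step would go here.
    _exit(kCrashExitCode);
}

// Give the calling thread an alternate signal stack and route SIGSEGV to
// on_segv on that stack. Returns true on success.
bool install_segv_handler() {
    static char stack_mem[64 * 1024];  // assumed to be large enough
    assert(sizeof(stack_mem) >= static_cast<size_t>(MINSIGSTKSZ));

    stack_t ss{};
    ss.ss_sp = stack_mem;
    ss.ss_size = sizeof(stack_mem);
    ss.ss_flags = 0;
    if (sigaltstack(&ss, nullptr) != 0) return false;

    struct sigaction sa{};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  // run on the alternate stack
    sigemptyset(&sa.sa_mask);               // no extra signals blocked
    return sigaction(SIGSEGV, &sa, nullptr) == 0;
}
```

SA_ONSTACK is what lets the handler run even when the fault was caused by the thread overflowing its normal stack.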
Late Edit: The best solution involved sending SIGSTOP to all threads, intercepting pthread_create via the linker’s --wrap option to keep a ledger of all running threads. Thanks to the suggestion in the comments.
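A rough sketch of how the ledger part can be wired up follows. The ledger namespace, the Start struct, the trampoline and the USE_LD_WRAP macro are all made up for this example. The wrapper section only links when the program is built with -Wl,--wrap=pthread_create, which is why it sits behind the macro.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <vector>

// Hypothetical ledger of live kernel thread IDs.
namespace ledger {
std::mutex g_lock;
std::vector<pid_t> g_tids;

void add(pid_t tid) {
    assert(tid > 0);
    std::lock_guard<std::mutex> guard(g_lock);
    g_tids.push_back(tid);
}
void remove(pid_t tid) {
    std::lock_guard<std::mutex> guard(g_lock);
    g_tids.erase(std::remove(g_tids.begin(), g_tids.end(), tid), g_tids.end());
}
std::vector<pid_t> snapshot() {
    std::lock_guard<std::mutex> guard(g_lock);
    return g_tids;
}
} // namespace ledger

#ifdef USE_LD_WRAP  // build with -DUSE_LD_WRAP -Wl,--wrap=pthread_create
struct Start {
    void* (*fn)(void*);
    void* arg;
};

extern "C" int __real_pthread_create(pthread_t*, const pthread_attr_t*,
                                     void* (*)(void*), void*);

// Runs on the new thread: record its TID, run the real entry point, unrecord.
static void* trampoline(void* raw) {
    Start start = *static_cast<Start*>(raw);
    delete static_cast<Start*>(raw);
    pid_t tid = static_cast<pid_t>(syscall(SYS_gettid));
    ledger::add(tid);
    void* result = start.fn(start.arg);
    ledger::remove(tid);  // skipped if the thread leaves via pthread_exit
    return result;
}

// With --wrap, calls to pthread_create in the linked objects land here.
extern "C" int __wrap_pthread_create(pthread_t* thread, const pthread_attr_t* attr,
                                     void* (*fn)(void*), void* arg) {
    Start* start = new Start{fn, arg};
    int rc = __real_pthread_create(thread, attr, trampoline, start);
    if (rc != 0) delete start;
    return rc;
}
#endif
```

Two caveats: --wrap only redirects references in the objects being linked, so threads created from inside already built shared libraries are not seen, and taking a mutex inside a signal handler is not async-signal-safe, so the crash path needs some care when reading the ledger.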