Skip to content
Advertisement

What is the performance penalty of C++11 thread_local variables in GCC 4.8?

From the GCC 4.8 draft changelog:

G++ now implements the C++11 thread_local keyword; this differs from the GNU __thread keyword primarily in that it allows dynamic initialization and destruction semantics. Unfortunately, this support requires a run-time penalty for references to non-function-local thread_local variables even if they don’t need dynamic initialization, so users may want to continue to use __thread for TLS variables with static initialization semantics.

What is precisely the nature and origin of this run-time penalty?

Obviously to support non-function-local thread_local variables there needs to be a thread initialization phase before the entry to every thread main (just as there is a static initialization phase for global variables), but are they referring to some run-time penalty beyond that?

Roughly speaking what is the architecture of gcc’s new implementation of thread_local?

Advertisement

Answer

(Disclaimer: I don’t know much about the internals of GCC, so this is also an educated guess.)

The dynamic thread_local initialization is added in commit 462819c. One of the change is:

* semantics.c (finish_id_expression): Replace use of thread_local
variable with a call to its wrapper.

So the run-time penalty is that, every reference of the thread_local variable will become a function call. Let’s check with a simple test case:

JavaScript

When compiled*, we see that every use of the tls reference becomes a function call to _ZTW3tls, which lazily initialize the the variable once:

JavaScript

Compare it with the __thread version, which won’t have this extra wrapper:

JavaScript

This wrapper is not needed for in every use case of thread_local though. This can be revealed from decl2.c. The wrapper is generated only when:

  • It is not function-local, and,

    1. It is extern (the example shown above), or
    2. The type has a non-trivial destructor (which is not allowed for __thread variables), or
    3. The type variable is initialized by a non-constant-expression (which is also not allowed for __thread variables).

In all other use cases, it behaves the same as __thread. That means, unless you have some extern __thread variables, you could replace all __thread by thread_local without any loss of performance.


*: I compiled with -O0 because the inliner will make the function boundary less visible. Even if we turn up to -O3 those initialization checks still remain.

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement