I’m looking to do some very basic microbenchmarking of small code paths, such as tight loops, that I’ve written in C++. I’m running on Linux and OS X, and using GCC. What facilities are there for sub-millisecond accuracy? I am thinking that a simple test of running the code path many times (several tens of millions?) will give me enough consistency to get a good reading. If anyone knows of preferable methods, please feel free to suggest them.
Answer
You can use the "rdtsc" processor instruction on x86/x86_64. On multicore systems, check for the “constant_tsc” capability in CPUID (listed in /proc/cpuinfo on Linux): it means that all cores use the same tick counter, even with dynamic frequency changes and sleeping.
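As a rough illustration of the approach from the question (run the code path many times, then divide), here is a minimal sketch that reads the counter through the `__rdtsc()` intrinsic from `<x86intrin.h>`, which GCC and Clang provide on x86/x86_64. The helper name `ticks_per_call` is made up for this example, and the result is in TSC ticks, not seconds. This version does not serialize the instruction stream.

```cpp
#include <cassert>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc(), available with GCC/Clang on x86/x86_64

// Hypothetical helper (not from the original answer): runs `fn`
// `iterations` times and returns the average number of TSC ticks per call.
template <typename Fn>
double ticks_per_call(Fn&& fn, std::uint64_t iterations) {
    assert(iterations > 0);  // avoid dividing by zero below
    const std::uint64_t start = __rdtsc();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        fn();
    }
    const std::uint64_t stop = __rdtsc();
    return static_cast<double>(stop - start) /
           static_cast<double>(iterations);
}
```

Something like `ticks_per_call([&] { sink = sink + 1; }, 10000000)` with a `volatile` variable `sink` keeps the compiler from optimizing the measured work away. The loop overhead is included in the figure, so it is best read as a relative number.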
If your processor does not support constant_tsc, be sure to bind your program to a single core (the taskset utility on Linux).
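If you would rather pin from inside the program than use taskset, Linux exposes the same mechanism through `sched_setaffinity`. The sketch below is Linux-only and is a swapped-in alternative to the taskset utility, not something the answer itself describes. The helper name `pin_to_first_allowed_cpu` is hypothetical, and the `CPU_*` macros assume `_GNU_SOURCE` is defined, which g++ does by default.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux)

// Hypothetical helper: pins the calling thread to the first CPU it is
// currently allowed to run on. Returns that CPU's index, or -1 on failure.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0) {
                return -1;
            }
            return cpu;
        }
    }
    return -1;  // no allowed CPU found
}
```

Calling this once at the start of main(), before any timing, should keep every later counter read on the same core. OS X does not offer this call, so there the sketch does not apply.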
When using rdtsc on out-of-order CPUs (all except Intel Atom and maybe some other low-end CPUs), add an “ordering” instruction before it, e.g. “cpuid”. It is a serializing instruction, so it temporarily prevents instruction reordering around the counter read.
Also, Mac OS X has “Shark”, which can measure some hardware events in your code.
RDTSC and out-of-order CPUs: more info is in section 18 of the second of Agner Fog’s great optimization manuals, “Optimizing subroutines in assembly language: An optimization guide for x86 platforms” (the main site with all five manuals is http://www.agner.org/optimize/):
http://www.scribd.com/doc/1548519/optimizing-assembly
On all processors with out-of-order execution, you have to insert XOR EAX,EAX / CPUID before and after each read of the counter in order to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.
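The XOR EAX,EAX / CPUID pattern from the quote could look roughly like this with GCC-style inline assembly on x86/x86_64. This is a sketch under assumptions, not code from the manual: the function name `serialized_rdtsc` is made up, and the clobber list simply tells the compiler that CPUID overwrites EAX, EBX, ECX and EDX.

```cpp
#include <cstdint>

// Hypothetical helper: CPUID, then RDTSC, then CPUID again, so the counter
// read cannot overlap with instructions before or after it.
inline std::uint64_t serialized_rdtsc() {
    std::uint32_t lo, hi;
    asm volatile(
        "xor %%eax, %%eax\n\t"  // CPUID leaf 0
        "cpuid\n\t"             // serialize before the read
        "rdtsc\n\t"             // counter -> EDX:EAX
        "mov %%eax, %0\n\t"     // save low half before CPUID overwrites it
        "mov %%edx, %1\n\t"     // save high half
        "xor %%eax, %%eax\n\t"
        "cpuid\n\t"             // serialize after the read
        : "=&r"(lo), "=&r"(hi)
        :
        : "eax", "ebx", "ecx", "edx", "memory");
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

Take one reading before the code under test and one after, then subtract. CPUID itself costs a fair number of cycles, so it helps to time an empty region first and subtract that overhead from the result.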