Skip to content
Advertisement

Is linux perf accurate for measuring cache misses for multithread C program?

Can linux perf measure cache misses for multithread program, or it can only report the result for master thread? I used it on a C program using pthread, it seemed the cache miss number was lower than the expected number.

Advertisement

Answer

Yes, perf stat is an accurate total across all threads. (Unless your CPU has an erratum where a certain PMU event over or under-counts. These do happen, more often than correctness bugs for actual architectural state, so check the errata sheet, aka “spec update” for Intel CPUs.)

Make sure you understand exactly what each cache event counts, though, e.g. L1d-misses counts l1d.replacement on a modern Intel like Skylake, so multiple misses on the same line are only one replacement. (How does Linux perf calculate the cache-references and cache-misses events).

Also note that HW prefetch can avoid a lot of misses for sequential access, if memory can keep up. Also related: L2 instruction fetch misses much higher than L1 instruction fetch misses


Also related: Difference Between mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram Events goes into some detail about what exactly those specific events count.

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement