
Reason for collapse of memory bandwidth when 2KB of data is cached in L1-cache

In a self-educational project I measure memory bandwidth with the help of the following code (here paraphrased; the whole code follows at the end of the question):

unsigned int doit(const std::vector<unsigned int> &mem){
   const size_t BLOCK_SIZE=16;
   size_t n = mem.size();
   unsigned int result=0;
   for(size_t i=0;i<n;i+=BLOCK_SIZE){           
             result+=mem[i];
   }
   return result;
}

//... initialize mem, result and so on
int NITER = 200; 
//... measure time of
   for(int i=0;i<NITER;i++)
       result+=doit(mem);

BLOCK_SIZE is chosen in such a way that a whole 64-byte cache line is fetched per single integer addition. My machine (an Intel Broadwell) needs about 0.35 nanoseconds per integer addition, so the code above could saturate a bandwidth as high as 182 GB/s (this value is just an upper bound and is probably quite off; what is important is the ratio of bandwidths for different sizes). The code is compiled with g++ and -O3.
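
To spell out the arithmetic behind that upper bound: one 64-byte cache line is consumed per addition, and one addition takes about 0.35 ns, so

64 bytes / 0.35 ns ≈ 182 GB/s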

Varying the size of the vector, I can observe the expected bandwidths for the L1(*), L2 and L3 caches and for RAM:

[Plot: measured bandwidth vs. vector size, with plateaus corresponding to L1, L2, L3 and RAM]

However, there is an effect I am really struggling to explain: the collapse of the measured bandwidth of the L1 cache for sizes around 2 KB, shown here at somewhat higher resolution:

[Plot: zoomed view of the L1 region, showing the bandwidth collapse around 2 KB]

I could reproduce the results on all machines I have access to (which have Intel-Broadwell and Intel-Haswell processors).

My question: What is the reason for the performance collapse for memory sizes around 2 KB?

(*) I hope I understand correctly that for the L1 cache not 64 bytes but only 4 bytes are read/transferred per addition (there is no further, faster cache for which a whole cache line must be filled), so the plotted bandwidth for L1 is only an upper limit and not the bandwidth itself.
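
In numbers, derived from the figures above: only 4 bytes are actually read per 0.35 ns addition, so the data really transferred from L1 is roughly

4 bytes / 0.35 ns ≈ 11 GB/s

i.e. a factor BLOCK_SIZE = 16 below the plotted line.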

Edit: When the step size in the inner for-loop is chosen to be

  • 8 (instead of 16), the collapse happens at 1 KB
  • 4 (instead of 16), the collapse happens at 0.5 KB

i.e. when the inner loop consists of about 31-35 steps/reads. That means the collapse isn't due to the memory size but due to the number of steps in the inner loop.
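
The arithmetic behind this observation (with 4-byte unsigned ints) shows that the iteration count at the collapse point is the same in all three cases:

2048 B / (4 B * 16) = 32 iterations
1024 B / (4 B *  8) = 32 iterations
 512 B / (4 B *  4) = 32 iterations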

It can be explained with branch misses as shown in @user10605163’s great answer.


Listing for reproducing the results

bandwidth.cpp:

#include <vector>
#include <chrono>
#include <iostream>
#include <algorithm>


//returns minimal time needed for one execution in seconds:
template<typename Fun>
double timeit(Fun&& stmt, int repeat, int number)
{  
   std::vector<double> times;
   for(int i=0;i<repeat;i++){
       auto begin = std::chrono::high_resolution_clock::now();
       for(int i=0;i<number;i++){
          stmt();
       }
       auto end = std::chrono::high_resolution_clock::now();
       double time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9/number;
       times.push_back(time);
   }
   return *std::min_element(times.begin(), times.end());
}


const int NITER=200;
const int NTRIES=5;
const size_t BLOCK_SIZE=16;


struct Worker{
   std::vector<unsigned int> &mem;
   size_t n;
   unsigned int result;
   void operator()(){
        for(size_t i=0;i<n;i+=BLOCK_SIZE){           
             result+=mem[i];
        }
   }

   Worker(std::vector<unsigned int> &mem_):
       mem(mem_), n(mem.size()), result(1)
   {}
};

double PREVENT_OPTIMIZATION=0.0;


double get_size_in_kB(int SIZE){
   return SIZE*sizeof(int)/(1024.0);
}

double get_speed_in_GB_per_sec(int SIZE){
   std::vector<unsigned int> vals(SIZE, 42);
   Worker worker(vals);
   double time=timeit(worker, NTRIES, NITER);
   PREVENT_OPTIMIZATION+=worker.result;
   return get_size_in_kB(SIZE)/(1024*1024)/time;
}


int main(){

   int size=BLOCK_SIZE*16;
   std::cout<<"size(kB),bandwidth(GB/s)n";
   while(size<10e3){
       std::cout<<get_size_in_kB(size)<<","<<get_speed_in_GB_per_sec(size)<<"\n";
       size=(static_cast<int>(size+BLOCK_SIZE)/BLOCK_SIZE)*BLOCK_SIZE;
   }

   //ensure that nothing is optimized away:
   std::cerr<<"Sum: "<<PREVENT_OPTIMIZATION<<"n";
}
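
As an aside: the global PREVENT_OPTIMIZATION accumulator is what keeps the compiler from discarding the computation. If one prefers to avoid the global, an empty inline-asm sink (the technique used by several benchmarking libraries; GCC/Clang-specific, and the name do_not_optimize is just an illustrative choice, not part of the listing above) achieves the same effect:

//tells the compiler that `value` is used, without emitting any instructions:
static inline void do_not_optimize(unsigned int value){
   asm volatile("" : : "r"(value));
}

//... in get_speed_in_GB_per_sec, instead of adding to the global:
//do_not_optimize(worker.result);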

create_report.py:

import sys
import pandas as pd
import matplotlib.pyplot as plt

input_file=sys.argv[1]
output_file=input_file[0:-3]+'png'
data=pd.read_csv(input_file)

labels=list(data)    
plt.plot(data[labels[0]], data[labels[1]], label="my laptop")
plt.xlabel(labels[0])
plt.ylabel(labels[1])   
plt.savefig(output_file)
plt.close()

Building/running/creating report:

>>> g++ -O3 -std=c++11 bandwidth.cpp -o bandwidth
>>> ./bandwidth > report.txt
>>> python create_report.py report.txt
# image is in report.png


Answer

I changed the values slightly: NITER = 100000 and NTRIES=1 to get a less noisy result.

I don't have a Broadwell available right now; however, I tried your code on my Coffee Lake and got a performance drop, not at 2 KB but around 4.5 KB. In addition, I find erratic behavior of the throughput slightly above 2 KB.

The blue line in the graph corresponds to your measurement (left axis). The red line is the result of perf stat -e branch-instructions,branch-misses, giving the fraction of branches that were not correctly predicted (in percent, right axis). As you can see, there is a clear anti-correlation between the two.

Looking into the more detailed perf report, I found that basically all of these branch mispredictions happen in the innermost loop in Worker::operator(). If the taken/not-taken pattern of the loop branch becomes too long, the branch predictor can no longer keep track of it, so the exit branch of the inner loop gets mispredicted, leading to the sharp drop in throughput. As the number of iterations increases further, the impact of this single misprediction becomes less significant, leading to the slow recovery of the throughput.

For further information on the erratic behavior before the drop see the comments made by @PeterCordes below.

In any case, the best way to avoid branch mispredictions is to avoid branches, so I manually unrolled the loop in Worker::operator(), for example:

void operator()(){
    for(size_t i=0;i+3*BLOCK_SIZE<n;i+=BLOCK_SIZE*4){
         result+=mem[i];
         result+=mem[i+BLOCK_SIZE];
         result+=mem[i+2*BLOCK_SIZE];
         result+=mem[i+3*BLOCK_SIZE];
    }
}

Unrolling 2, 3, 4, 6 or 8 iterations gives the results below. Note that I did not correct for the blocks at the end of the vector which are ignored due to the unrolling. Therefore the periodic peaks in the blue line should be ignored; the lower-bound baseline of the periodic pattern is the actual bandwidth.

[Plots: bandwidth (blue) and branch-miss fraction (red) vs. size, for unroll factors 2, 3, 4, 6 and 8]

As you can see, the fraction of branch mispredictions didn't really change, but because the total number of branches is reduced by the unroll factor, they no longer contribute strongly to the performance.
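
If one wanted to get rid of the periodic peaks mentioned above, the blocks skipped at the end of the vector could be accumulated in a short scalar tail loop after the unrolled one. A minimal sketch (not part of the measured code):

void operator()(){
    size_t i=0;
    //unrolled part: four blocks per iteration
    for(;i+3*BLOCK_SIZE<n;i+=BLOCK_SIZE*4){
         result+=mem[i];
         result+=mem[i+BLOCK_SIZE];
         result+=mem[i+2*BLOCK_SIZE];
         result+=mem[i+3*BLOCK_SIZE];
    }
    //tail: at most three remaining blocks
    for(;i<n;i+=BLOCK_SIZE){
         result+=mem[i];
    }
}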

There is also the additional benefit that the processor is more free to perform the calculations out of order if the loop is unrolled.

If this is supposed to have a practical application, I would suggest trying to give the hot loop a compile-time-fixed number of iterations or some guarantee on divisibility, so that (maybe with some extra hints) the compiler can decide on the optimal number of iterations to unroll.
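
For example, a hypothetical sketch, assuming the caller can guarantee that the number of blocks is a multiple of the unroll factor (BLOCK_SIZE as in the listing above): making the unroll factor a compile-time constant lets the compiler expand the inner loop itself.

template<size_t UNROLL>
void accumulate_unrolled(const std::vector<unsigned int> &mem, size_t n, unsigned int &result){
    //precondition (guaranteed by the caller): n % (UNROLL*BLOCK_SIZE) == 0
    for(size_t i=0;i<n;i+=UNROLL*BLOCK_SIZE){
        for(size_t j=0;j<UNROLL;j++){      //trip count known at compile time,
            result+=mem[i+j*BLOCK_SIZE];   //so the compiler can fully unroll it
        }
    }
}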
