SSE: unaligned load and store that crosses page boundary

Question

I read somewhere that before performing an unaligned load or store next to a page boundary (e.g. using the _mm_loadu_si128 / _mm_storeu_si128 intrinsics), code should first check whether the whole vector (in this case 16 bytes) belongs to the same page, and switch to non-vector instructions if not. I understand that this is needed to prevent a coredump if the next page does not belong to

Accepted Answer

Page-line splits are bad for performance, but don't affect the correctness of unaligned accesses. When you know the length ahead of time, it is enough to make sure you don't read past the end of the buffer.

For correctness, you mostly need to worry about it when implementing something like strlen, where your loop stops when you find a sentinel value. That value could be at any position within your vector, so just doing 16B unaligned loads will read past the end of the array. If the terminating 0 is in the last byte of one page, the next page is not readable, and your current-position pointer is unaligned, then a load that includes the 0 byte will also include bytes from the unreadable page, so it will fault.

One solution is to do scalar until your pointer is aligned, then load aligned vectors. An aligned load always comes entirely from one page, and also from one cache line. So even though you will read some bytes past the end of the string, you are guaranteed not to fault. Valgrind might be unhappy about it, but standard library strlen implementations use this.

Instead of scalar until an aligned pointer, you could do an unaligned vector load from the start of the string (as long as that won't cross a page line), and then do aligned loads. The first aligned load will overlap the first unaligned load, but that's totally fine for a function like strlen that doesn't care if it sees the same data twice.

It might be worth avoiding page-line splits for performance reasons. Even if you know your src pointer is misaligned, it's often faster to let the hardware handle cache-line splits. But before Skylake, page splits have an extra ~100c latency (down to 5c in Skylake). If you have multiple pointers that can be aligned differently relative to each other, you can't always just use a prologue to align your src (e.g. c[i] = a[i] + b[i], where c is aligned but b isn't).

In that case, it might be worth using a branch to do aligned loads from before and after the page split, and combine them with palignr.

A branch mispredict (~15c) is cheaper than the page-split latency, but it delays everything (not just the load). So it might also not be worth it, depending on the hardware and the ratio of computation to memory access.

If you're writing a function that is usually called with aligned pointers, it makes sense to just use unaligned load/store instructions. Any prologue to detect misalignment is just extra overhead for the already-aligned case, and on modern hardware (Nehalem and newer), unaligned loads on addresses that turn out to be aligned at runtime have identical performance to aligned load instructions. (But you need AVX for unaligned loads to fold into other instructions as memory operands, e.g. vpxor xmm0, xmm1, [rsi].)

By adding code to handle misaligned inputs, you're slowing down the common aligned case to speed up the uncommon misaligned case. Fast hardware support for unaligned loads/stores lets software just leave that to the hardware for the few cases where it does happen.

(If misaligned inputs are common, then it is worth using a prologue to align your input pointer, especially if you're using AVX. Sequential 32B AVX loads will cache-line split every other load.)

See Agner Fog's Optimizing Assembly guide for more info, and other links in the x86 tag wiki.
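
To make the strlen discussion above concrete, here is a minimal sketch in C with SSE2 intrinsics. It is not how any particular libc does it. The names sse2_strlen, load16_crosses_page and PAGE_SIZE_ASSUMED are made up for illustration, the 4096-byte page size is an assumption, and __builtin_ctz assumes GCC or Clang.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */

/* 1 if a 16-byte load starting at p would touch two pages, else 0. */
static int load16_crosses_page(const void *p)
{
    return ((uintptr_t)p & (PAGE_SIZE_ASSUMED - 1)) > PAGE_SIZE_ASSUMED - 16;
}

/* Hypothetical strlen sketch: one unaligned load if it is safe,
   otherwise scalar until aligned, then aligned 16-byte loads. */
size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;

    if (!load16_crosses_page(s)) {
        /* The whole 16 bytes lie in the page that holds s[0], so this
           load cannot fault even if it reads past the terminator. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask)
            return (size_t)__builtin_ctz(mask);
        /* Next 16-byte boundary above s; the first aligned load may
           overlap bytes already checked, which is harmless here. */
        p = (const char *)(((uintptr_t)s + 16) & ~(uintptr_t)15);
    } else {
        /* Too close to the end of the page: go byte by byte until aligned. */
        while ((uintptr_t)p & 15) {
            if (*p == 0)
                return (size_t)(p - s);
            ++p;
        }
    }

    for (;;) {
        /* An aligned 16-byte load never spans two pages. */
        __m128i v = _mm_load_si128((const __m128i *)p);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```

The over-read inside an aligned block cannot fault, but it is still an out-of-bounds read as far as ISO C and tools like Valgrind are concerned, so treat this as an illustration of the idea rather than portable code.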
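
For the known-length case, "just use unaligned loads and don't read past the end" might look like the following sketch for c[i] = a[i] + b[i]. The function name add_f32 is hypothetical, and float elements are an arbitrary choice for the example.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* c[i] = a[i] + b[i] for n floats. No alignment is assumed; the
   hardware handles any cache-line or page splits. */
void add_f32(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Full vectors only, so no load or store goes past element n-1. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail for the last 0 to 3 elements. */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

There is no alignment prologue, so callers that pass aligned pointers pay nothing extra, and misaligned callers simply take whatever split penalty the hardware has.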

Answer