Vector Sum using AVX Inline Assembly on XeonPhi

Question

I am new to use XeonPhi Intel co-processor. I want to write code for a simple Vector sum using AVX 512 bit instructions. I use k1om-mpss-linux-gcc as a compiler and want to write inline assembly. Here it is my code: However when I run the program, I&#8217;ve got segmentation fault from the asm part. Can someb…

Accepted Answer

Although Knights Corner (KNC) does not have AVX512 it has something very similar. Many of the mnemonics are the same. In fact, in the OP&#8217;s case the mnemoics vmovdqa32 and vpaddd are the same for AVX512 and KNC.The opcodes likely differ but the compiler/assembler takes care of this. In the OPs case he/she is using a special version of GCC, k1om-mpss-linux-gcc  which is part of the many core software stack KNC which presumably generates the correct opcodes.  One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.As to why the OPs code is failing I can only make guess since I don&#8217;t have a KNC card to test with.In my limited experience with GCC inline assembly I have learned that it&#8217;s good to look at the generated assembly in the object file to make sure the compiler did what you expect.  When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don&#8217;t think the semicolon should be there. That could be one problem.But since the OPs mnemonics are the same as AVX512 we can using AVX512 intrinsics to figure out the correct assembly#include <x86intrin.h>void foo(int *A, int *B, int *C) {    __m512i a16 = _mm512_load_epi32(A);    __m512i b16 = _mm512_load_epi32(B);    __m512i s16 = _mm512_add_epi32(a16,b16);    _mm512_store_epi32(C, s16);}and gcc -mavx512f -O3 -S knc.c procudes vmovdqa64   (%rsi), %zmm0vpaddd      (%rdi), %zmm0, %zmm0vmovdqa64   %zmm0, (%rdx)GCC chose vmovdqa64 instead of vmovdqa32 even though the Intel documentaion says it should be vmovdqa32. I am not sure why. I don&#8217;t know what the difference is. I could have used the intrinsic _mm512_load_si512 which does exist and according to Intel should map vmovdqa32 but GCC maps it to vmovdqa64 as well.  I am not sure why there are also  _mm512_load_epi32 and _mm512_load_epi64 now. SSE and AVX don&#8217;t have these corresponding intrinsics.Based on GCC&#8217;s code here is the inline assembly I would use__asm__ ("vmovdqa64   (%1), %%zmm0n"        "vpaddd      (%2), %%zmm0, %%zmm0n"        "vmovdqa64   %%zmm0, (%0)"        :        : "r" (pC), "r" (pA), "r" (pB)        : "memory");Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.I used the register modifier r instead of the memory modifier m because from past experience m the memory modifier did not produce the assembly I expected.Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary.  For examplegcc-5.1 -O3 -S foo.ck1om-mpss-linux-gcc foo.sThis may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.As explained here the reason the AVX512 intrinsics _mm512_load/store(u)_epi32_mm512_load/store(u)_epi64_mm512_load/store(u)_si512is that the parameters have been converted to void*.  For example with SSE you have to castint *x;__m128i v;__mm_store_si128((__m128*)x,v)whereas with SSE you no longer need toint *x;__m512i;__mm512_store_epi32(x,v);//__mm512_store_si512(x,v); //this is also fineIt&#8217;s still not clear to me why there is vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently) but  it&#8217;s probably similar to movaps and movapd in SSE which have not real difference and exists only in case they may make a difference in the future.The purpose of vmovdqa32 and vmovdqa64 is for masking which can be doing with these intrsics_mm512_mask_load/store_epi32_mm512_mask_load/store_epi64Without masks the instructions are equivalent.

Advertisement

Answer