I am new to use XeonPhi Intel co-processor. I want to write code for a simple Vector sum using AVX 512 bit instructions. I use k1om-mpss-linux-gcc as a compiler and want to write inline assembly. Here it is my code:
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #include <assert.h> #include <stdint.h> void* aligned_malloc(size_t size, size_t alignment) { uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t)); uintptr_t t = r + sizeof(uintptr_t); uintptr_t o =(t + alignment) & ~(uintptr_t)alignment; if (!r) return NULL; ((uintptr_t*)o)[-1] = r; return (void*)o; } int main(int argc, char* argv[]) { printf("Starting calculation...n"); int i; const int length = 65536; unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64); unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64); unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64); for(i=0; i<length; i++){ A[i] = 1; B[i] = 2; } const int AVXLength = length / 16; unsigned char * pA = (unsigned char *) A; unsigned char * pB = (unsigned char *) B; unsigned char * pC = (unsigned char *) C; for(i=0; i<AVXLength; i++ ){ __asm__("vmovdqa32 %1,%%zmm0n" "vmovdqa32 %2,%%zmm1n" "vpaddd %0,%%zmm0,%%zmm1;" : "=m" (pC) : "m" (pA), "m" (pB)); pA += 64; pB += 64; pC += 64; } // To prove that the program actually worked for (i=0; i <5 ; i++) { printf("C[%d] = %fn", i, C[i]); } }
However when I run the program, I’ve got segmentation fault from the asm part. Can somebody help me with that???
Thanks
Advertisement
Answer
Although Knights Corner (KNC) does not have AVX512 it has something very similar. Many of the mnemonics are the same. In fact, in the OP’s case the mnemoics vmovdqa32 and vpaddd are the same for AVX512 and KNC.
The opcodes likely differ but the compiler/assembler takes care of this. In the OPs case he/she is using a special version of GCC, k1om-mpss-linux-gcc
which is part of the many core software stack KNC which presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc
and then scp
the binary to the KNC card. I learned about this from a comment in this question.
As to why the OPs code is failing I can only make guess since I don’t have a KNC card to test with.
In my limited experience with GCC inline assembly I have learned that it’s good to look at the generated assembly in the object file to make sure the compiler did what you expect.
When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;"
produces assembly with the semicolon. I don’t think the semicolon should be there. That could be one problem.
But since the OPs mnemonics are the same as AVX512 we can using AVX512 intrinsics to figure out the correct assembly
#include <x86intrin.h> void foo(int *A, int *B, int *C) { __m512i a16 = _mm512_load_epi32(A); __m512i b16 = _mm512_load_epi32(B); __m512i s16 = _mm512_add_epi32(a16,b16); _mm512_store_epi32(C, s16); }
and gcc -mavx512f -O3 -S knc.c
procudes
vmovdqa64 (%rsi), %zmm0 vpaddd (%rdi), %zmm0, %zmm0 vmovdqa64 %zmm0, (%rdx)
GCC chose vmovdqa64
instead of vmovdqa32
even though the Intel documentaion says it should be vmovdqa32
. I am not sure why. I don’t know what the difference is. I could have used the intrinsic _mm512_load_si512
which does exist and according to Intel should map vmovdqa32
but GCC maps it to vmovdqa64
as well. I am not sure why there are also _mm512_load_epi32
and _mm512_load_epi64
now. SSE and AVX don’t have these corresponding intrinsics.
Based on GCC’s code here is the inline assembly I would use
__asm__ ("vmovdqa64 (%1), %%zmm0n" "vpaddd (%2), %%zmm0, %%zmm0n" "vmovdqa64 %%zmm0, (%0)" : : "r" (pC), "r" (pA), "r" (pB) : "memory" );
Maybe vmovdqa32
should be used instead of vmovdqa64
but I expect it does not matter.
I used the register modifier r
instead of the memory modifier m
because from past experience m
the memory modifier did not produce the assembly I expected.
Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example
gcc-5.1 -O3 -S foo.c k1om-mpss-linux-gcc foo.s
This may be asking for trouble since k1om-mpss-linux-gcc
is likely an older version of GCC. I have never done something like this before but it may work.
As explained here the reason the AVX512 intrinsics
_mm512_load/store(u)_epi32 _mm512_load/store(u)_epi64 _mm512_load/store(u)_si512
is that the parameters have been converted to void*
. For example with SSE you have to cast
int *x; __m128i v; __mm_store_si128((__m128*)x,v)
whereas with SSE you no longer need to
int *x; __m512i; __mm512_store_epi32(x,v); //__mm512_store_si512(x,v); //this is also fine
It’s still not clear to me why there is vmovdqa32
and vmovdqa64
(GCC only seems to use vmovdqa64
currently) but it’s probably similar to movaps
and movapd
in SSE which have not real difference and exists only in case they may make a difference in the future.
The purpose of vmovdqa32
and vmovdqa64
is for masking which can be doing with these intrsics
_mm512_mask_load/store_epi32 _mm512_mask_load/store_epi64
Without masks the instructions are equivalent.