Vector Sum using AVX Inline Assembly on XeonPhi

I am new to using the Intel Xeon Phi coprocessor. I want to write code for a simple vector sum using AVX 512-bit instructions. I use k1om-mpss-linux-gcc as the compiler and want to write inline assembly. Here is my code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <assert.h>
#include <stdint.h>

void* aligned_malloc(size_t size, size_t alignment) {

    uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
    uintptr_t t = r + sizeof(uintptr_t);
    uintptr_t o =(t + alignment) & ~(uintptr_t)alignment;
    if (!r) return NULL;
    ((uintptr_t*)o)[-1] = r;
    return (void*)o;
}

int main(int argc, char* argv[])
{
    printf("Starting calculation...n");
    int i;
    const int length = 65536;

    unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);

    for(i=0; i<length; i++){
            A[i] = 1;
            B[i] = 2;
    }

    const int AVXLength = length / 16;
    unsigned char * pA = (unsigned char *) A;
    unsigned char * pB = (unsigned char *) B;
    unsigned char * pC = (unsigned char *) C;
    for(i=0; i<AVXLength; i++ ){
            __asm__("vmovdqa32 %1,%%zmm0n"
                    "vmovdqa32 %2,%%zmm1n"
                    "vpaddd %0,%%zmm0,%%zmm1;"
            : "=m" (pC) : "m" (pA), "m" (pB));

            pA += 64;
            pB += 64;
            pC += 64;
    }

    // To prove that the program actually worked
    for (i=0; i <5 ; i++)
    {
            printf("C[%d] = %fn", i, C[i]);
    }

}

However, when I run the program, I get a segmentation fault from the asm part. Can somebody help me with that?

Thanks


Answer

Although Knights Corner (KNC) does not have AVX512, it has something very similar, and many of the mnemonics are the same. In fact, in the OP's case the mnemonics vmovdqa32 and vpaddd are the same for AVX512 and KNC.

The opcodes likely differ, but the compiler/assembler takes care of this. In the OP's case, he/she is using a special version of GCC, k1om-mpss-linux-gcc, which is part of the Manycore Platform Software Stack (MPSS) for KNC and presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.
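
For example, a minimal build-and-run workflow might look like the following (vecsum.c is just a placeholder name here, and mic0 is the usual default hostname MPSS gives the first card, so adjust both to your setup):

k1om-mpss-linux-gcc vecsum.c -o vecsum   # cross-compile on the host
scp vecsum mic0:                         # copy the binary to the coprocessor
ssh mic0 ./vecsum                        # run it on the card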


As to why the OP's code is failing, I can only guess, since I don't have a KNC card to test with.

In my limited experience with GCC inline assembly I have learned that it’s good to look at the generated assembly in the object file to make sure the compiler did what you expect.
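
For example, assuming the source file is called vecsum.c (a made-up name), you can ask the compiler to emit the assembly and read it directly:

k1om-mpss-linux-gcc -O2 -S vecsum.c   # writes vecsum.s with the generated assembly
grep -n vpaddd vecsum.s               # check what actually came out of the asm block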

When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don’t think the semicolon should be there. That could be one problem.

But since the OP's mnemonics are the same as AVX512's, we can use AVX512 intrinsics to figure out the correct assembly:

#include <x86intrin.h>
void foo(int *A, int *B, int *C) {
    __m512i a16 = _mm512_load_epi32(A);
    __m512i b16 = _mm512_load_epi32(B);
    __m512i s16 = _mm512_add_epi32(a16,b16);
    _mm512_store_epi32(C, s16);
}

and gcc -mavx512f -O3 -S knc.c produces

vmovdqa64   (%rsi), %zmm0
vpaddd      (%rdi), %zmm0, %zmm0
vmovdqa64   %zmm0, (%rdx)

GCC chose vmovdqa64 instead of vmovdqa32, even though the Intel documentation says it should be vmovdqa32. I am not sure why, and I don't know what the difference is. I could have used the intrinsic _mm512_load_si512, which does exist and, according to Intel, should map to vmovdqa32, but GCC maps it to vmovdqa64 as well. I am also not sure why there are now separate _mm512_load_epi32 and _mm512_load_epi64 intrinsics; SSE and AVX don't have corresponding intrinsics.
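
For what it's worth, the same function written with the si512 variants (the name foo_si512 is just for this example) produces the same vmovdqa64 instructions with GCC:

#include <x86intrin.h>
void foo_si512(int *A, int *B, int *C) {
    __m512i a16 = _mm512_load_si512(A);   /* Intel documents vmovdqa32, GCC emits vmovdqa64 */
    __m512i b16 = _mm512_load_si512(B);
    _mm512_store_si512(C, _mm512_add_epi32(a16, b16));
}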

Based on GCC’s code here is the inline assembly I would use

__asm__ ("vmovdqa64   (%1), %%zmm0n"
        "vpaddd      (%2), %%zmm0, %%zmm0n"
        "vmovdqa64   %%zmm0, (%0)"
        :
        : "r" (pC), "r" (pA), "r" (pB)
        : "memory"
);

Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.

I used the register modifier r instead of the memory modifier m because, from past experience, the memory modifier m did not produce the assembly I expected.
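
To put that in context, here is roughly how it would slot into the OP's loop; this is only a sketch, untested since I don't have a KNC card (note also that the result should be printed with %u, not %f, since C holds unsigned ints):

for(i=0; i<AVXLength; i++){
        __asm__ ("vmovdqa64   (%1), %%zmm0\n"
                "vpaddd      (%2), %%zmm0, %%zmm0\n"
                "vmovdqa64   %%zmm0, (%0)"
                :
                : "r" (pC), "r" (pA), "r" (pB)
                : "memory");
        pA += 64;   // 64 bytes = 16 x 32-bit elements per zmm register
        pB += 64;
        pC += 64;
}

for (i=0; i<5; i++)
        printf("C[%d] = %u\n", i, C[i]);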


Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example

gcc-5.1 -O3 -S foo.c
k1om-mpss-linux-gcc foo.s

This may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.


As explained here, the reason the AVX512 intrinsics

_mm512_load/store(u)_epi32
_mm512_load/store(u)_epi64
_mm512_load/store(u)_si512

exist is that the parameters have been converted to void*. For example, with SSE you have to cast:

int *x;
__m128i v;
_mm_store_si128((__m128i*)x, v);

whereas with AVX512 you no longer need to:

int *x;
__m512i v;
_mm512_store_epi32(x, v);
//_mm512_store_si512(x, v); //this is also fine

It's still not clear to me why there are both vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently), but it's probably similar to movaps and movapd in SSE, which have no real difference and exist only in case they may make a difference in the future.


The purpose of vmovdqa32 and vmovdqa64 is masking, which can be done with these intrinsics:

_mm512_mask_load/store_epi32
_mm512_mask_load/store_epi64

Without masks the instructions are equivalent.
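
Here is a small illustration of what the masked forms buy you, assuming an AVX512-capable compiler (the function name masked_add and the mask value are made up for this example; A, B, and C must be 64-byte aligned for the aligned load/store forms):

#include <x86intrin.h>

void masked_add(int *A, int *B, int *C) {
    __mmask16 k = 0x00FF;                                    // only the low 8 of 16 lanes are active
    __m512i a16 = _mm512_load_epi32(A);
    __m512i b16 = _mm512_load_epi32(B);
    __m512i s16 = _mm512_mask_add_epi32(a16, k, a16, b16);   // inactive lanes keep a16's values
    _mm512_mask_store_epi32(C, k, s16);                      // only the active lanes are written to C
}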
