__ldg causes slower execution time in a certain situation

I already posted this issue yesterday, but it wasn't well received; I have a solid repro now, so please bear with me. Here are the system specs:

  • Tesla K20m with 331.67 driver,
  • CUDA 6.0,
  • Linux machine.

Now, I have a global-memory-read-heavy application, so I tried to optimize it by using the __ldg intrinsic in every place where I read global memory. However, __ldg did not improve performance at all; on the contrary, the running time increased roughly 4x. So my question is: how can replacing glob_mem[index] with __ldg(glob_mem + index) possibly result in decreased performance? Here is a primitive version of my problem for you to reproduce:
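To make the substitution concrete, here is an isolated illustration of what I mean (the function names below are made up for this post, not taken from my application):

// Illustration only: the same reduction loop written with a plain global
// load and with __ldg (read-only data cache path, available on sm_35+).
__device__ double sum_plain(const double* glob_mem, const int count)
{
    double result = 0.0;
    for (int index = 0; index < count; ++index)
        result += glob_mem[index];          // ordinary global memory read
    return result;
}

__device__ double sum_ldg(const double* glob_mem, const int count)
{
    double result = 0.0;
    for (int index = 0; index < count; ++index)
        result += __ldg(glob_mem + index);  // read via the read-only cache
    return result;
}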

Makefile

CPP=g++
CPPFLAGS=-Wall -O4 -std=c++0x -lcudart -lcurand
LIBDIRS=/usr/local/cuda/lib64
NVCC=nvcc
NVCCINCLUDE=/usr/local/cuda/include
NVCC_COMPILER_FLAGS=-Iinclude/ -O4 -arch compute_35 -code sm_35 -c
TARGET=example

.PHONY: all clear clean purge

all: $(TARGET)

$(TARGET): kernel.o main.cpp
    @echo Linking executable "$(TARGET)" ...
    @$(CPP) $(CPPFLAGS) $(addprefix -I,$(NVCCINCLUDE)) $(addprefix -L,$(LIBDIRS)) -o $@ $^

kernel.o: kernel.cu
    @echo Compiling "$@" ...
    $(NVCC) $(addprefix -I,$(NVCCINCLUDE)) $(NVCC_COMPILER_FLAGS) $< -o $@

clean: clear

clear:
    @echo Removing object files ...
    -@rm -f *.o

purge: clear
    @echo Removing executable ...
    -@rm -f $(TARGET)

main.cpp

#include <chrono>
#include <cstdio>

#include "kernel.cuh"

using namespace std;

int main()
{
    auto start = chrono::high_resolution_clock::now();
    double result = GetResult();
    auto elapsed = chrono::high_resolution_clock::now() - start;

    printf("%.3f, elapsed time: %.3f n", result, (double)chrono::duration_cast<std::chrono::microseconds>(elapsed).count());
    return 0;
}

kernel.cuh

#ifndef kernel_cuh
#define kernel_cuh

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

double GetResult();

#endif

kernel.cu

#include "kernel.cuh"

class DeviceClass
{
    double* d_a;
public:
    __device__ DeviceClass(double* a)
        : d_a(a) {}

    __device__ void foo(double* b, const int count)
    {
        int tid = threadIdx.x + (blockDim.x * blockIdx.x);
        double result = 0.0;
        for (int i = 0; i < count; ++i)
        {
            result += d_a[i];
            //result += __ldg(d_a + i);
        }

        b[tid] = result;
    }
};

__global__ void naive_kernel(double* c, const int count, DeviceClass** deviceClass)
{
    (*deviceClass)->foo(c, count);
}

__global__ void create_device_class(double* a, DeviceClass** deviceClass)
{
    (*deviceClass) = new DeviceClass(a);
}

double GetResult()
{
    const int aSize = 8388608;
    const int gridSize = 8;
    const int blockSize = 1024;

    double* h_a = new double[aSize];
    for (int i = 0; i < aSize; ++i)
    {
        h_a[i] = aSize - i;
    }

    double* d_a;
    cudaMalloc((void**)&d_a, aSize * sizeof(double));
    cudaMemcpy(d_a, h_a, aSize * sizeof(double), cudaMemcpyHostToDevice);

    double* d_b;
    cudaMalloc((void**)&d_b, gridSize * blockSize * sizeof(double));

    DeviceClass** d_devicesClasses;
    cudaMalloc(&d_devicesClasses, sizeof(DeviceClass**));
    create_device_class<<<1,1>>>(d_a, d_devicesClasses);

    naive_kernel<<<gridSize, blockSize>>>(d_b, aSize, d_devicesClasses);
    cudaDeviceSynchronize();

    double h_b;
    cudaMemcpy(&h_b, d_b, sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    return h_b;
}

So, what is this all about: in my application, I have some global data pointed to by a member variable of class DeviceClass, which is created on the device exactly as the CUDA new/delete demo shows.
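(One remark on the repro: it never frees the object created with device-side new. If you want to clean up, a kernel roughly like the sketch below would do it; it is not needed to reproduce the timing difference.)

// Sketch only, not part of the repro: free the device-new'd object.
__global__ void delete_device_class(DeviceClass** deviceClass)
{
    delete *deviceClass;
}

// Host side, after the final cudaMemcpy in GetResult():
//     delete_device_class<<<1, 1>>>(d_devicesClasses);
//     cudaDeviceSynchronize();
//     cudaFree(d_devicesClasses);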

  • Build this using make and then execute ./example.
  • Running the example as-is yields: “35184376283136.000, elapsed time: 2054676.000”.
  • After I uncomment line 17 in kernel.cu (the __ldg read) and comment out the line right above it, the result becomes: “35184376283136.000, elapsed time: 3288975.000”.
  • So using __ldg degrades performance quite significantly, even though I have used it on other occasions without any issue. What could be the cause?

Answer

The version using __ldg is slower because the NVCC compiler is unable to apply its loop-unrolling optimization correctly in this particular scenario. The issue was submitted to NVIDIA under ID 1605303. The most recent response from the NVIDIA team is as follows:

Although we have not communicated this to you, we have done prior investigations into your issue. The solution to your issue is improving our loop-unrolling heuristic in the back-end compiler – the compiler embedded inside of ptxas. We evaluated the possibility of resolving this issue in CUDA 8.0, but an initial solution that fixed your problem caused unacceptable regressions. Because of other constraints, we were not able to develop a proper solution in time for it to be done for CUDA 8.0.

We are actively working to resolve your issue in a future CUDA release (the one following CUDA 8.0). We will ensure that we keep you informed of our progress moving forward.
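In the meantime, a possible workaround (my addition, not part of NVIDIA's response, and untested on this exact repro) is to take the unrolling decision away from the ptxas heuristic with an explicit #pragma unroll, or to drop the explicit __ldg and instead declare the pointer const and __restrict__ so the compiler can generate the read-only-cache loads itself. A sketch, using a free-function version of foo() for illustration:

// Sketch of two workarounds (free-function version of foo(), for illustration only):
// (1) request a fixed unroll factor instead of relying on the ptxas heuristic,
// (2) declare the input const __restrict__ so the compiler may emit LDG loads itself.
__device__ void foo_workaround(const double* __restrict__ d_a, double* b, const int count)
{
    int tid = threadIdx.x + (blockDim.x * blockIdx.x);
    double result = 0.0;

    #pragma unroll 4                 // explicit unroll factor; tune for the target GPU
    for (int i = 0; i < count; ++i)
    {
        result += d_a[i];            // eligible for read-only-cache (LDG) loads
    }

    b[tid] = result;
}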
