I posted this issue yesterday, but it wasn't well received. I have a solid repro now, so please bear with me. System specs: Tesla K20m with the 331.67 driver, CUDA 6.0, on a Linux machine. My application is heavy on global memory reads, so I tried to optimize it by using the __ldg instruction in every single place where I am reading global