I just finished installing a desktop computer based on an AMD Ryzen 2700x and 32GB RAM (running Ubuntu 18.04). At work, I have a 3-year-old laptop workstation with an Intel i7-6820HQ and 16GB RAM (running Windows 10).
I installed Anaconda on both platforms and ran a custom Python code which relies heavily on basic numpy matrix operations. The code does not involve any GPU-specific computation (my work laptop does not have any). The Ryzen is running at 3.7GHz, the laptop i7 is running at 3.6GHz. Both systems have been fully updated.
To my surprise, the code runs in 5 minutes on my work laptop, while it requires 10 minutes on the Ryzen desktop!
The latest Ryzen 2700x is supposed to be much faster than a high-end 3-year-old laptop Intel processor, then why would it be 2x slower?
Is it due to Ubuntu being sub-optimal in some way as opposed to Windows 10 for the Ryzen?
Is it due to Intel being more adequate to Python simulations than AMD?
Anything else?
Thanks for your help in understanding what is going on.
Advertisement
Answer
numpy matrix operations
Intel Skylake has significantly better FMA throughput (2 per clock 256-bit vector) than Ryzen (2 per clock 128-bit vector or 1 per clock 256-bit vector). See https://agner.org/optimize/ for x86 microarch details. And FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2 for a summary including Ryzen.
With data hot in cache, which a well-optimized matmul can achieve with cache-blocking, a good matmul can bottleneck on FMA execution unit throughput.
Or L1d SIMD load/store bandwidth, where Skylake > 2x Ryzen, being able to sustain close to 2x 256-bit load + 1x 256-bit store, while Ryzen can sustain 2x 128-bit cache accesses, up to one of which can be a store.
So it’s totally reasonable for the single-threaded or per-core throughput for Intel to be twice that of a Ryzen core, for matmul / FMA throughput.
Are you multi-threading to take advantage of all cores in each machine? 2700x is an 8-core CPU, while 6820HQ is a 4-core chip.
If your workload can / is taking advantage of multiple cores, then maybe it’s an L3 cache bandwidth limitation that’s making the difference, assuming they’re both configured correctly and actually running at 3.6 / 3.7 GHz. Or maybe there’s something creating a 4x per-core perf difference.