I have been trying to improve computation times on a project by splitting the work into tasks/threads, and it has not been working out very well. So I decided to make a simple test project to see if I could get it working in a very simple case, and that is not working out as I expected either.
What I have attempted to do is:
- do a task X times in one thread – check the time taken.
- do a task X / Y times in Y threads – check the time taken.
So if 1 thread takes T seconds to do 100’000’000 iterations of “work” then I would expect:
- 2 threads doing 50’000’000 iterations each would take ~ T / 2 seconds
- 3 threads doing 33’333’333 iterations each would take ~ T / 3 seconds
and so on until I reach some threading limit (number of cores or whatever).
So I wrote the code and tested it on my 8-core system (AMD Ryzen) with plenty of RAM (>16GB), doing nothing else at the time.
- 1 thread took: ~6.5s
- 2 threads took: ~6.7s
- 3 threads took: ~13.3s
- 8 threads took: ~16.2s
So clearly something is not right here!
I ported the code to Godbolt and saw similar results. Godbolt only allows 3 threads, and with 1, 2 or 3 threads it takes ~8s to run (this varies by about 1s). Here is the Godbolt live code: https://godbolt.org/z/6eWKWr
Finally here is the code for reference:
```cpp
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

#define randf() ((double) rand()) / ((double) (RAND_MAX))

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;

    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();

    // Do some work for the required number of iterations
    for (auto i = 0u; i < iterations; i++)
    {
        double value = randf();
        double calc = std::atan(value);
        (void) calc;
    }

    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;

    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << std::endl;
}

int main()
{
    // Note: these numbers vary by about 1s, probably due to godbolt server load (?)
    // 1 Thread takes:  ~8s
    // 2 Threads take:  ~8s
    // 3 Threads take:  ~8s
    uint32_t num_threads = 3;           // Max 3 in godbolt
    uint32_t total_work  = 100'000'000;

    // Seed rand
    std::srand(static_cast<unsigned int>(std::chrono::steady_clock::now().time_since_epoch().count()));

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        task_list.emplace_back(std::thread([=]() {
            thread_func(total_work / num_threads, thread_id);
        }));
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;

    return 0;
}
```
Note: I have also tried using std::async, with no difference (not that I was expecting any); a sketch of that variant is below. I also tried compiling for release – no difference.
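For reference, a minimal sketch of what that std::async variant could look like (a sketch of the shape, not the exact code I ran), reusing thread_func and the counters from the code above; these loops stand in for the std::thread launch/join loops in main():

```cpp
#include <future>
#include <vector>

// Replaces the std::thread launch/join loops in main() above.
std::vector<std::future<void>> futures;
for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
{
    // std::launch::async forces a new thread per task instead of deferred execution
    futures.push_back(std::async(std::launch::async, thread_func,
                                 total_work / num_threads, thread_id));
}

// Wait for every task to finish
for (auto &f : futures)
{
    f.get();
}
```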
I have read such questions as why-using-more-threads-makes-it-slower-than-using-less-threads and I can't see an obvious (to me) bottleneck:
- CPU bound (needs lots of CPU resources): I have 8 cores
- Memory bound (needs lots of RAM resources): I have assigned my VM 10GB RAM, running nothing else
- I/O bound (network and/or hard drive resources): no network traffic involved
- There is no sleeping/mutexing going on here (like there is in my real project)
Questions are:
- Why might this be happening?
- What am I doing wrong?
- How can I improve this?
Answer
The `rand` function is not guaranteed to be thread safe. It appears that, in your implementation, it is made thread safe by using a lock or mutex, so multiple threads trying to generate a random number have to take turns. As your loop is mostly just the call to `rand`, the performance suffers with multiple threads.

You can use the facilities of the `<random>` header and have each thread use its own engine to generate the random numbers.
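A minimal sketch of that approach (one possible way to do it, not the only one): each thread constructs its own std::mt19937 engine, seeded from std::random_device, and draws values from a std::uniform_real_distribution, so nothing is shared inside the hot loop:

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Each thread owns its engine and distribution: no shared state, no locking.
    std::mt19937 engine(std::random_device{}());
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    for (uint32_t i = 0; i < iterations; i++)
    {
        double value = dist(engine);
        double calc = std::atan(value);
        (void) calc;
    }
    std::cout << "thread " << thread_id << " done\n";
}

int main()
{
    const uint32_t num_threads = 3;
    const uint32_t total_work  = 100'000'000;

    std::vector<std::thread> tasks;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        tasks.emplace_back(thread_func, total_work / num_threads, thread_id);
    }
    for (auto &task : tasks)
    {
        task.join();
    }
    return 0;
}
```

With no shared engine to lock, each thread's loop runs independently, so the run time should scale much closer to the expected T / Y, up to the number of cores.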