I have been trying to improve computation times on a project by splitting the work into tasks/threads, and it has not been working out very well. So I decided to make a simple test project to see if I could get it working in a very simple case, and that is not working out as I expected either.
What I have attempted to do is:
- do a task X times in one thread – check the time taken.
- do a task X / Y times in Y threads – check the time taken.
So if 1 thread takes T seconds to do 100’000’000 iterations of “work” then I would expect:
- 2 threads doing 50’000’000 iterations each would take ~ T / 2 seconds
- 3 threads doing 33’333’333 iterations each would take ~ T / 3 seconds
and so on until I reach some threading limit (number of cores or whatever).
So I wrote the code and tested it on my 8-core system (AMD Ryzen) with plenty of RAM (>16GB), doing nothing else at the time.
- 1 thread took: ~6.5s
- 2 threads took: ~6.7s
- 3 threads took: ~13.3s
- 8 threads took: ~16.2s
So clearly something is not right here!
I ported the code to Godbolt and saw similar results. Godbolt only allows 3 threads, and with 1, 2 or 3 threads it takes ~8s to run (this varies by about 1s). Here is the Godbolt live code: https://godbolt.org/z/6eWKWr
Finally here is the code for reference:
```cpp
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

#define randf() ((double) rand()) / ((double) (RAND_MAX))

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;

    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();

    // Do some work for the required number of iterations
    for (auto i = 0u; i < iterations; i++)
    {
        double value = randf();
        double calc = std::atan(value);
        (void) calc;
    }

    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;

    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << std::endl;
}

int main()
{
    // Note: these numbers vary by about 1s, probably due to godbolt server load (?)
    // 1 Thread takes:  ~8s
    // 2 Threads take:  ~8s
    // 3 Threads take:  ~8s
    uint32_t num_threads = 3;           // Max 3 in godbolt
    uint32_t total_work  = 100'000'000;

    // Seed rand
    std::srand(static_cast<unsigned int>(std::chrono::steady_clock::now().time_since_epoch().count()));

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        task_list.emplace_back(std::thread([=]() {
            thread_func(total_work / num_threads, thread_id);
        }));
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;

    return 0;
}
```
Note: I have also tried using std::async, with no difference (not that I was expecting any); a sketch of that variant is below. I also tried compiling for release – no difference.
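For reference, a minimal sketch of what that std::async variant could look like (a sketch of the shape, not the exact code I ran), reusing thread_func and the counters from the code above; these loops stand in for the std::thread launch/join loops in main():

```cpp
#include <future>
#include <vector>

// Replaces the std::thread launch/join loops in main() above.
std::vector<std::future<void>> futures;
for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
{
    // std::launch::async forces a new thread per task instead of deferred execution
    futures.push_back(std::async(std::launch::async, thread_func,
                                 total_work / num_threads, thread_id));
}

// Wait for every task to finish
for (auto &f : futures)
{
    f.get();
}
```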
I have read such questions as why-using-more-threads-makes-it-slower-than-using-less-threads and I can't see an obvious (to me) bottleneck:
- CPU bound (needs lots of CPU resources): I have 8 cores
- Memory bound (needs lots of RAM resources): I have assigned my VM 10GB RAM, running nothing else
- I/O bound (network and/or hard drive resources): no network traffic involved
- There is no sleeping/mutexing going on here (like there is in my real project)
Questions are:
- Why might this be happening?
- What am I doing wrong?
- How can I improve this?
Answer
The `rand` function is not guaranteed to be thread safe. It appears that, in your implementation, it is made thread safe by using a lock or mutex, so multiple threads trying to generate a random number have to take turns. As your loop is mostly just the call to `rand`, the performance suffers with multiple threads.

You can use the facilities of the `<random>` header and have each thread use its own engine to generate the random numbers.
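A minimal sketch of that approach (one possible way to do it, not the only one): each thread constructs its own std::mt19937 engine, seeded from std::random_device, and draws values from a std::uniform_real_distribution, so nothing is shared inside the hot loop:

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Each thread owns its engine and distribution: no shared state, no locking.
    std::mt19937 engine(std::random_device{}());
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    for (uint32_t i = 0; i < iterations; i++)
    {
        double value = dist(engine);
        double calc = std::atan(value);
        (void) calc;
    }
    std::cout << "thread " << thread_id << " done\n";
}

int main()
{
    const uint32_t num_threads = 3;
    const uint32_t total_work  = 100'000'000;

    std::vector<std::thread> tasks;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        tasks.emplace_back(thread_func, total_work / num_threads, thread_id);
    }
    for (auto &task : tasks)
    {
        task.join();
    }
    return 0;
}
```

With no shared engine to lock, each thread's loop runs independently, so the run time should scale much closer to the expected T / Y, up to the number of cores.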