What is the maximum number of threads that std::async will create and execute asynchronously?

I have a large number (>>100K) of tasks with very high latency (minutes) and very little resource consumption. Potentially they could all be executed in parallel and I was considering using std::async to generate one future for each task.

My question is: what is the maximum number of threads that std::async will create and execute asynchronously? (using g++ 6.x on Ubuntu 16-xx or CentOS 7.x, x86_64)

It is important for me to get that number right because if I do not have enough tasks actually running (waiting) in parallel the cumulative cost of latency will be very high.

To get to an answer, I started by checking the capabilities of the system:

bob@vb:~/programming/cxx/async$ ulimit -u
43735
bob@vb:~/programming/cxx/async$ cat /proc/sys/kernel/threads-max 
87470

From these numbers, I expected to get on the order of 43K threads running (mostly waiting) in parallel. To verify that, I wrote the program below, which calls std::async 100K times with an empty task and reports the number of distinct thread ids and the time required:

#include <thread>
#include <future>
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <string>

std::thread::id foo()
{
    using namespace std::chrono_literals;
    //std::this_thread::sleep_for(2s);
    return std::this_thread::get_id();
}

int main(int argc, char **argv)
{
    if (2 != argc) return 1; // expects exactly one argument: the task count
    const size_t COUNT = std::stoul(argv[1]); // stoul: the count is unsigned
    std::vector<decltype(std::async(foo))> futures;
    futures.reserve(COUNT);
    while (futures.capacity() != futures.size())
    { 
        futures.push_back(std::async(foo));
    } 
    std::vector<std::thread::id> ids;
    ids.reserve(futures.size());
    for (auto &f: futures)
    { 
        ids.push_back(f.get());
    } 
    std::sort(ids.begin(), ids.end());
    const auto end = std::unique(ids.begin(), ids.end());
    ids.erase(end, ids.end());
    std::cerr << "COUNT: " << COUNT << ": ids.size(): " << ids.size() << std::endl;
}

The time was fine but the number of distinct thread ids was much less than expected (32748 instead of 43735):

bob@vb:~/programming/cxx/async$ /usr/bin/time -f "%E" ./testAsync 100000
COUNT: 100000: ids.size(): 32748
0:03.29

Then I uncommented the sleep line in foo to add a 2s sleep. The resulting timings are consistent with 2s up to 10K tasks or so, but at some point beyond that, some tasks end up sharing the same thread id and the elapsed time increases by 2s for each additional task:

bob@vb:~/programming/cxx/async$ /usr/bin/time -f "%E" ./testAsync 10056
COUNT: 10056: ids.size(): 10056
0:02.24
bob@vb:~/programming/cxx/async$ /usr/bin/time -f "%E" ./testAsync 10057
COUNT: 10057: ids.size(): 10057
0:04.27
bob@vb:~/programming/cxx/async$ /usr/bin/time -f "%E" ./testAsync 10058
COUNT: 10058: ids.size(): 10057
0:06.28
bob@vb:~/programming/cxx/async$ ps -eT | wc -l
277

So it looks like, for my problem on this system, the limit is on the order of 10K. I checked on another system and the limit was on the order of 4K.

I can’t figure out:

why these values are so small
how to predict these values from the specs of the system

Answer

With g++ on Linux, the straightforward answer seems to be “the maximum number of threads that can be created before pthread_create fails and returns EAGAIN”. That number can be limited by several different values, and man pthread_create lists three of them:

the RLIMIT_NPROC soft resource limit (4096 on my CentOS 7 server and 43735 on my Ubuntu/VirtualBox laptop)
the value of /proc/sys/kernel/threads-max (2061857 and 87470 resp.)
the value of /proc/sys/kernel/pid_max (40960 and 32768 resp.)
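
The three values above can also be read from inside a program. The following is only a minimal sketch for Linux: the helper names read_proc_value, soft_nproc_limit and thread_limit_upper_bound are made up for illustration, and the result is just an upper bound, since threads and processes that already exist count against these limits too.

```cpp
#include <sys/resource.h>   // getrlimit, RLIMIT_NPROC (Linux)
#include <fstream>
#include <initializer_list>
#include <string>

// Hypothetical helper: returns the integer stored in a /proc file,
// or -1 if the file cannot be opened or parsed.
long read_proc_value(const std::string& path)
{
    std::ifstream in(path);
    long value = -1;
    if (!(in >> value)) return -1;
    return value;
}

// Hypothetical helper: returns the soft RLIMIT_NPROC (the value printed
// by `ulimit -u`), or -1 if it is unlimited or cannot be queried.
long soft_nproc_limit()
{
    rlimit lim{};
    if (getrlimit(RLIMIT_NPROC, &lim) != 0) return -1;
    if (lim.rlim_cur == RLIM_INFINITY) return -1;
    return static_cast<long>(lim.rlim_cur);
}

// Hypothetical helper: smallest of the three limits listed by
// man pthread_create, ignoring any that are unavailable (-1).
// Returns -1 if none of them could be read.
long thread_limit_upper_bound()
{
    long best = -1;
    for (long v : { soft_nproc_limit(),
                    read_proc_value("/proc/sys/kernel/threads-max"),
                    read_proc_value("/proc/sys/kernel/pid_max") })
    {
        if (v >= 0 && (best < 0 || v < best)) best = v;
    }
    return best;
}
```

With the Ubuntu numbers quoted above, the smallest of the three is pid_max (32768), which is at least consistent with the 32748 distinct thread ids seen in the first test of the question.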

There is at least one other possible limit imposed by systemd, as man logind.conf indicates:

UserTasksMax= Sets the maximum number of OS tasks each user may run concurrently. This controls the TasksMax= setting of the per-user slice unit, see systemd.resource-control(5) for details. Defaults to 33%, which equals 10813 with the kernel’s defaults on the host, but might be smaller in OS containers.
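
To see where a given system actually stops creating threads, one option is to count how many futures come back deferred. This is a sketch under an assumption: as far as I can tell, libstdc++ with the default launch policy falls back to deferred execution when thread creation fails with EAGAIN, which would explain the shared thread id and the extra 2s per task in the question. The standard only guarantees that wait_for reports future_status::deferred for a deferred task; the function name count_deferred is made up for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

// Launches n trivial tasks with the given policy and returns how many
// of them were deferred instead of being started on a new thread.
std::size_t count_deferred(std::launch policy, std::size_t n)
{
    std::vector<std::future<int>> futures;
    futures.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        futures.push_back(std::async(policy, [] { return 0; }));
    }

    std::size_t deferred = 0;
    for (auto& f : futures)
    {
        // A zero timeout does not block; a deferred task reports
        // future_status::deferred without being executed.
        if (f.wait_for(std::chrono::seconds(0)) == std::future_status::deferred)
        {
            ++deferred;
        }
        f.get(); // runs the task here if it was deferred
    }
    return deferred;
}
```

Calling count_deferred(std::launch::async | std::launch::deferred, COUNT) with a sleeping task in place of the trivial lambda should give a non-zero count once the effective limit is exceeded. With std::launch::async alone there is no fallback, so std::async is expected to throw std::system_error instead.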

Answer