Summary
I am trying to understand the limits of my compute resources when performing multiple simulations. My task is trivial in terms of parallelisation – I need to run a large number of simple independent simulations, i.e. each simulation program does not rely on another for information. Each simulation has roughly the same running time. For this purpose I have created an experiment that is detailed below.
Details
I have two shell scripts located in the same directory.
First script, called `simple`:

```shell
#!/bin/bash
# Simple Script
echo "Running sleep with arg= $1"
sleep 5s
echo "Finished sleeping with arg= $1"
```
Second script, called `runall`:

```shell
#!/bin/bash
export PATH="$PATH:./"

# Fork off a new process for each program by running it in the background.
# Run N processes at a time and wait until all of them have finished
# before executing the next batch. This is sub-optimal if the running
# time of each process varies significantly.
# Note: if the total number of processes is not divisible by the allotted
# pool size, the final partial batch is never waited for.
echo "Executing runall script..."
for ARG in $(seq 600); do
    simple $ARG &
    NPROC=$(($NPROC+1))
    if [ "$NPROC" -ge 300 ]; then
        wait
        echo "New batch"
        NPROC=0
    fi
done
```
Here are some specs for my computer (Mac OS X):

```shell
$ ulimit -u
709
$ sysctl hw.ncpu
hw.ncpu: 8
$ sysctl hw.physicalcpu
hw.physicalcpu: 4
```
From this I interpret that I have 709 processes at my disposal and 8 processor cores available.
However, when I execute

```shell
$ ./runall
```

I eventually end up with:

```
...
Running sleep with arg= 253
Running sleep with arg= 254
Running sleep with arg= 255
Running sleep with arg= 256
Running sleep with arg= 257
Running sleep with arg= 258
./runall: fork: Resource temporarily unavailable
Running sleep with arg= 259
./simple: fork: Resource temporarily unavailable
Running sleep with arg= 260
$ Running sleep with arg= 261
Finished sleeping with arg= 5
Finished sleeping with arg= 7
Finished sleeping with arg= 4
Finished sleeping with arg= 8
Finished sleeping with arg= 3
...
```
SO:
Question 1
Does this mean that, out of the 709 processes available, only 258 can be dedicated to my `runall` program, with the rest presumably in use by other processes on my computer?
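As a quick sanity check on that interpretation, you can compare the per-user process limit with how many processes are already running under your user (a minimal sketch; `ps` output formatting differs slightly between macOS and Linux, but these flags work on both):

```shell
#!/bin/bash
# Per-user process limit (the 709 above).
ulimit -u
# Number of processes currently running under this user;
# the difference is roughly what is left for runall to fork.
ps -u "$(id -un)" -o pid= | wc -l
```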
Question 2
I substituted the `simple` script with something that does more than just sleep (it reads a file and processes the data in the file to create a graph), and now I start to notice some differences. With the help of

```shell
$ time ./runall
```

I can get the total run time. Before, when calling `simple` for up to 258 processes, I always got a run time of about 5s:

```
real    0m5.071s
user    0m0.184s
sys     0m0.263s
```

i.e., running many simulations in parallel gave the same runtime as a single simulation. However, now that I am calling a more complex program instead of `simple`, I get a longer total run time than the single-simulation time (calling a single simulation takes 1.5s, whereas 20 simulations in parallel take about 8.5s). How do I explain this behavior?
Question 3
I'm not sure how the number of processor cores relates to parallel performance. Since I have 8 cores at my disposal, I thought I would be able to run 8 programs in parallel in the same time it takes to run just one. I'm not sure about my reasoning on this…
Answer
If you have 8 CPU threads available, and each of your programs consumes 100% of a single CPU, it does not make sense to run more than 8 programs at a time.
If your programs are multi-threaded, then you may want to have fewer than 8 processes running at a time. If your programs occasionally use less than 100% of a single CPU (perhaps if they’re waiting for IO), then you may want to run more than 8 processes at a time.
Even if the process limit for your user is extremely high, other resources could be exhausted much sooner – for instance, RAM. If you launch 200 processes and they exhaust RAM, then the operating system will respond by satisfying requests for RAM by swapping out some other process’s RAM to disk; and now the computer needlessly crawls to a halt because 200 processes are waiting on IO to get their memory back from disk, only to have it be written out again because some other process wants to run. This is called thrashing.
If your goal is to perform some batch computation, it does not make sense to load the computer any more than enough processes to keep all CPU cores at 100% utilization. Anything more is waste.
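One minimal way to keep exactly that many jobs running is to hand the batching over to `xargs` instead of fixed batches of 300 (a sketch, not the original `runall`; it assumes an `xargs` supporting `-P`, which both BSD/macOS and GNU versions do):

```shell
#!/bin/bash
# Cap concurrency at the number of logical CPUs. xargs starts a new
# job as soon as any job finishes, so cores stay busy even when run
# times vary - unlike fixed batches, which wait for the slowest job.
NCPU=$(getconf _NPROCESSORS_ONLN)
# Stand-in for "./simple {}": echo the job number.
seq 16 | xargs -P "$NCPU" -I{} sh -c 'echo "job {} done"'
```

With the scripts from the question, the last line would become `seq 600 | xargs -P "$NCPU" -I{} ./simple {}`.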
Edit – Clarification on terminology.
- A single computer can have more than one CPU socket.
- A single CPU can have more than one CPU core.
- A single CPU core can support simultaneous execution of more than one stream of instructions. Hyperthreading is an example of this.
- A stream of instructions is what we typically call a “thread”, either in the context of the operating system, processes, or in the CPU.
So I could have a computer with 2 sockets, with each socket containing a 4-core CPU, where each of those CPUs supports hyperthreading and thus supports two threads per core.
Such a computer could execute 2 * 4 * 2 = 16 threads simultaneously.
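On a real machine you can query these counts directly (a sketch; the `sysctl` names are the macOS ones used in the question, while `getconf` works on both macOS and Linux):

```shell
#!/bin/bash
# Logical CPUs (hardware execution threads) visible to the OS - portable:
getconf _NPROCESSORS_ONLN
# macOS-specific equivalents:
#   sysctl -n hw.logicalcpu    # logical CPUs (same value as hw.ncpu)
#   sysctl -n hw.physicalcpu   # physical cores
```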
A single process can have as many threads as it wants, until some resource is exhausted – raw RAM, internal operating system data structures, etc. Each process has at least one thread.
It’s important to note that tricks like hyperthreading may not scale performance linearly. An unhyperthreaded CPU core contains enough parts to execute a single stream of instructions all by itself; aside from memory access, it doesn’t share anything with the other cores, and so performance can scale linearly.
However, each core has a lot of parts – and during some types of computations, some of those parts are inactive while others are active; during other types of computations, the opposite may be true. Doing a lot of floating-point math? Then the integer math unit in the core might be idle. Doing a lot of integer math? Then the floating-point math unit might be idle.
Hyperthreading seeks to increase performance, even if only a little bit, by exploiting these temporarily unused units within a core: while the floating-point unit is busy, schedule something that can use the integer unit.
…
What matters to the operating system, when it comes to scheduling, is how many threads across all processes are runnable. If I have one process with 3 runnable threads, a second process with one runnable thread, and a third process with 10 runnable threads, then the OS will want to run a total of 3 + 1 + 10 = 14 threads.
If there are more runnable program threads than there are CPU execution threads, then the operating system will run as many as it can, and the others will sit there doing nothing, waiting. Meanwhile, those programs and those threads may have allocated a bunch of memory.
Let’s say I have a computer with 128 GB of RAM and CPU resources such that the hardware can execute a total of 16 threads at the same time. I have a program that uses 2 GB of memory to perform a simple simulation, that program creates only one thread to do its work, and each instance needs 100 s of CPU time to finish. What would happen if I were to try to run 16 instances of that program at the same time?
The programs would together allocate 16 * 2 GB = 32 GB of RAM to hold their state, and then begin performing their calculations. Since each program creates a single thread, and there are 16 CPU execution threads available, every program can run on the CPU without competing for CPU time. The total time we’d need to wait for the whole batch to finish would be 100 s: (16 processes / 16 CPU execution threads) * 100 s.
Now what if I increase that to 32 programs running at the same time? Well, we’ll allocate a total of 64GB of RAM, and at any one point in time, only 16 of them will be running. This is fine, nothing bad will happen because we’ve not exhausted RAM (and presumably any other resource), and the programs will all run efficiently and eventually finish. Runtime will be approximately twice as long at 200s.
Ok, now what happens if we try to run 128 programs at the same time? We’ll run out of memory: 128 * 2 GB = 256 GB of RAM, more than double what the hardware has. The operating system will respond by swapping memory to disk and reading it back in as needed, but it’ll have to do this very frequently, and it’ll have to wait for the disk.
If you had enough ram, this would run in 800s (128 / 16 * 100). Since you don’t, it’s very possible it could take an order of magnitude longer.
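The back-of-envelope arithmetic above can be sketched as a tiny shell calculation (a toy model: it ignores the swap overhead, which is exactly the cost that blows up the 128-program case):

```shell
#!/bin/bash
# Ideal batch wall time = ceil(jobs / execution_threads) * per_job_seconds
jobs=128; threads=16; per_job=100
batches=$(( (jobs + threads - 1) / threads ))   # ceiling division -> 8
echo $(( batches * per_job ))                   # ideal time in seconds: 800
```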