mmap: performance when using multithreading

Question

I have a program which performs some operations on a lot of files (> 10 000). It spawns N worker threads and each thread mmaps some file, does some work and munmaps it. The problem I am facing right now is that whenever I use just 1 process with N worker threads, it has worse performance than spawning 2 pr…

Accepted Answer

"Is there anything about mmap() I am not aware of when used in multithreaded environment?"

Yes. mmap() requires significant virtual memory manipulation, which effectively single-threads your process in places. Per this post from one Linus Torvalds:

"... playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.

Downsides to mmap:

- quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
- page faulting is expensive. That's how the mapping gets populated, and it's quite slow."

Note that much of the above also has to be single-threaded across the entire machine, such as the actual mapping of physical memory.

So the virtual memory manipulations that mapping files requires are not only expensive, they really can't be done in parallel: there's only one chunk of actual physical memory that the kernel has to keep track of, and multiple threads can't parallelize changes to a process's virtual address space (on Linux, mmap() and munmap() take a per-process lock on the address space while they modify it).

You'd almost certainly get better performance reusing a memory buffer for each file, where each buffer is created once and is large enough to hold any file read into it, then reading from the file using low-level POSIX read() call(s).
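The buffer-reuse idea can be sketched roughly as follows. This is only an illustration in Python (chosen for brevity; in C the same loop would be plain read() calls), and the names read_into, process_files, buf_size and work are hypothetical, not from the question. It assumes a Unix-like system, since os.readv wraps the POSIX readv() call, and that buf_size is at least as large as the biggest file; larger files would simply be truncated to the buffer size here.

```python
import os


def read_into(fd, buf):
    """Fill buf from fd with read()-style calls; return the number of bytes read."""
    view = memoryview(buf)
    total = 0
    while total < len(view):
        # os.readv writes directly into the existing buffer, so no new
        # memory is allocated or mapped per file.
        n = os.readv(fd, [view[total:]])
        if n == 0:  # end of file
            break
        total += n
    return total


def process_files(paths, buf_size, work):
    """Process every file through one buffer that is allocated exactly once."""
    buf = bytearray(buf_size)  # assumption: large enough for the biggest file
    results = []
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            n = read_into(fd, buf)
        finally:
            os.close(fd)
        # work() sees only the valid part of the buffer and must not keep
        # a reference to it, because the next file overwrites the contents.
        results.append(work(memoryview(buf)[:n]))
    return results
```

Each worker thread would own one such buffer for its whole lifetime, so the per-file cost is a copy out of the page cache rather than a map/unmap cycle.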
You might want to experiment with page-aligned buffers and direct I/O, by calling open() with the O_DIRECT flag (Linux-specific), to bypass the page cache: you apparently never re-read any data, so any caching is a waste of memory and CPU cycles.

Reusing the buffer also completely eliminates any munmap() or delete/free() calls.

You'd have to manage the buffers, though. Perhaps prepopulate a queue with N precreated buffers, and return a buffer to the queue when done with a file?

As for "If so, why do 2 processes have better performance?":

The use of two processes splits the process-specific virtual memory manipulations caused by mmap() calls into two separable sets that can run in parallel.
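The queue of precreated buffers suggested above might look something like this. It is a minimal sketch in Python under stated assumptions, not a tuned implementation: run_workers, n_workers, buf_size and work are hypothetical names, os.readv requires a Unix-like system, and O_DIRECT is deliberately left out because it needs aligned buffers and filesystem support. Note also that CPython only runs the read system calls in parallel, not the Python-level work; a C version would use the same structure with read().

```python
import os
import queue
import threading


def run_workers(paths, n_workers, buf_size, work):
    """Process files with n_workers threads sharing a pool of reusable buffers."""
    pool = queue.Queue()
    for _ in range(n_workers):
        pool.put(bytearray(buf_size))  # every buffer is created exactly once

    todo = queue.Queue()
    for path in paths:
        todo.put(path)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                path = todo.get_nowait()
            except queue.Empty:
                return  # no files left
            buf = pool.get()  # borrow a buffer from the pool
            try:
                view = memoryview(buf)
                total = 0
                fd = os.open(path, os.O_RDONLY)
                try:
                    while total < len(view):
                        n = os.readv(fd, [view[total:]])
                        if n == 0:  # end of file
                            break
                        total += n
                finally:
                    os.close(fd)
                out = work(view[:total])
            finally:
                pool.put(buf)  # hand the buffer back once done with the file
            with results_lock:
                results[path] = out

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With as many buffers as workers, pool.get() never blocks; a smaller pool would cap memory use at the cost of workers occasionally waiting for a free buffer.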

