
Random mmapped memory access up to 16% slower than heap data access

Our software builds a data structure in memory that is about 80 gigabytes in size. It can then either use this data structure directly to do its computation, or dump it to disk so it can be reused several times afterwards. A lot of random memory accesses happen in this data structure.

For larger inputs this data structure can grow even bigger (our largest one so far was over 300 gigabytes), and our servers have enough memory to hold everything in RAM.

If the data structure is dumped to disk, it gets loaded back into the address space with mmap, forced into the OS page cache, and lastly mlocked (code at the end).
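In outline, that load path does something like the following sketch (file name, error handling and the exact prefaulting strategy are assumptions; note that every page is mlocked individually, which the answer below picks up on):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

// Map the dumped file, pull every page into the page cache, and pin it in RAM.
void* load_dump(const char* path, size_t& size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return nullptr; }
    struct stat st;
    fstat(fd, &st);
    size = st.st_size;

    void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) { perror("mmap"); return nullptr; }

    const size_t page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < size; off += page) {
        // Touching the page faults it into the page cache; mlock() then pins it.
        volatile char c = static_cast<const char*>(p)[off];
        (void)c;
        mlock(static_cast<const char*>(p) + off, page);
    }
    return p;
}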

The problem is that there is about a 16% difference in performance between just using the computed data structure immediately on the heap (see the malloc version) and mmapping the dumped file (see the mmap version). I don't have a good explanation for why this is the case. Is there a way to find out why mmap is so much slower? Can I close this performance gap somehow?

I did the measurements on a server running Scientific Linux 7.2 with a 3.10 kernel; it has 128 GB of RAM, which is enough to fit everything. I repeated the measurements several times with similar results. Sometimes the gap is a bit smaller, but not by much.

New Update (2017/05/23):

I produced a minimal test case where the effect can be seen. I tried different flags (MAP_SHARED etc.) without success; the mmap version is still slower.

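As a rough illustration of the shape of such a test (sizes, file name, and the use of a single mlock() call are assumptions, not the original listing), a malloc-vs-mmap random-read benchmark could look like this:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <random>
#include <sys/mman.h>
#include <unistd.h>

// Illustrative sizes only -- the real data structure is tens of gigabytes.
static const size_t kBytes = 1ull << 32;             // 4 GiB of uint64_t
static const size_t kCount = kBytes / sizeof(uint64_t);
static const size_t kReads = 100000000;              // random reads to time

static void time_random_reads(const uint64_t* data) {
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<size_t> idx(0, kCount - 1);
    uint64_t sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < kReads; ++i)
        sum += data[idx(rng)];
    auto stop = std::chrono::steady_clock::now();
    std::printf("%.2f s (checksum %llu)\n",
                std::chrono::duration<double>(stop - start).count(),
                (unsigned long long)sum);
}

int main(int argc, char** argv) {
    if (argc > 1 && std::strcmp(argv[1], "mmap") == 0) {
        // mmap version: map the previously dumped file, fault it in and lock it.
        int fd = open("dump.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        void* p = mmap(nullptr, kBytes, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);
        if (mlock(p, kBytes) != 0) perror("mlock");   // also faults the pages in
        time_random_reads(static_cast<const uint64_t*>(p));
    } else {
        // malloc version: build the data in anonymous heap memory.
        uint64_t* data = static_cast<uint64_t*>(malloc(kBytes));
        if (!data) { perror("malloc"); return 1; }
        for (size_t i = 0; i < kCount; ++i) data[i] = i;
        // To create dump.bin for the mmap run, write the buffer out here, e.g.:
        // FILE* f = fopen("dump.bin", "wb"); fwrite(data, 1, kBytes, f); fclose(f);
        time_random_reads(data);
        free(data);
    }
    return 0;
}

Run it once without arguments for the malloc version, then with the argument mmap for the mmap version.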

Please excuse the C++; its random class is easier to use. I compiled it like this:

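(A representative invocation; the exact flags are an assumption, any optimized g++ build will do for this kind of test.)

g++ -O2 -std=c++11 bench.cpp -o bench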

On this server the mmap version is consistently slower than the malloc version (I ran all of the commands a few times).



Answer

The malloc() back end can make use of THP (Transparent Huge Pages), which is not possible when using mmap() backed by a file.

Using huge pages (even transparently) can drastically reduce the number of TLB misses while running your application.
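To check whether the malloc() version is actually being backed by huge pages, you can watch the AnonHugePages counters while it runs (./command stands for your binary, as in the perf example below):

grep AnonHugePages /proc/meminfo
grep AnonHugePages /proc/$(pidof command)/smaps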

An interesting test could be to disable transparent hugepages and run your malloc() test again:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

You could also measure TLB misses using perf:

perf stat -e dTLB-load-misses,iTLB-load-misses ./command

For more info on THP, please see: https://www.kernel.org/doc/Documentation/vm/transhuge.txt

People have been waiting a long time for a page cache that is huge-page capable, which would allow mapping files using huge pages (or a mix of huge pages and standard 4K pages). There are a bunch of articles on LWN about the transparent huge page cache, but it has not reached the production kernel yet.

Transparent huge pages in the page cache (May 2016): https://lwn.net/Articles/686690

There is also a presentation from January this year about the future of the Linux page cache: https://youtube.com/watch?v=xxWaa-lPR-8

Additionally, you can avoid all those calls to mlock on individual pages in your mmap() implementation by using the MAP_LOCKED flag. If you are not privileged, this may require adjusting the memlock limit.
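A sketch of that variant (file name and error handling are illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("dump.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    // MAP_LOCKED faults the pages in and locks them in one call, replacing
    // the page-touching loop and the per-page mlock() calls.
    void* p = mmap(nullptr, st.st_size, PROT_READ,
                   MAP_PRIVATE | MAP_LOCKED, fd, 0);
    if (p == MAP_FAILED) perror("mmap");  // may fail if the memlock limit is too low
    close(fd);
    // ... use the mapping ...
    return 0;
}

Without CAP_IPC_LOCK, the amount of memory a process can lock is bounded by RLIMIT_MEMLOCK, so the limit may need to be raised first (for example with ulimit -l, or system-wide in /etc/security/limits.conf).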

User contributions licensed under: CC BY-SA