Our software builds a data structure in memory that is about 80 gigabytes in size. It can then either use this data structure directly for its computation, or dump it to disk so it can be reused several times afterwards. The computation performs a lot of random memory accesses into this data structure.
For larger inputs this data structure can grow even bigger (our largest so far was over 300 gigabytes), and our servers have enough memory to hold everything in RAM.
If the data structure is dumped to disk, it gets loaded back into the address space with mmap, forced into the OS page cache, and finally mlocked (code at the end).
The problem is that there is about a 16% performance difference between using the freshly computed data structure directly on the heap (see the malloc version) and mmapping the dumped file (see the mmap version). I don't have a good explanation for why this is the case. Is there a way to find out why mmap is so much slower? Can I close this performance gap somehow?
I did the measurements on a server running Scientific Linux 7.2 with a 3.10 kernel; it has 128 GB RAM (enough to fit everything). I repeated them several times with similar results. Sometimes the gap is a bit smaller, but not by much.
New Update (2017/05/23):
I produced a minimal test case where the effect can be seen. I tried the different flags (MAP_SHARED etc.) without success; the mmap version is still slower.
#include <random>
#include <iostream>
#include <sys/time.h>
#include <ctime>
#include <cstdio>
#include <cstdlib>
#include <omp.h>
#include <sys/mman.h>
#include <unistd.h>

constexpr size_t ipow(int base, int exponent) {
    size_t res = 1;
    for (int i = 0; i < exponent; i++) {
        res = res * base;
    }
    return res;
}

// Wall-clock time in milliseconds
size_t getTime() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    size_t ret = tv.tv_usec;
    ret /= 1000;
    ret += (tv.tv_sec * 1000);
    return ret;
}

const size_t N = 1000000000;
const size_t tableSize = ipow(21, 6);

// Random lookup positions into the data array
size_t* getOffset(std::mt19937 &generator) {
    // offsets must stay within data[0 .. N-1]
    std::uniform_int_distribution<size_t> distribution(0, N - 1);
    std::cout << "Offset Array" << std::endl;
    size_t r1 = getTime();
    size_t *offset = (size_t*) malloc(sizeof(size_t) * tableSize);
    for (size_t i = 0; i < tableSize; ++i) {
        offset[i] = distribution(generator);
    }
    size_t r2 = getTime();
    std::cout << (r2 - r1) << std::endl;
    return offset;
}

char* getData(std::mt19937 &generator) {
    // char is not a valid IntType for uniform_int_distribution, so draw ints and narrow
    std::uniform_int_distribution<int> datadist(1, 10);
    std::cout << "Data Array" << std::endl;
    size_t o1 = getTime();
    char *data = (char*) malloc(sizeof(char) * N);
    for (size_t i = 0; i < N; ++i) {
        data[i] = (char) datadist(generator);
    }
    size_t o2 = getTime();
    std::cout << (o2 - o1) << std::endl;
    return data;
}

template<typename T>
void dump(const char* filename, T* data, size_t count) {
    FILE *file = fopen(filename, "wb");
    fwrite(data, sizeof(T), count, file);
    fclose(file);
}

template<typename T>
T* read(const char* filename, size_t count) {
#ifdef MMAP
    FILE *file = fopen(filename, "rb");
    int fd = fileno(file);
    T *data = (T*) mmap(NULL, sizeof(T) * count, PROT_READ, MAP_SHARED | MAP_NORESERVE, fd, 0);
    // Touch one byte per page to pull everything into the page cache, then pin it
    size_t pageSize = sysconf(_SC_PAGE_SIZE);
    char bytes = 0;
    for (size_t i = 0; i < (sizeof(T) * count); i += pageSize) {
        bytes ^= ((char*)data)[i];
    }
    mlock(((char*)data), sizeof(T) * count);
    std::cout << bytes;
#else
    T* data = (T*) malloc(sizeof(T) * count);
    FILE *file = fopen(filename, "rb");
    fread(data, sizeof(T), count, file);
    fclose(file);
#endif
    return data;
}

int main(int argc, char** argv) {
#ifdef DATAGEN
    std::mt19937 generator(42);
    size_t *offset = getOffset(generator);
    dump<size_t>("offset.bin", offset, tableSize);
    char *data = getData(generator);
    dump<char>("data.bin", data, N);
#else
    size_t *offset = read<size_t>("offset.bin", tableSize);
    char *data = read<char>("data.bin", N);
#ifdef MADV
    posix_madvise(offset, sizeof(size_t) * tableSize, POSIX_MADV_SEQUENTIAL);
    posix_madvise(data, sizeof(char) * N, POSIX_MADV_RANDOM);
#endif
#endif
    const size_t R = 10;
    std::cout << "Computing" << std::endl;
    size_t t1 = getTime();
    size_t result = 0;
#pragma omp parallel reduction(+:result)
    {
        size_t magic = 0;
        for (size_t r = 0; r < R; ++r) {
#pragma omp for schedule(dynamic, 1000)
            for (size_t i = 0; i < tableSize; ++i) {
                char val = data[offset[i]];
                magic += val;
            }
        }
        result += magic;
    }
    size_t t2 = getTime();
    std::cout << result << "\t" << (t2 - t1) << std::endl;
}
Please excuse the C++; its random class is easier to use. I compiled it like this:
# The version that writes down the .bin files and also computes on the heap
g++ bench.cpp -fopenmp -std=c++14 -O3 -march=native -mtune=native -DDATAGEN
# The mmap version
g++ bench.cpp -fopenmp -std=c++14 -O3 -march=native -mtune=native -DMMAP
# The fread/heap version
g++ bench.cpp -fopenmp -std=c++14 -O3 -march=native -mtune=native
# For madvise add -DMADV
On this server I get the following times (ran all of the commands a few times):
./mmap                            2030 ms
./fread                           1350 ms
./mmap+madv                       2030 ms
./fread+madv                      1350 ms
numactl --cpunodebind=0 ./mmap    2600 ms
numactl --cpunodebind=0 ./fread   1500 ms
Answer
The malloc() back-end can make use of THP (Transparent Huge Pages), which is not possible when mmap() is backed by a file.
Using huge pages (even transparently) can drastically reduce the number of TLB misses while running your application.
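As an aside, huge pages can also be requested explicitly for anonymous memory. Below is a minimal sketch (not taken from the question's code; the function name is mine) that allocates a buffer and hints the kernel with MADV_HUGEPAGE, assuming THP is set to always or madvise:

#include <sys/mman.h>
#include <cstddef>

// Illustrative only: allocate n bytes of anonymous memory and hint the kernel
// to back the range with transparent huge pages. The kernel may ignore the hint
// (e.g. when THP is disabled or memory is too fragmented).
char* allocWithThpHint(size_t n) {
    void *p = mmap(NULL, n, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    madvise(p, n, MADV_HUGEPAGE);  // best-effort request for huge pages
    return (char*) p;
}

Whether the range actually received huge pages can be checked via the AnonHugePages counter in /proc/<pid>/smaps.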
An interesting test could be to disable transparent huge pages and run your malloc() test again:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
You could also measure TLB misses using perf:
perf stat -e dTLB-load-misses,iTLB-load-misses ./command
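If attaching perf from the outside is inconvenient, the same counter can also be read in-process. The following is a self-contained sketch (not part of the benchmark above) that uses perf_event_open to count roughly what perf stat reports as dTLB-load-misses around a region of interest; note that it only counts the calling thread, so for the OpenMP loop you would either open one counter per thread or stick with perf stat:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Open a counter for dTLB load misses on the calling thread (any CPU).
static int openDtlbMissCounter() {
    struct perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_DTLB
                | (PERF_COUNT_HW_CACHE_OP_READ << 8)
                | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main() {
    int fd = openDtlbMissCounter();
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... region of interest (e.g. the random-access loop) goes here ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    std::printf("dTLB load misses: %llu\n", (unsigned long long) misses);
    close(fd);
    return 0;
}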
For more information on THP, see: https://www.kernel.org/doc/Documentation/vm/transhuge.txt
People have been waiting a long time for a page cache that is huge-page capable, which would allow files to be mapped with huge pages (or a mix of huge pages and standard 4K pages). There are a bunch of articles on LWN about the transparent huge page cache, but it has not reached a production kernel yet.
Transparent huge pages in the page cache (May 2016): https://lwn.net/Articles/686690
There is also a presentation from January this year about the future of Linux page cache: https://youtube.com/watch?v=xxWaa-lPR-8
Additionally, you can avoid the manual page-touching loop and the mlock() call in your mmap() implementation by using the MAP_LOCKED flag.
If you are not privileged, this may require adjusting the memlock limit.
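For illustration, here is a sketch of how the MMAP branch of the question's read() could look with MAP_LOCKED; adding MAP_POPULATE to prefault the mapping is an extra assumption on my part, so the explicit page-touch loop is no longer needed:

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Sketch: map the file read-only, prefault it (MAP_POPULATE) and pin it (MAP_LOCKED)
// in one call, replacing the manual page-touch loop and the later mlock().
template<typename T>
T* readLocked(const char* filename, size_t count) {
    FILE *file = fopen(filename, "rb");
    if (!file) return NULL;
    T *data = (T*) mmap(NULL, sizeof(T) * count, PROT_READ,
                        MAP_SHARED | MAP_NORESERVE | MAP_POPULATE | MAP_LOCKED,
                        fileno(file), 0);
    fclose(file);  // the mapping stays valid after the file is closed
    return data == MAP_FAILED ? NULL : data;
}

Without privileges this mmap() fails unless the memlock limit is raised, e.g. with ulimit -l or a matching setrlimit(RLIMIT_MEMLOCK, ...) call.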