Our software builds a data structure in memory that is about 80 gigabytes in size. It can then either use this data structure directly for its computation, or dump it to disk so it can be reused several times afterwards. The computation performs a lot of random memory accesses into this data structure.
For larger inputs this data structure can grow even bigger (our largest so far was over 300 gigabytes), and our servers have enough memory to hold everything in RAM.
If the data structure is dumped to disk, it gets loaded back into the address space with mmap, forced into the OS page cache, and finally mlocked (code at the end).
The problem is that there is about a 16% performance difference between using the freshly computed data structure directly on the heap (see the malloc version) and mmapping the dumped file (see the mmap version). I don't have a good explanation for why this is the case. Is there a way to find out why mmap is so much slower? Can I close this performance gap somehow?
I did the measurements on a server running Scientific Linux 7.2 with a 3.10 kernel; it has 128 GB RAM (enough to fit everything). I repeated them several times with similar results. Sometimes the gap is a bit smaller, but not by much.
New Update (2017/05/23):
I produced a minimal test case where the effect can be seen. I tried the different flags (MAP_SHARED etc.) without success; the mmap version is still slower.
#include <random>
#include <iostream>
#include <sys/time.h>
#include <ctime>
#include <cstdio>
#include <cstdlib>
#include <omp.h>
#include <sys/mman.h>
#include <unistd.h>

constexpr size_t ipow(int base, int exponent) {
    size_t res = 1;
    for (int i = 0; i < exponent; i++) {
        res = res * base;
    }
    return res;
}

// Wall-clock time in milliseconds
size_t getTime() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    size_t ret = tv.tv_usec;
    ret /= 1000;
    ret += (tv.tv_sec * 1000);
    return ret;
}

const size_t N = 1000000000;
const size_t tableSize = ipow(21, 6);

// Random lookup positions into the data array
size_t* getOffset(std::mt19937 &generator) {
    // offsets must stay within data[0 .. N-1]
    std::uniform_int_distribution<size_t> distribution(0, N - 1);
    std::cout << "Offset Array" << std::endl;
    size_t r1 = getTime();
    size_t *offset = (size_t*) malloc(sizeof(size_t) * tableSize);
    for (size_t i = 0; i < tableSize; ++i) {
        offset[i] = distribution(generator);
    }
    size_t r2 = getTime();
    std::cout << (r2 - r1) << std::endl;
    return offset;
}

char* getData(std::mt19937 &generator) {
    // char is not a valid IntType for uniform_int_distribution, so draw ints and narrow
    std::uniform_int_distribution<int> datadist(1, 10);
    std::cout << "Data Array" << std::endl;
    size_t o1 = getTime();
    char *data = (char*) malloc(sizeof(char) * N);
    for (size_t i = 0; i < N; ++i) {
        data[i] = (char) datadist(generator);
    }
    size_t o2 = getTime();
    std::cout << (o2 - o1) << std::endl;
    return data;
}

template<typename T>
void dump(const char* filename, T* data, size_t count) {
    FILE *file = fopen(filename, "wb");
    fwrite(data, sizeof(T), count, file);
    fclose(file);
}

template<typename T>
T* read(const char* filename, size_t count) {
#ifdef MMAP
    FILE *file = fopen(filename, "rb");
    int fd = fileno(file);
    T *data = (T*) mmap(NULL, sizeof(T) * count, PROT_READ, MAP_SHARED | MAP_NORESERVE, fd, 0);
    // Touch one byte per page to pull everything into the page cache, then pin it
    size_t pageSize = sysconf(_SC_PAGE_SIZE);
    char bytes = 0;
    for (size_t i = 0; i < (sizeof(T) * count); i += pageSize) {
        bytes ^= ((char*)data)[i];
    }
    mlock(((char*)data), sizeof(T) * count);
    std::cout << bytes;
#else
    T* data = (T*) malloc(sizeof(T) * count);
    FILE *file = fopen(filename, "rb");
    fread(data, sizeof(T), count, file);
    fclose(file);
#endif
    return data;
}

int main(int argc, char** argv) {
#ifdef DATAGEN
    std::mt19937 generator(42);
    size_t *offset = getOffset(generator);
    dump<size_t>("offset.bin", offset, tableSize);
    char *data = getData(generator);
    dump<char>("data.bin", data, N);
#else
    size_t *offset = read<size_t>("offset.bin", tableSize);
    char *data = read<char>("data.bin", N);
#ifdef MADV
    posix_madvise(offset, sizeof(size_t) * tableSize, POSIX_MADV_SEQUENTIAL);
    posix_madvise(data, sizeof(char) * N, POSIX_MADV_RANDOM);
#endif
#endif
    const size_t R = 10;
    std::cout << "Computing" << std::endl;
    size_t t1 = getTime();
    size_t result = 0;
#pragma omp parallel reduction(+:result)
    {
        size_t magic = 0;
        for (size_t r = 0; r < R; ++r) {
#pragma omp for schedule(dynamic, 1000)
            for (size_t i = 0; i < tableSize; ++i) {
                char val = data[offset[i]];
                magic += val;
            }
        }
        result += magic;
    }
    size_t t2 = getTime();
    std::cout << result << "\t" << (t2 - t1) << std::endl;
}
Please excuse the C++; its random class is easier to use. I compiled it like this:
# The version that writes down the .bin files and also computes on the heap
g++ bench.cpp -fopenmp -std=c++14 -O3 -march=native -mtune=native -DDATAGEN
# The mmap version
g++ bench.cpp -fopenmp -std=c++14 -O3 -march=native -mtune=native -DMMAP
# The fread/heap version
g++ bench.cpp -fopenmp -std=c++14 -O3 -march=native -mtune=native
# For madvise add -DMADV
On this server I get the following times (ran all of the commands a few times):
./mmap                            2030 ms
./fread                           1350 ms
./mmap+madv                       2030 ms
./fread+madv                      1350 ms
numactl --cpunodebind=0 ./mmap    2600 ms
numactl --cpunodebind=0 ./fread   1500 ms
Answer
The malloc() back-end can make use of THP (Transparent Huge Pages), which is not possible when mmap() is backed by a file.
Using huge pages (even transparently) can drastically reduce the number of TLB misses while running your application.
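As an aside, huge pages can also be requested explicitly for anonymous memory. Below is a minimal sketch (not taken from the question's code; the function name is mine) that allocates a buffer and hints the kernel with MADV_HUGEPAGE, assuming THP is set to always or madvise:

#include <sys/mman.h>
#include <cstddef>

// Illustrative only: allocate n bytes of anonymous memory and hint the kernel
// to back the range with transparent huge pages. The kernel may ignore the hint
// (e.g. when THP is disabled or memory is too fragmented).
char* allocWithThpHint(size_t n) {
    void *p = mmap(NULL, n, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    madvise(p, n, MADV_HUGEPAGE);  // best-effort request for huge pages
    return (char*) p;
}

Whether the range actually received huge pages can be checked via the AnonHugePages counter in /proc/<pid>/smaps.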
An interesting test could be to disable transparent huge pages and run your malloc() test again:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
You could also measure TLB misses using perf:
perf stat -e dTLB-load-misses,iTLB-load-misses ./command
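If attaching perf from the outside is inconvenient, the same counter can also be read in-process. The following is a self-contained sketch (not part of the benchmark above) that uses perf_event_open to count roughly what perf stat reports as dTLB-load-misses around a region of interest; note that it only counts the calling thread, so for the OpenMP loop you would either open one counter per thread or stick with perf stat:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Open a counter for dTLB load misses on the calling thread (any CPU).
static int openDtlbMissCounter() {
    struct perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_DTLB
                | (PERF_COUNT_HW_CACHE_OP_READ << 8)
                | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main() {
    int fd = openDtlbMissCounter();
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... region of interest (e.g. the random-access loop) goes here ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    std::printf("dTLB load misses: %llu\n", (unsigned long long) misses);
    close(fd);
    return 0;
}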
For more information on THP, see: https://www.kernel.org/doc/Documentation/vm/transhuge.txt
People have been waiting a long time for a page cache that is huge-page capable, which would allow files to be mapped with huge pages (or a mix of huge pages and standard 4K pages). There are a bunch of articles on LWN about the transparent huge page cache, but it has not reached a production kernel yet.
Transparent huge pages in the page cache (May 2016): https://lwn.net/Articles/686690
There is also a presentation from January this year about the future of Linux page cache: https://youtube.com/watch?v=xxWaa-lPR-8
Additionally, you can avoid the manual page-touching loop and the mlock() call in your mmap() implementation by using the MAP_LOCKED flag.
If you are not privileged, this may require adjusting the memlock limit.
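For illustration, here is a sketch of how the MMAP branch of the question's read() could look with MAP_LOCKED; adding MAP_POPULATE to prefault the mapping is an extra assumption on my part, so the explicit page-touch loop is no longer needed:

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Sketch: map the file read-only, prefault it (MAP_POPULATE) and pin it (MAP_LOCKED)
// in one call, replacing the manual page-touch loop and the later mlock().
template<typename T>
T* readLocked(const char* filename, size_t count) {
    FILE *file = fopen(filename, "rb");
    if (!file) return NULL;
    T *data = (T*) mmap(NULL, sizeof(T) * count, PROT_READ,
                        MAP_SHARED | MAP_NORESERVE | MAP_POPULATE | MAP_LOCKED,
                        fileno(file), 0);
    fclose(file);  // the mapping stays valid after the file is closed
    return data == MAP_FAILED ? NULL : data;
}

Without privileges this mmap() fails unless the memlock limit is raised, e.g. with ulimit -l or a matching setrlimit(RLIMIT_MEMLOCK, ...) call.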