I have a Python 3 script that operates on numpy.memmap arrays. It writes an array to a newly generated temporary file located in /tmp:
import numpy, tempfile

size = 2 ** 37 * 10
tmp = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp.name, dtype='i8', mode='w+', shape=size)
array[0] = 666
array[size-1] = 777
del array

array2 = numpy.memmap(tmp.name, dtype='i8', mode='r+', shape=size)
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.
      format(tmp.name, len(array2), array2[0], array2[size-1]))

while True:
    pass
The size of the HDD is only 250G. Nevertheless, the script can somehow generate 10T files in /tmp, and the corresponding array still seems to be accessible. The output of the script is the following:
File: /tmp/tmptjfwy8nr. Array size: 1374389534720. First cell value: 666. Last cell value: 777
The file really exists and is displayed as being 10T large:
$ ls -l /tmp/tmptjfwy8nr
-rw------- 1 user user 10995116277760 Dec 1 15:50 /tmp/tmptjfwy8nr
However, the total size of /tmp is much smaller:
$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       235G  5.3G  218G   3% /
The process also appears to be using 10T of virtual memory, which is likewise impossible. The output of the top command:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
31622 user      20   0 10.000t  16592   4600 R 100.0  0.0   0:45.63 python3
As far as I understand, this means that the memory needed for the whole array is not allocated during the call to numpy.memmap, and the displayed file size is therefore bogus. This in turn means that when I start gradually filling the whole array with my data, at some point my program will crash or my data will be corrupted.
Indeed, if I introduce the following in my code:
for i in range(size):
    array[i] = i
I get the error after a while:
Bus error (core dumped)
Therefore, the question: how can I check at the beginning whether there is really enough memory for the data, and then actually reserve the space for the whole array?
Answer
There’s nothing ‘bogus’ about the fact that you are generating 10 TB files
You are asking for arrays of size
2 ** 37 * 10 = 1374389534720 elements
A dtype of 'i8'
means an 8-byte (64-bit) integer, so your final array will have a size of
1374389534720 * 8 = 10995116277760 bytes
or
10995116277760 / 1E12 = 10.99511627776 TB
If you only have 250 GB of free disk space then how are you able to create a “10 TB” file?
Assuming that you are using a reasonably modern filesystem, your OS will be capable of generating almost arbitrarily large sparse files, regardless of whether or not you actually have enough physical disk space to back them.
For example, on my Linux machine I’m allowed to do something like this:
# I only have about 50GB of free space...
~$ df -h /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/sdb1      ext4  459G  383G   53G  88% /

~$ dd if=/dev/zero of=sparsefile bs=1 count=0 seek=10T
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000236933 s, 0.0 kB/s

# ...but I can still generate a sparse file that reports its size as 10 TB
~$ ls -lah sparsefile
-rw-rw-r-- 1 alistair alistair 10T Dec 1 21:17 sparsefile

# however, this file uses zero bytes of "actual" disk space
~$ du -h sparsefile
0	sparsefile
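The same effect is easy to reproduce from Python. Here's a minimal sketch (the file name is just an example, and it assumes a filesystem that supports sparse files) that creates a huge sparse file simply by truncating it to the desired size:

import os

fname = '/tmp/sparse_demo'     # hypothetical path

with open(fname, 'wb') as f:
    f.truncate(10 * 2 ** 40)   # set the apparent size to 10 TiB without writing any data

print(os.path.getsize(fname))  # 10995116277760 -- the size that ls -l reports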
Try calling du -h on your np.memmap file after it has been initialized to see how much actual disk space it uses.
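If you would rather do that check programmatically, a small sketch along these lines should work on Linux, where st_blocks reports the number of 512-byte blocks actually allocated to the file (the example file name is hypothetical):

import os
import numpy as np

def allocated_bytes(fname):
    # st_blocks is the number of 512-byte blocks actually allocated on disk (Linux)
    return os.stat(fname).st_blocks * 512

a = np.memmap('example.mmap', dtype='i8', mode='w+', shape=2 ** 20)
print(os.path.getsize('example.mmap'))   # 8388608 -- the apparent size
print(allocated_bytes('example.mmap'))   # only a few KB until pages are actually written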
As you start actually writing data to your np.memmap file, everything will be OK until you exceed the physical capacity of your storage, at which point the process will terminate with a Bus error. This means that if you need to write < 250GB of data to your np.memmap array there might be no problem (in practice this would probably also depend on where you are writing within the array, and on whether it is row- or column-major).
How is it possible for a process to use 10 TB of virtual memory?
When you create a memory map, the kernel allocates a new block of addresses within the virtual address space of the calling process and maps them to a file on your disk. The amount of virtual memory that your Python process is using will therefore increase by the size of the file that has just been created. Since the file can also be sparse, then not only can the virtual memory exceed the total amount of RAM available, but it can also exceed the total physical disk space on your machine.
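You can watch this happen on Linux by reading VmSize from /proc/self/status before and after creating a mapping; in the sketch below (the file path is just an example) the process's virtual size grows by roughly 1 GiB after mapping a 1 GiB sparse file, even though no RAM or disk space is consumed:

import mmap

def vm_size_kb():
    # Parse the VmSize line from /proc/self/status (reported in kB)
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmSize:'):
                return int(line.split()[1])

print(vm_size_kb())                      # baseline virtual size of the interpreter

with open('/tmp/vm_demo', 'w+b') as f:   # hypothetical path
    f.truncate(1 << 30)                  # 1 GiB sparse file
    m = mmap.mmap(f.fileno(), 0)         # map the whole file into the address space
    print(vm_size_kb())                  # roughly 1048576 kB larger than before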
How can you check whether you have enough disk space to store the full np.memmap array?
I’m assuming that you want to do this programmatically in Python.
1. Get the amount of free disk space available. There are various methods given in the answers to this previous SO question. One option is os.statvfs:

import os

def get_free_bytes(path='/'):
    st = os.statvfs(path)
    return st.f_bavail * st.f_bsize

print(get_free_bytes())
# 56224485376
2. Work out the size of your array in bytes:

import numpy as np

def check_asize_bytes(shape, dtype):
    return np.prod(shape) * np.dtype(dtype).itemsize

print(check_asize_bytes((2 ** 37 * 10,), 'i8'))
# 10995116277760
3. Check whether 2. > 1.
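Putting the steps together, a minimal sketch of such a check might look like the following; it uses shutil.disk_usage (standard library since Python 3.3) rather than os.statvfs, but either works:

import shutil
import numpy as np

def enough_space(shape, dtype, path='/'):
    needed = int(np.prod(shape)) * np.dtype(dtype).itemsize
    # shutil.disk_usage returns a (total, used, free) named tuple of byte counts
    return needed <= shutil.disk_usage(path).free

print(enough_space((2 ** 37 * 10,), 'i8', path='/tmp'))   # False on a 250 GB disk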
Update: Is there a ‘safe’ way to allocate an np.memmap file that guarantees sufficient disk space is reserved to store the full array?
One possibility might be to use fallocate to pre-allocate the disk space, e.g.:
~$ fallocate -l 1G bigfile
~$ du -h bigfile
1.1G	bigfile
You could call this from Python, for example using subprocess.check_call:
import subprocess

def fallocate(fname, length):
    return subprocess.check_call(['fallocate', '-l', str(length), fname])

def safe_memmap_alloc(fname, dtype, shape, *args, **kwargs):
    nbytes = np.prod(shape) * np.dtype(dtype).itemsize
    fallocate(fname, nbytes)
    return np.memmap(fname, dtype, *args, shape=shape, **kwargs)

mmap = safe_memmap_alloc('test.mmap', np.int64, (1024, 1024))

print(mmap.nbytes / 1E6)
# 8.388608

print(subprocess.check_output(['du', '-h', 'test.mmap']))
# 8.0M	test.mmap
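A nice property of this approach is that if there isn't enough free space, fallocate should fail up front (typically with "No space left on device" on filesystems that support it), so check_call raises subprocess.CalledProcessError at allocation time rather than your process dying with a Bus error partway through writing.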
I’m not aware of a platform-independent way to do this using the standard library, but there is a fallocate Python module on PyPI that should work for any POSIX-based OS.
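On POSIX systems the standard library does expose os.posix_fallocate (Python 3.3+), which avoids shelling out to the fallocate binary; a sketch of the same idea (the function name is my own):

import os
import numpy as np

def safe_memmap_alloc_posix(fname, dtype, shape, *args, **kwargs):
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    fd = os.open(fname, os.O_RDWR | os.O_CREAT)
    try:
        # Raises OSError (ENOSPC) if the space cannot actually be reserved
        os.posix_fallocate(fd, 0, nbytes)
    finally:
        os.close(fd)
    return np.memmap(fname, dtype, *args, shape=shape, **kwargs)

mmap2 = safe_memmap_alloc_posix('test2.mmap', np.int64, (1024, 1024), mode='r+')
print(mmap2.nbytes / 1E6)   # 8.388608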