numpy.memmap: bogus memory allocation

Question

I have a Python 3 script that operates on numpy.memmap arrays. It writes an array to a newly generated temporary file located in /tmp: The size of the HDD is only 250 GB. Nevertheless, the script can somehow generate 10 TB files in /tmp, and the corresponding array still seems to be accessible. The output of the script is the following: The file

Accepted Answer

There’s nothing ‘bogus’ about the fact that you are generating 10 TB files

You are asking for arrays of size

    2 ** 37 * 10 = 1374389534720 elements

A dtype of `'i8'` means an 8 byte (64 bit) integer, therefore your final array will have a size of

    1374389534720 * 8 = 10995116277760 bytes

or

    10995116277760 / 1E12 = 10.99511627776 TB

If you only have 250 GB of free disk space then how are you able to create a “10 TB” file?

Assuming that you are using a reasonably modern filesystem, your OS will be capable of generating almost arbitrarily large sparse files, regardless of whether or not you actually have enough physical disk space to back them.

For example, on my Linux machine I’m allowed to do something like this:

    # I only have about 50GB of free space...
    ~$ df -h /
    Filesystem     Type  Size  Used Avail Use% Mounted on
    /dev/sdb1      ext4  459G  383G   53G  88% /

    ~$ dd if=/dev/zero of=sparsefile bs=1 count=0 seek=10T
    0+0 records in
    0+0 records out
    0 bytes (0 B) copied, 0.000236933 s, 0.0 kB/s

    # ...but I can still generate a sparse file that reports its size as 10 TB
    ~$ ls -lah sparsefile
    -rw-rw-r-- 1 alistair alistair 10T Dec  1 21:17 sparsefile

    # however, this file uses zero bytes of "actual" disk space
    ~$ du -h sparsefile
    0       sparsefile

Try calling `du -h` on your `np.memmap` file after it has been initialized to see how much actual disk space it uses.

As you start actually writing data to your `np.memmap` file, everything will be OK until you exceed the physical capacity of your storage, at which point the process will terminate with a `Bus error`.
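The gap between a file's reported size and the space it really occupies can also be inspected from Python with the standard library. The following is only a minimal sketch, not part of the original answer: the helper name `sparse_file_sizes` is hypothetical, and it assumes a Unix-like system whose filesystem supports sparse files and reports `st_blocks` in 512-byte units (as Linux does).

```python
import os
import tempfile


def sparse_file_sizes(nbytes):
    """Create a temporary file of apparent size nbytes without writing data.

    Returns (apparent_size, allocated_bytes) as reported by os.fstat.
    """
    fd, path = tempfile.mkstemp()
    try:
        # Extending a file with ftruncate leaves a "hole" on filesystems
        # that support sparse files, so no data blocks are written.
        os.ftruncate(fd, nbytes)
        st = os.fstat(fd)
        # Assumption: st_blocks is counted in 512-byte units (true on Linux).
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.remove(path)


apparent, allocated = sparse_file_sizes(256 * 2**20)  # 256 MiB apparent size
print(apparent, allocated)
```

On a filesystem with sparse-file support, the second number should be far smaller than the first, which mirrors the difference between `ls -l` and `du` shown above.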
Because the process only fails once the physical storage is exhausted, there might be no problem if you need to write less than 250 GB of data to your `np.memmap` array (in practice this would probably also depend on where you are writing within the array, and on whether it is row or column major).

How is it possible for a process to use 10 TB of virtual memory?

When you create a memory map, the kernel allocates a new block of addresses within the virtual address space of the calling process and maps them to a file on your disk. The amount of virtual memory that your Python process is using will therefore increase by the size of the file that has just been created. Since the file can also be sparse, not only can the virtual memory exceed the total amount of RAM available, but it can also exceed the total physical disk space on your machine.

How can you check whether you have enough disk space to store the full `np.memmap` array?

I’m assuming that you want to do this programmatically in Python.

1. Get the amount of free disk space available. There are various methods given in the answers to this previous SO question. One option is `os.statvfs`:

       import os

       def get_free_bytes(path='/'):
           st = os.statvfs(path)
           # blocks available to unprivileged users * block size
           return st.f_bavail * st.f_bsize

       print(get_free_bytes())
       # 56224485376

2. Work out the size of your array in bytes:

       import numpy as np

       def check_asize_bytes(shape, dtype):
           return np.prod(shape) * np.dtype(dtype).itemsize

       print(check_asize_bytes((2 ** 37 * 10,), 'i8'))
       # 10995116277760

3. Check whether the result of step 2 is greater than the result of step 1.

Update: Is there a ‘safe’ way to allocate an `np.memmap` file, which guarantees that sufficient disk space is reserved to store the full array?

One possibility might be to use `fallocate` to pre-allocate the disk space, e.g.:

    ~$ fallocate -l 1G bigfile
    ~$ du -h bigfile
    1.1G    bigfile

You could call this from Python, for example using `subprocess.check_call`:

    import subprocess
    import numpy as np

    def fallocate(fname, length):
        return subprocess.check_call(['fallocate', '-l', str(length), fname])

    def safe_memmap_alloc(fname, dtype, shape, *args, **kwargs):
        nbytes = np.prod(shape) * np.dtype(dtype).itemsize
        # reserve the full size on disk before mapping the file
        fallocate(fname, nbytes)
        return np.memmap(fname, dtype, *args, shape=shape, **kwargs)

    mmap = safe_memmap_alloc('test.mmap', np.int64, (1024, 1024))

    print(mmap.nbytes / 1E6)
    # 8.388608

    # check_output returns bytes under Python 3, so decode before printing
    print(subprocess.check_output(['du', '-h', 'test.mmap']).decode())
    # 8.0M    test.mmap

I’m not aware of a platform-independent way to do this using the standard library, but there is a `fallocate` Python module on PyPI that should work for any Posix-based OS.
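On Unix platforms where it is available, Python 3.3+ also exposes `os.posix_fallocate` in the standard library, which avoids shelling out to the `fallocate` binary. It is not platform-independent (it is missing on Windows, for example), so the block below is a hedged sketch rather than a portable solution. The helper name `preallocate` and the up-front free-space check with `shutil.disk_usage` are illustrative assumptions, not part of the original answer.

```python
import os
import shutil
import tempfile


def preallocate(path, nbytes):
    """Reserve nbytes of real disk space for path; raise OSError if impossible."""
    directory = os.path.dirname(os.path.abspath(path))
    free = shutil.disk_usage(directory).free
    if nbytes > free:
        raise OSError(f"need {nbytes} bytes, but only {free} are free")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Unlike truncate, this asks the filesystem to allocate the blocks
        # now, so running out of space is reported here as an OSError.
        os.posix_fallocate(fd, 0, nbytes)
    finally:
        os.close(fd)


# Demo in a throwaway directory: reserve 1 MiB.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "test.mmap")
    preallocate(demo_path, 2**20)
    demo_size = os.path.getsize(demo_path)
    print(demo_size)
```

A file reserved this way could then be opened with `np.memmap(fname, dtype, mode='r+', shape=shape)`, provided `nbytes` matches the array size computed from `shape` and `dtype`.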
