Excessive disk writes when using numpy.memmap

Question

I have implemented a file-backed HashTable using numpy.memmap. It appears to be functioning correctly. However, I notice that on Linux both KSysGuard and SMART are reporting ridiculous amounts of IO writes: about 50x the amount of data that should be written. I have not tested this on other operating systems. This is the code that creates the internal memory map And

Accepted Answer

memmap works by mapping pages in virtual memory (typically to physical memory pages or, as in your case, to pages of a storage device). On most platforms, pages are at least 4 KiB in size. As a result, any write within a page may cause the whole page to be written back.

SSDs, and flash memory more generally, also work in blocks, but they often use bigger chunks. Indeed, flash memory cells endure only a very limited number of writes (e.g. 1000). When cells are overwritten too many times, they become unstable and may no longer be read or written correctly. As a result, flash storage devices avoid rewriting cells in place and instead move written data blocks to a new location, which spares the cells while remaining relatively fast. Once written, blocks cannot be mutated: a new block needs to be allocated and written to replace the old one. Thus, writing only a few bytes at random locations on a flash storage device causes it to allocate a lot of new blocks and copy a lot of (unchanged) data. This also significantly shortens the life of the target storage device.

This can explain why SMART reports such a high amount of IO writes.

Note that HDDs do not have this issue, but their random writes are very slow compared to SSDs (due to the time needed to move the heads). Alternative non-volatile RAM such as ferroelectric RAM or magneto-resistive RAM can solve this issue properly. Unfortunately, such RAM is still relatively experimental.

A possible fix is to store the modified data blocks in RAM, sort them by location and write all of them at once. If the dataset is huge and the writes are spread uniformly, then there is no solution on current mainstream hardware.
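The buffering idea can be sketched roughly as follows. This is only an illustration, not the asker's code: `BufferedMemmapWriter`, `PAGE_SIZE`, the file name and the `int64` dtype are all hypothetical, and the memmap is assumed to start at offset 0. Updates are collected in a dict, sorted by index, applied in one pass and then flushed, so each touched page is dirtied once per batch instead of once per write. When the pages actually reach the disk is still up to the operating system.

```python
import os
import tempfile

import numpy as np

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on most platforms


class BufferedMemmapWriter:
    """Hypothetical helper: keeps pending writes in RAM and applies them sorted by location."""

    def __init__(self, mm):
        self.mm = mm
        self.pending = {}  # index -> latest value; repeated writes to one slot collapse

    def write(self, index, value):
        self.pending[index] = value

    def flush(self):
        """Apply all pending writes in index order; return the number of distinct pages touched."""
        if not self.pending:
            return 0
        n = len(self.pending)
        idx = np.fromiter(self.pending.keys(), dtype=np.int64, count=n)
        vals = np.fromiter(self.pending.values(), dtype=self.mm.dtype, count=n)
        order = np.argsort(idx)  # sort the updates by location
        idx, vals = idx[order], vals[order]
        self.mm[idx] = vals  # write all of them at once
        self.mm.flush()  # ask for the dirty pages to be written back
        pages = np.unique(idx * self.mm.itemsize // PAGE_SIZE).size
        self.pending.clear()
        return pages


# Usage: five logical writes, three of them in the same page, one slot written twice.
path = os.path.join(tempfile.mkdtemp(), "table.bin")
mm = np.memmap(path, dtype=np.int64, mode="w+", shape=(1 << 16,))
writer = BufferedMemmapWriter(mm)
for i in (5, 6, 7, 600, 5):
    writer.write(i, i * 10)
pages_touched = writer.flush()
print(pages_touched, "page(s) touched in this batch")
```

This only helps when the buffered writes have some locality. With 8-byte entries spread uniformly over a huge file, nearly every update still lands in its own page, which is the case the answer describes as having no solution on current mainstream hardware.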
