Optimal way of writing to append-only files on an SSD

I want to know what’s the optimal way to log to an SSD. Think of something like a database log, where you’re writing append-only, but you also have to fsync() every transaction or few transactions to ensure application level data durability.

I’m going to give some background on how SSDs work, so if you already know all this, please skim it anyway in case I am wrong about something. Good further reading includes Emmanuel Goossaert’s 6-part guide to coding for SSDs and the paper Don’t Stack Your Log on My Log [pdf].

SSDs write and read in whole pages only. The page size differs from SSD to SSD but is typically a multiple of 4 KB; my Samsung EVO 840 uses an 8 KB page size (which, incidentally, Linus calls “unusable shit” in his usual colorful manner). SSDs cannot modify data in place; they can only write to free pages. Combining those two restrictions, updating a single byte on my EVO requires reading the 8 KB page, changing the byte, writing it out to a new 8 KB page, and updating the FTL page mapping (an SSD data structure) so that the logical address of that page, as understood by the OS, now points to the new physical page. Because the file data is also no longer contiguous within the same erase block (the smallest group of pages that can be erased), we are building up a form of fragmentation debt that will cost us in future garbage collection inside the SSD. Horribly inefficient.

As an aside, looking at my PC filesystem: C:\WINDOWS\system32>fsutil fsinfo ntfsinfo c: reports a 512-byte sector size and a 4 KB allocation (cluster) size. Neither maps to the SSD page size, which is probably not very efficient.
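For reference, here is a rough Linux counterpart of that check (just a sketch; the mount point and device path are placeholders): statvfs() reports the filesystem block size and the BLKSSZGET ioctl reports the device’s logical sector size, though neither reveals the SSD’s internal page size, which usually isn’t exposed at all.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/statvfs.h>
    #include <linux/fs.h>

    int main(void)
    {
        /* Filesystem view: block size as reported by statvfs(). */
        struct statvfs vfs;
        if (statvfs("/", &vfs) == 0)
            printf("filesystem block size: %lu bytes\n", (unsigned long)vfs.f_bsize);

        /* Device view: logical sector size via the BLKSSZGET ioctl.
         * "/dev/sda" is a placeholder; this needs read permission on the device. */
        int fd = open("/dev/sda", O_RDONLY);
        if (fd >= 0) {
            int sector = 0;
            if (ioctl(fd, BLKSSZGET, &sector) == 0)
                printf("logical sector size: %d bytes\n", sector);
            close(fd);
        }
        return 0;
    }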

There are some issues with simply writing with e.g. pwrite() to the kernel page cache and letting the OS handle writing things out. First, you need to issue an additional sync_file_range() call after calling pwrite() to actually kick off the IO; otherwise it all waits until you call fsync() and unleashes an IO storm. Second, fsync() seems to block future calls to write() on the same file. Lastly, you have no control over how the kernel writes things to the SSD, which it may do well or poorly, potentially causing a lot of write amplification.
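To make the pattern concrete, this is roughly what I mean (a sketch, not my actual code; fdatasync() stands in for fsync() here since the log doesn’t need metadata flushed on every commit):

    #define _GNU_SOURCE           /* for sync_file_range() */
    #include <fcntl.h>
    #include <unistd.h>

    /* Append with pwrite(), then nudge the kernel to start writeback of just
     * this range so dirty pages don't pile up until the next fsync().  The
     * fdatasync() at a commit point is still what makes the data durable, and
     * it's the call that seems to stall concurrent writes to the same file. */
    int buffered_append(int fd, off_t offset, const void *buf, size_t len)
    {
        if (pwrite(fd, buf, len, offset) != (ssize_t)len)
            return -1;

        if (sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE) != 0)
            return -1;

        return fdatasync(fd);     /* only needed at transaction boundaries */
    }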

Because of the above reasons, and because I need AIO for reads of the log anyway, I’m opting to write the log with O_DIRECT and O_DSYNC and take full control myself.

As I understand it, O_DIRECT requires all writes to be aligned to sector size and in whole numbers of sectors. So every time I decide to issue an append to the log, I need to add some padding to the end to bring it up to a whole number of sectors (if all writes are always a whole number of sectors, they will also be correctly aligned, at least in my code.) Ok, that’s not so bad. But my question is, wouldn’t it be better to round up to a whole number of SSD pages instead of sectors? Presumably that would eliminate write amplification?
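For concreteness, the sector padding I have in mind looks roughly like this (a sketch, not my exact code; the 512-byte sector size and the 4 KB buffer alignment are assumptions that should really be queried from the device):

    #define _GNU_SOURCE           /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define SECTOR 512UL          /* assumed logical sector size */

    static size_t round_up_to_sector(size_t len)
    {
        return (len + SECTOR - 1) & ~(SECTOR - 1);   /* SECTOR must be a power of two */
    }

    int open_log(const char *path)
    {
        /* O_DIRECT bypasses the page cache; O_DSYNC makes each write durable
         * before returning (subject to the drive's own cache behaviour). */
        return open(path, O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    }

    /* Appends one record at a sector-aligned offset, zero-padding the tail so
     * the length is a whole number of sectors. */
    int direct_append(int fd, off_t offset, const void *record, size_t len)
    {
        size_t padded = round_up_to_sector(len);
        void *buf;

        /* O_DIRECT also wants the user buffer suitably aligned; 4096 is a
         * conservative choice. */
        if (posix_memalign(&buf, 4096, padded) != 0)
            return -1;
        memset(buf, 0, padded);
        memcpy(buf, record, len);

        ssize_t n = pwrite(fd, buf, padded, offset);
        free(buf);
        return n == (ssize_t)padded ? 0 : -1;
    }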

Rounding up to whole SSD pages could burn a huge amount of space, especially if I’m writing small amounts of data to the log at a time (e.g. a couple hundred bytes). It may also be unnecessary. SSDs like the Samsung EVO have a write cache, and they don’t flush it on fsync(); instead they rely on capacitors to write the cache out to flash in the event of a power loss. In that case, maybe the SSD already does the right thing with an append-only log written sectors at a time: it may not write out the final partial page until the next append(s) arrive and complete it (unless it is forced out of the cache by large amounts of unrelated IO). Since the answer likely varies by device and maybe by filesystem, is there a way I can code up the two possibilities and test my theory? Is there some way to measure write amplification, or the number of updated/RMW pages, on Linux?
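The only host-side measurement I’ve come up with so far is to compare the bytes my application writes against the sector counter the block layer keeps in /sys/block/<dev>/stat, snapshotted before and after a test run of each variant (a rough sketch; this only shows kernel-level write amplification, and device-internal RMW would still need vendor SMART data or nvme/smartctl tooling):

    #include <stdio.h>

    /* Returns the number of 512-byte sectors the kernel has written to the
     * device since boot, or -1 on error.  "sda" is a placeholder name. */
    long long sectors_written(const char *dev)
    {
        char path[128];
        snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);

        FILE *f = fopen(path, "r");
        if (!f)
            return -1;

        /* Field order: read I/Os, read merges, read sectors, read ticks,
         * write I/Os, write merges, write sectors, ...                    */
        long long v[7] = { 0 };
        int n = fscanf(f, "%lld %lld %lld %lld %lld %lld %lld",
                       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]);
        fclose(f);
        return n == 7 ? v[6] : -1;
    }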


Answer

I will try to answer your question, as I had the same task, but on SD cards, which are still flash memory.

Short Answer

In flash memory you can only write a full page at a time (512 bytes on my SD card). Because flash memory has a limited write-cycle count, the controller chip buffers and randomizes writes to improve your drive’s lifetime.

To write even a single bit in flash memory, the entire page (512 bytes) where it sits must be erased first. So if you want to append or modify one byte somewhere, the whole page it resides in has to be erased and rewritten.

The process can be summarized as:

  • Read the whole page to a buffer
  • Modify the buffer with your added content
  • Erase the whole page
  • Rewrite the whole page with the modified buffer
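In code form, the cycle looks roughly like this (a toy illustration on an in-memory “page”; on real hardware the erase happens per erase block inside the controller, not per 512-byte page):

    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 512

    /* Illustrative only: how the controller conceptually changes one byte. */
    void flash_modify_byte(unsigned char page[PAGE_SIZE], size_t offset,
                           unsigned char value)
    {
        unsigned char buffer[PAGE_SIZE];

        memcpy(buffer, page, PAGE_SIZE);   /* 1. read the whole page        */
        buffer[offset] = value;            /* 2. modify the buffer          */
        memset(page, 0xFF, PAGE_SIZE);     /* 3. erase (flash erases to 1s) */
        memcpy(page, buffer, PAGE_SIZE);   /* 4. rewrite the whole page     */
    }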

Long Answer

The sector (page) size comes down to the hardware of the flash implementation and the physical flash driver, over which you have no control. That page has to be cleared and rewritten each time you change something.

As you probably already know, you cannot rewrite a single bit in a page without clearing and rewriting the entire 512 bytes. Flash cells have a write-cycle life of about 100,000 before a sector can be damaged. To improve lifetime, the physical driver, and sometimes the system, uses a write-randomization (wear-leveling) algorithm to avoid always writing to the same sector. (By the way, never defragment an SSD; it’s useless and only reduces its lifetime.)

The cluster, on the other hand, is handled at a higher level by the file system, and this you do have control over. Usually, when you format a new drive, you can select the cluster size, which on Windows is the Allocation Unit Size in the format window.

(Screenshot: FAT32 format window)

Most file systems that I know of work with an index located at the beginning of the disk. This index keeps track of each cluster and what is assigned to it. This means a file will occupy at least one cluster, even if it is much smaller.


Now the trade-off is: the smaller your cluster size, the bigger your index table, which takes up more space. But if you have a lot of small files, you will waste less space per file.

On the other hand, if you only store big files, you want to select the biggest cluster size, ideally just slightly larger than your file size so each file fits in a single cluster.

Since your task is logging, I would recommend logging into a single, huge file with a big cluster size. Having experimented with this type of log, I can say that having a large number of files in a single folder can cause issues, especially on embedded devices.


Implementation

Now, if you have raw access to the drive and want to really optimize, you can directly write to the disk without using the file system.

On the upside:

  • Will save you quite some disk space
  • Will render the disk tolerant in case of failure, if your design is smart enough
  • Will require much fewer resources if you are on a limited system

On the downside:

  • Much more work and debugging
  • The drive won’t be natively recognized by the system

If you only log, you don’t need a file system; you just need an entry point to the page where you write your data, a position which continuously increases.

The implementation I did on an SD card was to reserve 100 pages at the beginning of the flash to store information about the write and read location. This information would fit in a single page, but to avoid wearing out one page, I wrote sequentially over the 100 pages in a circular fashion and had an algorithm to determine which page held the most recent information.

The write position was saved every 5 minutes or so, which means that in case of a power outage I would lose at most 5 minutes of the log. It is also possible, starting from the last saved write location, to check whether the following sectors contain valid data before writing further.

This provided a very robust solution, as it is far less likely to suffer table corruption.
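A rough sketch of that scheme (the names, the sizes, and the write_page()/read_page() helpers are made up here to stand in for whatever raw block I/O you have):

    #include <stdint.h>

    #define PAGE_SIZE  512
    #define META_PAGES 100         /* pages reserved at the beginning of the flash */

    struct meta_page {
        uint64_t sequence;         /* increases on every save; highest one wins */
        uint64_t log_position;     /* where the log currently ends */
        uint8_t  pad[PAGE_SIZE - 2 * sizeof(uint64_t)];
    };

    /* Stand-ins for your raw page I/O. */
    extern int write_page(uint32_t page_no, const void *buf);
    extern int read_page(uint32_t page_no, void *buf);

    int save_position(uint64_t seq, uint64_t pos)
    {
        struct meta_page m = { .sequence = seq, .log_position = pos };
        /* Rotate over the reserved pages so no single page wears out. */
        return write_page((uint32_t)(seq % META_PAGES), &m);
    }

    int load_position(uint64_t *seq, uint64_t *pos)
    {
        struct meta_page m, best = { 0 };
        for (uint32_t i = 0; i < META_PAGES; i++)
            if (read_page(i, &m) == 0 && m.sequence > best.sequence)
                best = m;
        *seq = best.sequence;
        *pos = best.log_position;
        return best.sequence != 0 ? 0 : -1;   /* -1: no valid page found yet */
    }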

I would also suggest buffering 512 bytes and writing page by page.
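Something along these lines (again a sketch, with write_page() as the same stand-in as above; only full pages ever hit the flash):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 512

    extern int write_page(uint32_t page_no, const void *buf);   /* raw page I/O stand-in */

    struct log_buffer {
        uint8_t  data[PAGE_SIZE];
        size_t   used;             /* bytes currently buffered */
        uint32_t next_page;        /* next flash page to write */
    };

    /* Copies the message into the buffer and flushes only when a full
     * 512-byte page has accumulated. */
    int log_append(struct log_buffer *lb, const void *msg, size_t len)
    {
        const uint8_t *p = msg;

        while (len > 0) {
            size_t n = PAGE_SIZE - lb->used;
            if (n > len)
                n = len;
            memcpy(lb->data + lb->used, p, n);
            lb->used += n;
            p        += n;
            len      -= n;

            if (lb->used == PAGE_SIZE) {
                if (write_page(lb->next_page++, lb->data) != 0)
                    return -1;
                lb->used = 0;
            }
        }
        return 0;
    }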


Others

You may also want to check out log-specific file systems; they might simply do the job for you: Log-structured file system.
