
Reliable writes in Linux

My requirement is to write a never-ending stream of incoming variable-sized binary messages to the file system. Messages of average size 2 KB arrive at 1000 messages/sec, so in an hour the total data volume is 3600 * 1000 * 2 KB ≈ 6.8 GB. The main purposes of the messages are the following: 1. Archive them for auditing purposes 2. Provide a search interface

My questions are

  1. Is there open source software that solves this problem?
  2. What kind of errors can occur if the process writes in multiples of the block size and crashes in the middle of writing a block?
  3. What kind of errors can occur if the application has written a block, but the file system has not yet flushed the data to disk?
  4. Can inodes get corrupted in any scenario?
  5. Is there a file size limit in Linux?
  6. Is there an ideal file size? What are the pros and cons of large files (GBs) vs. medium files (MBs)?
  7. Any other things to watch for?
  8. My preference is to use C++, but if needed, I can switch to C.


Answer

Once write or writev returns (i.e. the OS has accepted the data), the operating system is responsible for writing it to disk. It’s not your problem any more, and it happens regardless of whether your process crashes. Note that you have no control over the exact amount of data accepted or actually written at a time, nor over whether it happens in multiples of filesystem blocks or in any particular size at all. You send a request to write, the system tells you how much it actually accepted, and it will write that to disk at its own discretion.
This will probably happen in multiples of the block size, because it makes sense for the OS to do so, but it is not guaranteed in any way (on many systems, Linux included, reading and writing are implemented via, or tightly coupled with, file mapping).
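For illustration, here is a minimal sketch of that write()-based path (assuming a descriptor opened with O_APPEND and a single writer; not production code). The only thing the application has to handle is short writes and EINTR; everything after a successful return is the kernel’s business:

    #include <cerrno>
    #include <cstddef>
    #include <fcntl.h>
    #include <unistd.h>

    // Minimal sketch: append one complete message with write(), retrying on
    // short writes and EINTR. Once write() has accepted the bytes, getting
    // them onto the disk is the kernel's responsibility.
    bool append_all(int fd, const char* data, std::size_t len) {
        while (len > 0) {
            ssize_t n = ::write(fd, data, len);
            if (n < 0) {
                if (errno == EINTR) continue;      // interrupted by a signal: retry
                return false;                      // real error (ENOSPC, EIO, ...)
            }
            data += n;                             // write() may accept fewer bytes
            len  -= static_cast<std::size_t>(n);   // than requested, so loop until done
        }
        return true;
    }

    // Usage (file name is just an example):
    //   int fd = ::open("messages.log", O_WRONLY | O_CREAT | O_APPEND, 0644);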

The same “don’t have to care” guarantee holds for file mapping (with the theoretical exception that a crashing application could in principle still write into a still-mapped area; once you’ve unmapped an area, that cannot happen even in theory). Unless you pull the plug (or the kernel crashes), the data will be written, and written consistently.
Data will only ever be written in multiples of the filesystem block size, because memory pages are multiples of device blocks and file mapping knows nothing smaller; it just works that way.
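A hedged sketch of the file-mapping variant, assuming the file was opened O_RDWR: the file is grown with ftruncate(), a page-aligned window is mapped, the message is copied in, and munmap() hands the dirty pages back to the kernel, which writes them out page by page:

    #include <cstring>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    // Sketch only: append one message through a shared mapping. After munmap()
    // the dirty pages belong to the kernel and are written out in page-sized
    // (hence block-multiple) units.
    bool append_via_mapping(int fd, off_t file_end, const char* msg, std::size_t len) {
        if (::ftruncate(fd, file_end + static_cast<off_t>(len)) != 0)
            return false;                                 // grow the file first

        long page = ::sysconf(_SC_PAGESIZE);
        off_t map_off = (file_end / page) * page;         // mmap offset must be page-aligned
        std::size_t map_len = static_cast<std::size_t>(file_end - map_off) + len;

        void* p = ::mmap(nullptr, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, map_off);
        if (p == MAP_FAILED)
            return false;

        std::memcpy(static_cast<char*>(p) + (file_end - map_off), msg, len);
        ::munmap(p, map_len);                             // kernel now owns the dirty pages
        return true;
    }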

You can get some control over what is on the disk (neglecting any possible unbuffered on-disk write cache) with fdatasync. When that function returns, whatever was in the buffers before has been sent to the disk.
However, that still doesn’t prevent your process from crashing in another thread in the meantime, and it doesn’t prevent someone from pulling the plug. fdatasync is preferable to fsync since it doesn’t touch anything near the inode, which makes it faster and safer (you may lose the last data written in a subsequent crash, since the file length has not been updated yet, but you should never destroy or corrupt the whole file).
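A sketch of using fdatasync() as a periodic durability barrier; the batch size here is an arbitrary illustration, not something prescribed by the answer:

    #include <unistd.h>

    // Sketch: flush after every N appended messages. When fdatasync() returns,
    // everything written so far has been handed to the disk (modulo any
    // unbuffered on-disk write cache), without forcing unrelated inode updates.
    void sync_every(int fd, unsigned long long messages_written,
                    unsigned long long batch = 1000) {
        if (messages_written % batch == 0)
            ::fdatasync(fd);
    }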

C library functions (fwrite) do their own buffering and give you control over the amount of data you write, but having “written” data only means it is stored in a buffer owned by the C library (inside your process). If the process dies, the data is gone, and you have no control over how, or whether, the data ever hits the disk. (N.b.: you do have some control insofar as you can fflush, which immediately passes the contents of the buffers to the underlying write function, most likely writev, before returning. With that, you’re back at the first paragraph.)
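For contrast, a small sketch of that stdio path; until fflush() is called, the bytes live only in a buffer inside the process:

    #include <cstddef>
    #include <cstdio>

    // Sketch: fwrite() only copies into the C library's user-space buffer;
    // a crash at that point loses the data. fflush() pushes the buffer down
    // to write()/writev(), after which the kernel-buffering rules above apply.
    void stdio_append(std::FILE* f, const char* msg, std::size_t len) {
        std::fwrite(msg, 1, len, f);   // data is in the process only
        std::fflush(f);                // hand it to the kernel
    }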

Asynchronous IO (kernel aio) bypasses kernel buffers and usually pulls the data directly from your process. If your process dies, your data is gone. Glibc’s aio implementation instead uses threads that block on write, so the same as in the first paragraph applies.
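As a point of reference, a hedged sketch of the POSIX AIO interface (the glibc implementation mentioned above services these requests with background threads that call write). The key point for this discussion is that the buffer must remain valid, i.e. inside your living process, until the request completes:

    #include <aio.h>
    #include <cerrno>
    #include <cstddef>
    #include <cstring>

    // Sketch: submit an asynchronous write. Until aio_error() stops returning
    // EINPROGRESS, the data may exist only in this process's memory.
    bool submit_async_write(int fd, const char* buf, std::size_t len, off_t offset,
                            aiocb& cb) {
        std::memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = const_cast<char*>(buf);   // must stay valid until completion
        cb.aio_nbytes = len;
        cb.aio_offset = offset;
        return aio_write(&cb) == 0;               // 0 = queued, -1 = submission failed
    }

    bool async_write_done(aiocb& cb, ssize_t& written) {
        if (aio_error(&cb) == EINPROGRESS)
            return false;                         // still in flight
        written = aio_return(&cb);                // bytes written, or -1 on error
        return true;
    }

(On older glibc this needs linking with -lrt.)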

What happens if you pull the plug or hit the “off” switch at any time? Nobody knows.
Usually some data will be lost; an operating system can give many guarantees, but it can’t do magic. In theory you might have a system with battery-backed RAM, or one with a huge dedicated disk cache that is also battery-backed, but nobody can tell. In any case, plan for losing data.
That said, what has already been written should not normally get corrupted if you keep appending to a file (though really anything can happen, and “should not” does not mean a lot).

All in all, using either write in append mode or file mapping should be good enough; they’re as good as you can get anyway. Apart from sudden power loss, they’re reliable and efficient.
If power failure is an issue, a UPS will give better guarantees than any software solution can provide.

As for file sizes, I don’t see any reason to artificially limit them (assuming a reasonably modern filesystem). The usual file size limits for “standard” Linux filesystems (if there is any such thing) are in the terabyte range.
Either way, if you feel uneasy with the idea that corrupting one file for whatever reason could destroy 30 days’ worth of data, start a new file once every day. It doesn’t cost anything extra.
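If you go the one-file-per-day route, here is a sketch of such rotation (the directory argument and name pattern are made up for the example):

    #include <ctime>
    #include <fcntl.h>
    #include <string>

    // Sketch: open one append-only file per UTC calendar day, so corruption of
    // a single file can cost at most one day of messages.
    int open_daily_file(const std::string& dir) {
        std::time_t now = std::time(nullptr);
        std::tm day;
        gmtime_r(&now, &day);

        char name[64];
        std::strftime(name, sizeof name, "messages-%Y-%m-%d.log", &day);

        std::string path = dir + "/" + name;
        return ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
    }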

User contributions licensed under: CC BY-SA