High-performance reading – Linux/pthreads

I have a moderately large binary file consisting of independent blocks, like this:

header1
data1
header2
data2
header3
data3
...

The number of blocks, the size of each block and the total size of the file vary quite a lot, but typical numbers are ~1000 blocks with an average block size of 100 kB. The files are generated by an external application which I have no control over, but I want to read them as fast as possible. In many cases I am only interested in a fraction (e.g. 10%) of the blocks, and this is the case I will optimize for.

My current implementation is like this (a rough sketch follows the list):

  1. Open the file and read all the headers, using the information in each header to fseek() to the next header location; retain an open FILE * pointer.
  2. When data is requested, use fseek() to locate the data block, read all the data and return it.

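A rough sketch of this approach (the two-field header layout and the file name blocks.bin are placeholders, since the real format is defined by the external application):

    #include <stdio.h>
    #include <stdlib.h>

    struct header {
        unsigned int id;        /* assumed: block identifier       */
        unsigned int data_size; /* assumed: size of the data block */
    };

    struct index_entry {
        struct header hdr;
        long data_offset;       /* where this block's data starts  */
    };

    int main(void)
    {
        FILE *fp = fopen("blocks.bin", "rb");
        if (!fp) { perror("fopen"); return 1; }

        /* Pass 1: read every header, remember where its data starts,
           then fseek() past the data to the next header. */
        struct index_entry idx[1000];
        size_t n = 0;
        struct header h;
        while (n < 1000 && fread(&h, sizeof h, 1, fp) == 1) {
            idx[n].hdr = h;
            idx[n].data_offset = ftell(fp);
            n++;
            if (fseek(fp, (long)h.data_size, SEEK_CUR) != 0)
                break;
        }

        /* On demand: seek to the wanted block and read only its data. */
        size_t wanted = 0;                    /* chosen by the caller */
        char *data = malloc(idx[wanted].hdr.data_size);
        if (data) {
            fseek(fp, idx[wanted].data_offset, SEEK_SET);
            fread(data, 1, idx[wanted].hdr.data_size, fp);
        }

        free(data);
        fclose(fp);
        return 0;
    }
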
This works fine, but I was wondering whether it might be possible to speed things up using e.g. aio, mmap or other techniques I have only heard of.

Any thoughts?

Joakim

Answer

Most of the time is probably spent accessing the disk, so perhaps buying an SSD is sensible. (Whatever you do, your application is I/O-bound.)

Apparently, your file is only about 100 MB. You could get it into the kernel's page cache just by reading it, e.g. with cat yourfile > /dev/null before running your program. For such a small file (it fits in RAM on any reasonable machine), I wouldn't worry that much.
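
If you would rather warm the cache from inside the program than rely on an external cat, posix_fadvise(2) with POSIX_FADV_WILLNEED gives the kernel a similar hint. A minimal sketch (the file name blocks.bin is just a placeholder):

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("blocks.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Ask the kernel to start reading the whole file into the page
           cache (len == 0 means "to the end of the file"). This is only
           a hint and returns immediately. */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: error %d\n", err);

        /* ... proceed to read headers/blocks as usual ... */
        close(fd);
        return 0;
    }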

You could pre-process the file, e.g. into a database (sqlite, or a real RDBMS like PostgreSQL) or just a gdbm indexed file.
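
For instance, a minimal gdbm sketch that stores each block's file offset keyed by a block id (the id and offset values here are made up; link with -lgdbm):

    #include <gdbm.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        GDBM_FILE db = gdbm_open("blocks.idx", 0, GDBM_WRCREAT, 0644, NULL);
        if (!db) { fprintf(stderr, "gdbm_open failed\n"); return 1; }

        unsigned int block_id = 42;   /* hypothetical block identifier    */
        long offset = 123456L;        /* hypothetical offset of its data  */

        datum key = { (char *)&block_id, sizeof block_id };
        datum val = { (char *)&offset,   sizeof offset   };
        gdbm_store(db, key, val, GDBM_REPLACE);

        /* Later: look up the offset without scanning the binary file. */
        datum found = gdbm_fetch(db, key);
        if (found.dptr) {
            long off;
            memcpy(&off, found.dptr, sizeof off);
            printf("block %u starts at offset %ld\n", block_id, off);
            free(found.dptr);         /* gdbm_fetch returns malloc()ed data */
        }

        gdbm_close(db);
        return 0;
    }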

If using <stdio.h> you could set a bigger buffer with setbuffer (or the standard setvbuf), or call fopen with the "rm" mode (the m is a GNU glibc extension asking for the file to be mmap-ed).
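
A sketch of both options (the file name and the 1 MiB buffer size are arbitrary):

    #include <stdio.h>

    int main(void)
    {
        /* Option 1: a plain stream with a much larger stdio buffer. */
        FILE *fp = fopen("blocks.bin", "r");
        if (!fp) { perror("fopen"); return 1; }

        static char buf[1 << 20];
        /* setbuffer(fp, buf, sizeof buf) is the BSD spelling of the same idea. */
        if (setvbuf(fp, buf, _IOFBF, sizeof buf) != 0)
            fprintf(stderr, "setvbuf failed, keeping the default buffer\n");

        /* ... read as before ... */
        fclose(fp);

        /* Option 2: ask glibc to mmap the file behind the FILE*
           ("m" is a GNU extension, silently ignored elsewhere). */
        FILE *fm = fopen("blocks.bin", "rm");
        if (fm) {
            /* ... read as before; no explicit buffer management needed ... */
            fclose(fm);
        }
        return 0;
    }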

You could use mmap with madvise.
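
A sketch combining the two (file name, offset and length are placeholders; a real block offset must be rounded down to a page boundary before calling madvise):

    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("blocks.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        void *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* The access pattern is scattered blocks, so turn off the
           default sequential readahead. */
        madvise(base, st.st_size, MADV_RANDOM);

        /* When a block is about to be used, ask the kernel to fault it
           in early. Offset 0 is page-aligned here by construction. */
        off_t  blk_off = 0;                 /* from your header index */
        size_t blk_len = 4096;              /* size of that block     */
        madvise((char *)base + blk_off, blk_len, MADV_WILLNEED);

        /* The block is now addressable directly, no read() needed. */
        const unsigned char *block = (const unsigned char *)base + blk_off;
        printf("first byte: %u\n", (unsigned)block[0]);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }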

You could (perhaps in a separate thread) use the readahead syscall.
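
A sketch with a prefetch thread (file name, offset and length are placeholders; compile with -pthread):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Describes one region to prefetch into the page cache. */
    struct prefetch_arg {
        int    fd;
        off_t  offset;
        size_t length;
    };

    static void *prefetch_thread(void *p)
    {
        struct prefetch_arg *a = p;
        /* readahead() blocks until the data is in the page cache;
           a failure just means no prefetch, not a fatal error. */
        if (readahead(a->fd, a->offset, a->length) != 0)
            perror("readahead");
        return NULL;
    }

    int main(void)
    {
        int fd = open("blocks.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* e.g. the next block you expect to need */
        struct prefetch_arg arg = { fd, 0, 100 * 1024 };
        pthread_t tid;
        if (pthread_create(&tid, NULL, prefetch_thread, &arg) != 0) {
            perror("pthread_create");
            return 1;
        }

        /* ... do other work, then read the block with pread()/fread()
           as usual; it should already be cached by then ... */

        pthread_join(tid, NULL);
        close(fd);
        return 0;
    }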

But your file seems small enough that you should not bother that much. Are you sure it is really a performance issue? Do you read that file many thousands of times per day, or do you have many hundreds of such files?
