How to choose chunk size when reading a large file?

I know that reading a file with a chunk size that is a multiple of the filesystem block size is better.

1) Why is that the case? I mean, let's say the block size is 8 KB and I read 9 KB. That means the filesystem has to go and fetch two full blocks (16 KB) and then throw away the extra 7 KB. Yes, it did some extra work, but does that make much of a difference unless your block size is really huge?

I mean, yes, if I am reading a 1 TB file then this definitely makes a difference.

The other reason I can think of is that a block refers to a group of sectors on the hard disk (please correct me if I'm wrong). So it could be pointing to 8, 16, or 32 sectors, or just one. So the hard disk would essentially have to do more work if a block points to more sectors? Am I right?

2) So let's say the block size is 8 KB. Do I now read 16 KB at a time? 1 MB? 1 GB? What should I use as a chunk size? I know available memory is a limitation, but apart from that, what other factors affect my choice?

Thanks a lot in advance for all the answers.

Answer

Theoretically, the fastest I/O occurs when the buffer is page-aligned and its size is a multiple of the system block size.
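
As a rough illustration, here is a minimal Python sketch (assuming a POSIX-like system, where os.stat exposes the filesystem's preferred I/O block size as st_blksize) of rounding a desired chunk size up to a multiple of that block size:

    import os

    def aligned_chunk_size(path, target=1 << 20):
        """Round target (default 1 MiB) up to the nearest multiple of
        the filesystem's preferred I/O block size for this file."""
        block = os.stat(path).st_blksize  # POSIX; may be absent on Windows
        return ((target + block - 1) // block) * block

On a filesystem with 4 KB blocks this just returns the 1 MiB target unchanged, since 1 MiB is already a multiple of 4 KB; the rounding only matters for unusual block sizes or targets.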

If the file were stored contiguously on the hard disk, the fastest throughput would be attained by reading cylinder by cylinder. (There might even be no rotational latency then, since when you read a whole track you don't need to start at the beginning; you can start in the middle and wrap around.) Unfortunately, nowadays that is nearly impossible to do, since hard disk firmware hides the physical layout of the sectors and may use replacement sectors, requiring seeks even while reading a single track. The OS file system may also deliberately spread the file's blocks all over the disk (or at least all over a cylinder group), to avoid having to do long seeks over big files when accessing small files.

So instead of considering physical tracks, you may try to take the hard disk's buffer size into account. Most hard disks have a buffer size of 8 MB, some 16 MB. So reading the file in chunks of up to 1 MB or 2 MB should let the hard disk firmware optimize the throughput without stalling its buffer.
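
For instance, a plain sequential read loop in that range might look like the following sketch (the file name is a placeholder, and the byte counter stands in for whatever per-chunk work you actually do):

    CHUNK_SIZE = 1 << 20  # 1 MiB: a multiple of common block sizes,
    # and comfortably under typical 8-16 MB disk buffers

    total = 0
    with open("large_file.bin", "rb") as f:  # placeholder file name
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:  # empty bytes object signals EOF
                break
            total += len(chunk)  # stand-in for real per-chunk work
    print(f"read {total} bytes")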

But then, if there are a lot of layers in between (e.g., a RAID), all bets are off.

Really, the best you can do is to benchmark your particular circumstances.
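
A rough benchmarking sketch along those lines (the file name and candidate sizes are illustrative; beware that the OS page cache makes repeated runs over the same file artificially fast, so use a file larger than RAM or clear the cache between runs):

    import time

    def time_read(path, chunk_size):
        # Time one full sequential pass over the file.
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return time.perf_counter() - start

    path = "large_file.bin"  # placeholder
    for size in (8 << 10, 64 << 10, 1 << 20, 8 << 20):  # 8 KiB .. 8 MiB
        print(f"{size >> 10:6d} KiB: {time_read(path, size):.3f} s")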
