I am currently working with CSV files that can be tens of GB in size, and I need to edit the headers dynamically depending on the use case.
For this I am using:
sed -i '1,1s/id/id:ID(Person)/g' etc.
which has the desired effect of only editing the headers, but it can take upwards of 10 seconds to complete. I imagine this is because the whole file is still being streamed, but I cannot find a way to stop this from occurring.
Any ideas or a point in the right direction would be greatly appreciated.
Answer
sed is not the problem. The problem is that you are streaming a 10GB file. If this is the only operation you are doing on it, sed is probably not much worse than any other line-based utility (awk, etc.).
Perl may do a better job if you read the whole file first, but your memory footprint is going to be pretty big and depending on your system, you may start paging.
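To make that trade-off concrete, here is a minimal sketch of the "read the whole file first" idea, written in C rather than Perl purely for illustration; the file name is a placeholder and error handling is mostly omitted. The point is simply that the entire file has to sit in memory at once, which is what risks paging on a 10GB input.

```c
/* Sketch only: slurp the whole file into memory before editing.
 * "people.csv" is a placeholder name; real code needs proper error
 * handling and a 64-bit-safe size type. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *f = fopen("people.csv", "rb");
    if (!f) return 1;

    fseek(f, 0, SEEK_END);
    long size = ftell(f);          /* total file size in bytes */
    rewind(f);

    char *buf = malloc((size_t)size);  /* the whole 10GB lives in RAM */
    if (!buf) return 1;                /* the allocation itself may fail */
    if (fread(buf, 1, (size_t)size, f) != (size_t)size) return 1;
    fclose(f);

    /* ... edit the header at the start of buf, then write buf back out ... */

    free(buf);
    return 0;
}
```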
If it is something you are going to do frequently and for a long time, you may be able to do better in a lower-level language by reading larger blocks of data, allowing the block layer to optimize your disk access for you. If you keep the “chunks” large enough for the block layer, but small enough to avoid paging, you should be able to hit the sweet spot.
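As a rough sketch of that block-oriented approach, the C program below rewrites the header line and then copies the rest of the file in large chunks into a new file (much as sed -i writes a new copy behind the scenes). The file names, the 8MB chunk size, and the assumption that the header fits in 64KB are illustrative only.

```c
/* Sketch: rewrite the header line, then copy the remainder in large
 * chunks so the block layer can schedule the disk I/O efficiently.
 * File names and sizes below are assumptions for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (8 * 1024 * 1024)  /* "large enough for the block layer" */

int main(void) {
    FILE *in  = fopen("people.csv", "rb");
    FILE *out = fopen("people.fixed.csv", "wb");
    if (!in || !out) return 1;

    /* Read the original header line (assumed to fit in 64KB). */
    char header[65536];
    if (!fgets(header, sizeof header, in)) return 1;

    /* Apply the same id -> id:ID(Person) substitution as the sed command
     * (first occurrence only, for brevity). */
    char *p = strstr(header, "id");
    if (p) {
        fwrite(header, 1, (size_t)(p - header), out);
        fputs("id:ID(Person)", out);
        fputs(p + 2, out);
    } else {
        fputs(header, out);
    }

    /* Copy everything after the header in big chunks. */
    char *buf = malloc(CHUNK);
    if (!buf) return 1;
    size_t n;
    while ((n = fread(buf, 1, CHUNK, in)) > 0)
        fwrite(buf, 1, n, out);

    free(buf);
    fclose(in);
    fclose(out);
    return 0;
}
```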
Probably not worth it for a one-off conversion.