I have these functions to process a 2GB text file. I’m splitting it into 6 parts for simultaneous processing but it is still taking 4+ hours.
What else can I try to make the script faster?
A bit of detail:
- I feed my input csv into a while loop to be read line by line.
- I grab the values of 5 fields from each csv line in the read2col function.
- The awk in my mainf function takes the values from read2col and does some arithmetic calculations. I round the result to 2 decimal places, then print the line to a text file.
Sample data:
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
Script:
read2col()
{
    is_one_way=$(echo "$line" | awk -F'","' '{print $7}')
    price_outbound=$(echo "$line" | awk -F'","' '{print $30}')
    price_exc=$(echo "$line" | awk -F'","' '{print $25}')
    tax=$(echo "$line" | awk -F'","' '{print $27}')
    price_inc=$(echo "$line" | awk -F'","' '{print $26}')
}

#################################################

# for each line in the csv
mainf()
{
    cd $infarepath

    while read -r line; do
        # read the value of csv fields into variables
        read2col
        if [[ $is_one_way == 0 ]]; then
            if [[ $price_outbound > 0 ]]; then
                # calculate price inc and print the entire line to txt file
                echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax \
                    -F'","' 'BEGIN {OFS = FS} {$25=pout; $26=(pout+(tax / 2)); print}' >> "$csvsplitfile".tmp
            else
                # divide price exc and inc by 2 if price outbound is not greater than 0
                echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc \
                    -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2); $26=(pinc / 2); print}' >> "$csvsplitfile".tmp
            fi
        else
            echo $line >> "$csvsplitfile".tmp
        fi
    done < $csvsplitfile
}
Answer
The first thing you should do is stop invoking six subshells for running awk for every single line of input. Let's do some quick, back-of-the-envelope calculations.
Assuming your input lines are about 292 characters (as per your example), a 2G file will consist of a little over 7.3 million lines. That means you are starting and stopping a whopping forty-four million processes.
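As a quick sanity check, you can reproduce that arithmetic yourself (a sketch using bc; the 292-character line length is taken from the sample row above, and six processes per line counts the five awk calls in read2col plus the one in mainf):

pax$ echo '2 * 1024^3 / 292' | bc    # bytes in a 2G file / bytes per line
7354396
pax$ echo '7354396 * 6' | bc         # six awk processes per line
44126376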
And, while Linux admirably handles fork and exec as efficiently as possible, it's not without cost:
pax$ time for i in {1..44000000} ; do true ; done

real    1m0.946s
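Note that true is a bash builtin, so the loop above never actually forks; it mostly measures loop overhead. To feel the real cost of process creation on your own machine, force a genuine fork and exec by calling the external binary instead (a sketch; use a much smaller count, since the full forty-four million would take hours):

# /bin/true is an external program, so each iteration pays a full fork+exec
pax$ time for i in {1..10000} ; do /bin/true ; done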
In addition, bash hasn't really been optimised for this sort of processing; its design leads to sub-optimal behaviour for this specific use case. For details on this, see this excellent answer over on one of our sister sites.
An analysis of the two methods of file processing (one program reading an entire file, where each line has just hello on it, versus bash reading it a line at a time) is shown below. The two commands used to get the timings were:
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )
For varying file sizes (user+sys time, averaged over a few runs), it's quite interesting:
# of lines    cat-method    while-method
----------    ----------    ------------
     1,000        0.375s          0.031s
    10,000        0.391s          0.234s
   100,000        0.406s          1.994s
 1,000,000        0.391s         19.844s
10,000,000        0.375s        205.583s
44,000,000        0.453s        889.402s
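If you want to reproduce these measurements, a minimal sketch for generating a test file (each line containing just hello, as described above) and timing both methods:

# generate a one-million-line test file, one "hello" per line
yes hello | head -n 1000000 > somefile

# time both approaches against it
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )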
From this it appears that, while the while method can hold its own for smaller data sets, it really does not scale well.
Since awk itself has ways to do calculations and formatted output, processing the file with one single awk script, rather than your bash/multi-awk-per-line combination, will make the cost of creating all those processes and the line-based delays go away.
This script would be a good first attempt; let's call it prog.awk:
BEGIN {
    FMT = "%.2f"    # two-decimal rounding, as with the original CONVFMT
    OFS = FS        # keep the "," separator on output
}
{
    # grab the five fields of interest (the same fields as read2col)
    isOneWay      = $7
    priceOutbound = $30
    priceExc      = $25
    tax           = $27
    priceInc      = $26

    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            # set price exc/inc from the outbound price and half the tax
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            # otherwise halve both prices
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }
    print
}
You just run that single awk script with:
awk -F'","' -f prog.awk data.txt
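And, if you still want to keep your six-way split, the same one-process-per-file idea extends naturally: run one awk per chunk in the background and wait for them all. This is only a sketch; the split_*.csv names are hypothetical, so substitute whatever your splitter actually produces:

# one awk process per chunk, all running concurrently
for f in split_*.csv ; do
    awk -F'","' -f prog.awk "$f" > "$f.tmp" &
done
wait    # block until every background awk has finished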
With the test data you provided, here’s the before and after, with markers for field numbers 25 and 26:
                                                                                                                                                                                     <-25->   <-26->
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
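If the lines are too wide to compare by eye, a quick way to pull out just those two fields from the input and the output (a sketch; data.txt and out.txt stand in for your real input and output files):

# print only fields 25 and 26 from each file for a side-by-side check
awk -F'","' '{print $25, $26}' data.txt
awk -F'","' '{print $25, $26}' out.txt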