I have these functions to process a 2GB text file. I’m splitting it into 6 parts for simultaneous processing but it is still taking 4+ hours.
What else can I try to make the script faster?
A bit of detail:
- I feed my input csv into a while loop to be read line by line.
- I grab the values of 5 fields from each csv line in the read2col function.
- The awk in my mainf function takes the values from read2col and does some arithmetic calculations. I round the result to 2 decimal places, then print the line to a text file.
Sample data:
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
Script:
read2col()
{
    is_one_way=$(echo "$line" | awk -F'","' '{print $7}')
    price_outbound=$(echo "$line" | awk -F'","' '{print $30}')
    price_exc=$(echo "$line" | awk -F'","' '{print $25}')
    tax=$(echo "$line" | awk -F'","' '{print $27}')
    price_inc=$(echo "$line" | awk -F'","' '{print $26}')
}

#################################################

# for each line in the csv
mainf()
{
    cd $infarepath

    while read -r line; do
        # read the value of csv fields into variables
        read2col
        if [[ $is_one_way == 0 ]]; then
            if [[ $price_outbound > 0 ]]; then
                # calculate price inc and print the entire line to txt file
                echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax \
                    -F'","' 'BEGIN {OFS = FS} {$25=pout; $26=(pout+(tax / 2)); print}' >> "$csvsplitfile".tmp
            else
                # divide price exc and inc by 2 if price outbound is not greater than 0
                echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc \
                    -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2); $26=(pinc / 2); print}' >> "$csvsplitfile".tmp
            fi
        else
            echo $line >> "$csvsplitfile".tmp
        fi
    done < $csvsplitfile
}
Answer
The first thing you should do is stop invoking six subshells for running awk for every single line of input. Let's do some quick, back-of-the-envelope calculations.
Assuming your input lines are about 292 characters (as per your example), a 2G file will consist of a little over 7.3 million lines. That means you are starting and stopping a whopping forty-four million processes.
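As a quick sanity check, you can reproduce that arithmetic yourself (a sketch using bc; the 292-character line length is taken from the sample row above, and six processes per line counts the five awk calls in read2col plus the one in mainf):

pax$ echo '2 * 1024^3 / 292' | bc    # bytes in a 2G file / bytes per line
7354396
pax$ echo '7354396 * 6' | bc         # six awk processes per line
44126376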
And, while Linux admirably handles fork and exec as efficiently as possible, it's not without cost:
pax$ time for i in {1..44000000} ; do true ; done

real    1m0.946s
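Note that true is a bash builtin, so the loop above never actually forks; it mostly measures loop overhead. To feel the real cost of process creation on your own machine, force a genuine fork and exec by calling the external binary instead (a sketch; use a much smaller count, since the full forty-four million would take hours):

# /bin/true is an external program, so each iteration pays a full fork+exec
pax$ time for i in {1..10000} ; do /bin/true ; done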
In addition, bash hasn't really been optimised for this sort of processing; its design leads to sub-optimal behaviour for this specific use case. For details on this, see this excellent answer over on one of our sister sites.
An analysis of the two methods of file processing (one program reading an entire file, where each line has just hello on it, versus bash reading it a line at a time) is shown below. The two commands used to get the timings were:
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )
For varying file sizes (user+sys time, averaged over a few runs), it's quite interesting:
# of lines    cat-method    while-method
----------    ----------    ------------
     1,000        0.375s          0.031s
    10,000        0.391s          0.234s
   100,000        0.406s          1.994s
 1,000,000        0.391s         19.844s
10,000,000        0.375s        205.583s
44,000,000        0.453s        889.402s
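If you want to reproduce these measurements, a minimal sketch for generating a test file (each line containing just hello, as described above) and timing both methods:

# generate a one-million-line test file, one "hello" per line
yes hello | head -n 1000000 > somefile

# time both approaches against it
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )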
From this it appears that, while the while method can hold its own for smaller data sets, it really does not scale well.
Since awk itself has ways to do calculations and formatted output, processing the file with one single awk script, rather than your bash/multi-awk-per-line combination, will make the cost of creating all those processes and the line-based delays go away.
This script would be a good first attempt; let's call it prog.awk:
BEGIN {
    FMT = "%.2f"    # two-decimal rounding, as with the original CONVFMT
    OFS = FS        # keep the "," separator on output
}
{
    # grab the five fields of interest (the same fields as read2col)
    isOneWay      = $7
    priceOutbound = $30
    priceExc      = $25
    tax           = $27
    priceInc      = $26

    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            # set price exc/inc from the outbound price and half the tax
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            # otherwise halve both prices
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }
    print
}
You just run that single awk script with:
awk -F'","' -f prog.awk data.txt
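And, if you still want to keep your six-way split, the same one-process-per-file idea extends naturally: run one awk per chunk in the background and wait for them all. This is only a sketch; the split_*.csv names are hypothetical, so substitute whatever your splitter actually produces:

# one awk process per chunk, all running concurrently
for f in split_*.csv ; do
    awk -F'","' -f prog.awk "$f" > "$f.tmp" &
done
wait    # block until every background awk has finished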
With the test data you provided, here’s the before and after, with markers for field numbers 25 and 26:
                                                                                                                                                                                     <-25->   <-26->
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
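If the lines are too wide to compare by eye, a quick way to pull out just those two fields from the input and the output (a sketch; data.txt and out.txt stand in for your real input and output files):

# print only fields 25 and 26 from each file for a side-by-side check
awk -F'","' '{print $25, $26}' data.txt
awk -F'","' '{print $25, $26}' out.txt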