optimizing awk command for large file

I have these functions to process a 2GB text file. I'm splitting it into 6 parts for simultaneous processing, but it is still taking 4+ hours.

What else can I try to make the script faster?

A bit of detail:

  1. I feed my input csv into a while loop to be read line by line.
  2. I grab the values of 4 fields from the csv line in the read2col function.
  3. The awk in my mainf function takes the values from read2col and does some arithmetic calculations, rounding the result to 2 decimal places. Then it prints the line to a text file.

Sample data:

"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"

Script:

read2col()
{
is_one_way=$(echo "$line"| awk -F'","' '{print $7}')
price_outbound=$(echo "$line"| awk -F'","' '{print $30}')
price_exc=$(echo "$line"| awk -F'","' '{print $25}')
tax=$(echo "$line"| awk -F'","' '{print $27}')
price_inc=$(echo "$line"| awk -F'","' '{print $26}')
}


#################################################
#for each line in the csv
mainf()
{
cd $infarepath

while read -r line; do
        #read the value of csv fields into variables
        read2col

        if [[ $is_one_way == 0 ]]; then
                if [[ $price_outbound > 0 ]]; then
                        #calculate price inc and print the entire line to txt file
                        echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax -F'","' 'BEGIN {OFS = FS} {$25=pout;$26=(pout+(tax / 2)); print}' >>"$csvsplitfile".tmp
                else
                        #divide price ecx and inc by 2 if price outbound is not greater than 0
                        echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2);$26=(pinc /2); print}' >>"$csvsplitfile".tmp
                fi
        else
                echo $line >>"$csvsplitfile".tmp
        fi

done < $csvsplitfile
}

Answer

The first thing you should do is stop invoking six subshells for running awk for every single line of input. Let’s do some quick, back-of-the-envelope calculations.

Assuming your input lines are about 292 characters (as per your example), a 2GB file will consist of a little over 7.3 million lines. That means you are starting and stopping a whopping forty-four million processes.
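
To make that arithmetic concrete, here's a rough sketch in the shell (treating each line as 292 bytes, newline included):

pax$ echo $(( 2 * 1024 * 1024 * 1024 / 292 ))   # lines in the 2GB file
7354396
pax$ echo $(( 7354396 * 6 ))                    # six awk invocations per line
44126376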

And, while Linux admirably handles fork and exec as efficiently as possible, it’s not without cost:

pax$ time for i in {1..44000000} ; do true ; done
real 1m0.946s

In addition, bash hasn't really been optimised for this sort of processing; its design leads to sub-optimal behaviour for this specific use case. For details on this, see this excellent answer over on one of our sister sites.

An analysis of the two methods of file processing is shown below: one program reading an entire file in one go, and bash reading it a line at a time (each line of the test file contains just hello). The two commands used to get the timings were:

time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )

For varying file sizes (user+sys time, averaged over a few runs), it’s quite interesting:

# of lines   cat-method   while-method
----------   ----------   ------------
     1,000       0.375s         0.031s
    10,000       0.391s         0.234s
   100,000       0.406s         1.994s
 1,000,000       0.391s        19.844s
10,000,000       0.375s       205.583s
44,000,000       0.453s       889.402s

From this, it appears that, while the while-read method can hold its own for smaller data sets, it really does not scale well.
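
If you want to reproduce those timings yourself, here's a sketch (the somefile name and the line count are just placeholders; average the user+sys figures over a few runs, as in the table above):

pax$ yes hello | head -n 1000000 >somefile
pax$ time ( cat somefile >/dev/null )
pax$ time ( while read -r x ; do echo $x >/dev/null ; done <somefile )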


Since awk itself has ways to do calculations and formatted output, processing the file with one single awk script, rather than your bash/multi-awk-per-line combination, will make the cost of creating all those processes and line-based delays go away.

This script would be a good first attempt; let's call it prog.awk:

BEGIN {
    FMT = "%.2f"    # rounding format for the recalculated prices
    OFS = FS        # keep the same "," separator on output
}
{
    # pull the fields of interest into named variables
    isOneWay=$7
    priceOutbound=$30
    priceExc=$25
    tax=$27
    priceInc=$26

    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            # price exc becomes the outbound price, price inc adds half the tax
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            # otherwise halve both the exclusive and inclusive prices
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }
    print
}

You just run that single awk script with:

awk -F'","' -f prog.awk data.txt

With the test data you provided, here’s the before and after, with markers for field numbers 25 and 26:

                                                                                                                                                                                      <-25->   <-26->
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"