How to efficiently loop through the lines of a file in Bash?

I have a file example.txt with about 3000 lines, each containing a string. A small example would be:

>cat example.txt
saudifh
sometestPOIFJEJ
sometextASLKJND
saudifh
sometextASLKJND
IHFEW
foo
bar

I want to find all repeated lines in this file and output them. The desired output would be:

>checkRepetitions.sh
found two equal lines: index1=1 , index2=4 , value=saudifh
found two equal lines: index1=3 , index2=5 , value=sometextASLKJND

I made a script, checkRepetitions.sh:

#!/bin/bash
size=$(cat example.txt | wc -l)
for i in $(seq 1 $size); do
  i_next=$((i+1))
  line1=$(cat example.txt | head -n$i | tail -n1)
  for j in $(seq $i_next $size); do
    line2=$(cat example.txt | head -n$j | tail -n1)
    if [ "$line1" = "$line2" ]; then
      echo "found two equal lines: index1=$i , index2=$j , value=$line1"
    fi
  done
done

However, this script is very slow; it takes more than 10 minutes to run, while in Python the same task takes less than 5 seconds. I tried to store the file in memory by doing lines=$(cat example.txt) and then doing line1=$(cat $lines | cut -d',' -f$i), but this is still very slow…
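
As an aside, here is a minimal sketch of that in-memory idea done with a bash array, assuming bash 4+ for mapfile (the script name inMemory.sh is illustrative, not from the original post). It removes the per-iteration cat/head/tail processes, though the comparison loop is still O(n^2):

$ cat inMemory.sh
#!/bin/bash
# sketch: read all lines into an array once, then compare pairs in memory
mapfile -t lines < example.txt
size=${#lines[@]}
for ((i=0; i<size; i++)); do
  for ((j=i+1; j<size; j++)); do
    if [[ ${lines[i]} == "${lines[j]}" ]]; then
      echo "found two equal lines: index1=$((i+1)) , index2=$((j+1)) , value=${lines[i]}"
    fi
  done
done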

Answer

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why your script is so slow.
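
(A quick aside: if you only need the duplicated values and not their line numbers, a sort | uniq -d pipeline does it without any explicit loop, though it reorders the output:)

$ sort example.txt | uniq -d
saudifh
sometextASLKJND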

$ cat tst.awk
# append each line's number to the list of occurrences for its content
{ val2hits[$0] = val2hits[$0] FS NR }
END {
    for (val in val2hits) {
        # numHits is how many times this value appeared in the file
        numHits = split(val2hits[val],hits)
        if ( numHits > 1 ) {
            printf "found %d equal lines:", numHits
            for ( hitNr=1; hitNr<=numHits; hitNr++ ) {
                printf " index%d=%d ,", hitNr, hits[hitNr]
            }
            print " value=" val
        }
    }
}

$ awk -f tst.awk file
found 2 equal lines: index1=1 , index2=4 , value=saudifh
found 2 equal lines: index1=3 , index2=5 , value=sometextASLKJND
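
A note on behavior, with a made-up input of my own: because the script collects every line number per value, a value that occurs three or more times is reported on a single line:

$ printf 'a\nb\na\na\n' | awk -f tst.awk
found 3 equal lines: index1=1 , index2=3 , index3=4 , value=a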

To give you an idea of the performance difference, here is a bash script that’s written to be as efficient as possible and an equivalent awk script:

bash:

$ cat tst.sh
#!/bin/bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: bash 4.0 required" >&2; exit 1;; esac

# initialize an associative array, mapping each string to the last line it was seen on
declare -A lines=( )
lineNum=0

while IFS= read -r line; do
  (( ++lineNum ))
  if [[ ${lines[$line]} ]]; then
     printf 'Content previously seen on line %s also seen on line %s: %s\n' \
       "${lines[$line]}" "$lineNum" "$line"
  fi
  lines[$line]=$lineNum
done < "$1"

$ time ./tst.sh file100k > ou.sh
real    0m15.631s
user    0m13.806s
sys     0m1.029s

awk:

$ cat tst.awk
# if this line's content was seen before, report both line numbers
lines[$0] {
    printf "Content previously seen on line %s also seen on line %s: %s\n", \
       lines[$0], NR, $0
}
# remember the most recent line number for each distinct line
{ lines[$0]=NR }

$ time awk -f tst.awk file100k > ou.awk
real    0m0.234s
user    0m0.218s
sys     0m0.016s

There are no differences between the outputs of the two scripts:

$ diff ou.sh ou.awk
$

The timings above are from the third run of each script, to avoid caching effects, against a file generated by the following awk script:

awk 'BEGIN{for (i=1; i<=10000; i++) for (j=1; j<=10; j++) print j}' > file100k
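
That generator writes the values 1 through 10, 10000 times each, so file100k is 100000 lines with only 10 distinct values; a quick sanity check (my addition):

$ wc -l file100k
100000 file100k
$ sort -u file100k | wc -l
10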

When the input file had zero duplicate lines (generated by seq 100000 > nodups100k), the bash script executed in about the same amount of time as above, while the awk script ran much faster:

$ time ./tst.sh nodups100k > ou.sh
real    0m15.179s
user    0m13.322s
sys     0m1.278s

$ time awk -f tst.awk nodups100k > ou.awk
real    0m0.078s
user    0m0.046s
sys     0m0.015s