How to efficiently loop through the lines of a file in Bash?

Question

I have a file, example.txt, with about 3000 lines and one string on each line. A small file example would be: I want to find all repeated lines in this file and output them. The desired output would be: I made a script, checkRepetions.sh: However, this script is very slow; it takes more than 10 minutes to run. In python

Accepted Answer

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why your script is so slow.

    $ cat tst.awk
    { val2hits[$0] = val2hits[$0] FS NR }
    END {
        for (val in val2hits) {
            numHits = split(val2hits[val],hits)
            if ( numHits > 1 ) {
                printf "found %d equal lines:", numHits
                for ( hitNr=1; hitNr<=numHits; hitNr++ ) {
                    printf " index%d=%d ,", hitNr, hits[hitNr]
                }
                print " value=" val
            }
        }
    }

    $ awk -f tst.awk file
    found 2 equal lines: index1=1 , index2=4 , value=saudifh
    found 2 equal lines: index1=3 , index2=5 , value=sometextASLKJND

To give you an idea of the performance difference using a bash script that’s written to be as efficient as possible and an equivalent awk script:

bash:

    $ cat tst.sh
    #!/bin/bash
    case $BASH_VERSION in ''|[123].*) echo "ERROR: bash 4.0 required" >&2; exit 1;; esac

    # initialize an associative array, mapping each string to the last line it was seen on
    declare -A lines=( )
    lineNum=0

    while IFS= read -r line; do
      (( ++lineNum ))
      if [[ ${lines[$line]} ]]; then
        printf 'Content previously seen on line %s also seen on line %s: %s\n' "${lines[$line]}" "$lineNum" "$line"
      fi
      lines[$line]=$lineNum
    done < "$1"

    $ time ./tst.sh file100k > ou.sh
    real    0m15.631s
    user    0m13.806s
    sys     0m1.029s

awk:

    $ cat tst.awk
    lines[$0] {
        printf "Content previously seen on line %s also seen on line %s: %s\n", lines[$0], NR, $0
    }
    { lines[$0]=NR }

    $ time awk -f tst.awk file100k > ou.awk
    real    0m0.234s
    user    0m0.218s
    sys     0m0.016s

There are no differences in the output of both scripts:

    $ diff ou.sh ou.awk
    $

The above is using 3rd-run timing to avoid caching issues and being tested against a file generated by the following awk script:

    awk 'BEGIN{for (i=1; i<=10000; i++) for (j=1; j<=10; j++) print j}' > file100k

When the input file had zero duplicate lines (generated by seq 100000 > nodups100k) the bash script executed in about the same amount of time as it did above while the awk script executed much faster than it did above:

    $ time ./tst.sh nodups100k > ou.sh
    real    0m15.179s
    user    0m13.322s
    sys     0m1.278s

    $ time awk -f tst.awk nodups100k > ou.awk
    real    0m0.078s
    user    0m0.046s
    sys     0m0.015s
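
For anyone who wants to try the single-pass idea on a tiny input first, here is a minimal sketch along the same lines as the awk scripts above. The function name find_dups, the output format, and the sample data are assumptions made up for this illustration; they are not part of the original answer. The trailing sort is there only because awk does not guarantee any particular order for a for-in loop.

```shell
#!/bin/bash
# Hypothetical helper (not from the answer above): print every line that
# occurs more than once in the file "$1", followed by the line numbers
# on which it occurs.
find_dups() {
    awk '
        {
            # count[] tracks how often each line was seen;
            # where[] accumulates a comma-separated list of line numbers.
            if (count[$0]++) where[$0] = where[$0] "," NR
            else             where[$0] = NR
        }
        END {
            for (val in count)
                if (count[val] > 1)
                    printf "%s: lines %s\n", val, where[val]
        }
    ' "$1" | sort
}

# Made-up sample data, for illustration only.
sample=$(mktemp)
printf '%s\n' alpha beta gamma alpha gamma > "$sample"
find_dups "$sample"
rm -f "$sample"
```

The file is read exactly once, and all bookkeeping happens inside a single awk process, which is the main reason this style tends to be much faster than a shell read loop on large inputs.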

Answer