I have a file example.txt with about 3000 lines, each containing a string. A small example file would be:
>cat example.txt
saudifh
sometestPOIFJEJ
sometextASLKJND
saudifh
sometextASLKJND
IHFEW
foo bar
I want to find all repeated lines in this file and output them. The desired output would be:
>checkRepetitions.sh
found two equal lines: index1=1 , index2=4 , value=saudifh
found two equal lines: index1=3 , index2=5 , value=sometextASLKJND
I made a script checkRepetitions.sh:
#!bin/bash size=$(cat example.txt | wc -l) for i in $(seq 1 $size); do i_next=$((i+1)) line1=$(cat example.txt | head -n$i | tail -n1) for j in $(seq $i_next $size); do line2=$(cat example.txt | head -n$j | tail -n1) if [ "$line1" = "$line2" ]; then echo "found two equal lines: index1=$i , index2=$j , value=$line1" fi done done
However, this script is very slow; it takes more than 10 minutes to run, while in Python it takes less than 5 seconds… I tried to store the file in memory by doing lines=$(cat example.txt) and reading each line with line1=$(cat $lines | cut -d',' -f$i), but this is still very slow…
Answer
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why your script is so slow.
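If all you need are the duplicated values themselves, without their line numbers, a minimal sketch using only standard tools would be:

$ sort example.txt | uniq -d
saudifh
sometextASLKJND

That discards the index information you asked for, though. To report the indices as well, use an awk script: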
$ cat tst.awk
# map each line's value to a FS-separated list of the line numbers it appears on
{ val2hits[$0] = val2hits[$0] FS NR }
# at end of input, report every value that appeared on more than one line
END {
    for (val in val2hits) {
        numHits = split(val2hits[val],hits)
        if ( numHits > 1 ) {
            printf "found %d equal lines:", numHits
            for ( hitNr=1; hitNr<=numHits; hitNr++ ) {
                printf " index%d=%d ,", hitNr, hits[hitNr]
            }
            print " value=" val
        }
    }
}

$ awk -f tst.awk file
found 2 equal lines: index1=1 , index2=4 , value=saudifh
found 2 equal lines: index1=3 , index2=5 , value=sometextASLKJND
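A quick sanity check with a throwaway input (hypothetical, not from your file) shows it also handles values that repeat more than twice:

$ printf 'foo\nbar\nfoo\nfoo\n' | awk -f tst.awk
found 3 equal lines: index1=1 , index2=3 , index3=4 , value=foo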
To give you an idea of the performance difference between a bash script that's written to be as efficient as possible and an equivalent awk script:
bash:
$ cat tst.sh
#!/bin/bash

case $BASH_VERSION in
    ''|[123].*) echo "ERROR: bash 4.0 required" >&2; exit 1;;
esac

# initialize an associative array, mapping each string to the last line it was seen on
declare -A lines=( )
lineNum=0
while IFS= read -r line; do
    (( ++lineNum ))
    if [[ ${lines[$line]} ]]; then
        printf 'Content previously seen on line %s also seen on line %s: %s\n' "${lines[$line]}" "$lineNum" "$line"
    fi
    lines[$line]=$lineNum
done < "$1"

$ time ./tst.sh file100k > ou.sh

real    0m15.631s
user    0m13.806s
sys     0m1.029s
awk:
$ cat tst.awk
# lines[] maps each string to the last line number it was seen on;
# the pattern is true only when the string was already seen
lines[$0] { printf "Content previously seen on line %s also seen on line %s: %s\n", lines[$0], NR, $0 }
{ lines[$0]=NR }

$ time awk -f tst.awk file100k > ou.awk

real    0m0.234s
user    0m0.218s
sys     0m0.016s
The two scripts produce identical output:
$ diff ou.sh ou.awk
$
The timings above are from the 3rd run, to avoid caching effects, and were measured against a file generated by the following awk command:
awk 'BEGIN{for (i=1; i<=10000; i++) for (j=1; j<=10; j++) print j}' > file100k
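That produces 100,000 lines in which each value 1 through 10 occurs 10,000 times, i.e. a heavily duplicated input; you can confirm the size with:

$ wc -l file100k
100000 file100k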
When the input file had zero duplicate lines (generated by seq 100000 > nodups100k), the bash script executed in about the same amount of time as above, while the awk script ran much faster:
$ time ./tst.sh nodups100k > ou.sh

real    0m15.179s
user    0m13.322s
sys     0m1.278s

$ time awk -f tst.awk nodups100k > ou.awk

real    0m0.078s
user    0m0.046s
sys     0m0.015s