I am trying to write a Linux bash script that will help me generate some statistics that I need from text files. Assume the following format in the text files that I am using:
"string : pathname1 : pathname2 : pathname3 : … pathnameN"
Where pathname "i" is the full path of the file in which I found a specific string. For example, such a file could look like this:
logile.txt
string : "version" pathname1: /home/Desktop/myfile.txt pathname2 : /usr/lib/tmp/sample.txt string : "user" pathname1 : temp1/tmpfiles/user.txt pathname2 : newfile.txt pathname3 : /Downloads/myfiles/old/credentials.txt string : "admin" pathname1 : string: "build" pathname1 : Documents/projects/myproject/readme.txt pathname2 : Desktop/readmetoo.txt
In this example, I would like my bash script to inform me that I searched for a total of 4 words (version, user, admin, build) and also that the word that was found in most files was “user”, which was found in 3 files. Is using the “awk” command a good approach? I am not familiar with bash scripts, so any help would be useful! Thanks!
Answer
This is not such an easy task. It could probably be done in awk alone, but you asked about a bash script; a rough awk-only sketch is included at the end of this answer. The following bash script:
#!/bin/bash
set -euo pipefail

wordcount=0
tmp=""

# for each line in the input file read 3 fields
while read -r string name rest; do
    # in each line the first word should be equal to "string"
    if [ "$string" != string ]; then
        # if it isn't, continue to the next line
        continue
    fi
    # remove '"' from name
    name=${name//\"/}
    # the rest has the format: pathname<i> <path> ...
    # substitute every space with a newline in the rest
    # and remove the lines containing pathname<i>
    # to keep only the paths
    rest=$(echo "$rest" | tr ' ' '\n' | grep -v "pathname" ||:)
    # count the non-empty lines in rest - each path sits on its own line
    restcnt=$(grep -c . <<<"$rest" ||:)
    # save restcnt and name on a single line in the tmp string to parse later
    tmp+="$restcnt $name"$'\n'
    # increment wordcount
    wordcount=$((wordcount + 1))
# feed the while read loop with a command substitution
done < <(
    # cat stdin or the file given as a script argument
    # and substitute all colons ':' with spaces
    cat "$@" | sed 's/:/ /g'
)

# sort the tmp string from the lowest to the biggest count
# and get the last line (i.e. the biggest number)
tmp=$(echo "$tmp" | sort -n | tail -n1)
# split tmp into two variables - the count and the word
read -r mostcnt mostfile <<<"$tmp"
# and print the user information
echo "You searched for a total of $wordcount words."
echo "The word that was found in most files was: $mostfile"
echo " which was found in $mostcnt files."
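Since the loop is fed from cat "$@", the script accepts either a filename argument or standard input. A quick usage sketch (assuming it is saved as /tmp/1.sh, the name used in the run below, and made executable):

chmod +x /tmp/1.sh
/tmp/1.sh logile.txt      # pass the log file as an argument
/tmp/1.sh < logile.txt    # or feed it on standard input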
… run with the input from the following logile.txt:
string : "version" pathname1:/home/Desktop/myfile.txt pathname2 : /usr/lib/tmp/sample.txt string : "user" pathname1: temp1/tmpfiles/user.txt pathname2: newfile.txt pathname3: /Downloads/myfiles/old/credentials.txt string : "admin" pathname1 : string: "build" pathname1: Documents/projects/myproject/readme.txt pathname2:Desktop/readmetoo.txt
… produces the following result:
$ /tmp/1.sh <./logile.txt
You searched for a total of 4 words.
The word that was found in most files was: user
 which was found in 3 files.
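For reference, the awk-only route mentioned at the start of the answer could look roughly like this. It is only a sketch and not part of the original answer: it assumes the same one-record-per-line logile.txt format, and the variable names (word, count, best, bestword) are illustrative.

#!/bin/bash
# rough awk-only sketch: split each line on ':', take the quoted word
# from the second field, and count the non-empty paths that start
# every following field
awk -F: '
$1 ~ /^string/ {
    split($2, w, " ")            # the second field looks like: "user" pathname1
    word = w[1]
    gsub(/"/, "", word)          # strip the double quotes
    words++
    count = 0
    for (i = 3; i <= NF; i++) {  # every later field begins with a path (or is empty)
        split($i, p, " ")
        if (p[1] != "") count++
    }
    if (count > best) { best = count; bestword = word }
}
END {
    printf "You searched for a total of %d words.\n", words
    printf "The word that was found in most files was: %s\n", bestword
    printf " which was found in %d files.\n", best
}
' "$@"

Run it the same way as the bash script (file argument or standard input); for the example logile.txt it should print the same result.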