Bash script that prints statistics about text files [closed]

I am trying to write a Linux bash script that will help me generate some statistics that I need from text files. Assume the following format in the text files that I am using:

"string : pathname1 : pathname2 : pathname3 : … pathnameN"

Where pathname "i" is the full path of the file in which I found a specific string. For example, such a file could look like this:


string : "version" pathname1: /home/Desktop/myfile.txt pathname2 : /usr/lib/tmp/sample.txt 

string : "user" pathname1 : temp1/tmpfiles/user.txt  pathname2 : newfile.txt pathname3 : /Downloads/myfiles/old/credentials.txt

string : "admin" pathname1 : 

string: "build" pathname1 : Documents/projects/myproject/readme.txt pathname2 : Desktop/readmetoo.txt

In this example, I would like my bash script to inform me that I searched for a total of 4 words (version, user, admin, build) and that the word found in the most files was “user”, which was found in 3 files. Is using the “awk” command a good approach? I am not familiar with bash scripts, so any help would be useful! Thanks!



This is not such an easy task. It can probably be done in awk alone, but you asked for a bash script. The following bash script:

#!/bin/bash
set -euo pipefail

tmp=""
wordcount=0

# for each line of input, read 3 fields
while read -r string name rest; do
    # in each line the first word should be equal to "string"
    if [ "$string" != string ]; then
        # if it isn't, continue to the next line
        continue
    fi
    # remove '"' from name
    name=${name//\"/}
    # the rest has the format: pathname<i> <path> ...
    # put every word of the rest on its own line
    # (tr -s squeezes repeated spaces, so no empty lines appear)
    # and remove the lines containing pathname<i>
    # to keep only the paths in the rest
    rest=$(echo "$rest" | tr -s ' ' '\n' | grep -v "pathname" || :)
    # count the non-empty lines in rest, one path per line
    # (grep -c prints 0 and fails when there are no paths, hence the || :)
    restcnt=$(grep -c . <<<"$rest" || :)
    # save restcnt and name on a single line in tmp to parse later
    tmp+="$restcnt $name"$'\n'
    # increment wordcount
    wordcount=$((wordcount + 1))

# feed the while read loop from a process substitution
done < <(
    # cat stdin or the files given as script arguments
    # and substitute every colon ':' with a space
    cat "$@" | sed 's/:/ /g'
)

# sort the tmp string numerically from lowest to highest
# and get the last line (i.e. the biggest number)
tmp=$(echo "$tmp" | sort -n | tail -n1)
# split tmp into two variables - the count and the word
read -r mostcnt mostfile <<<"$tmp"

# and print user information
echo "You searched for a total of $wordcount words."
echo "The word that was found in most files was: $mostfile"
echo " which was found in $mostcnt files."

… run with the input from the following logile.txt

string : "version" pathname1:/home/Desktop/myfile.txt pathname2 : /usr/lib/tmp/sample.txt

string : "user" pathname1: temp1/tmpfiles/user.txt pathname2: newfile.txt pathname3: /Downloads/myfiles/old/credentials.txt

string : "admin" pathname1 :

string: "build" pathname1: Documents/projects/myproject/readme.txt pathname2:Desktop/readmetoo.txt

… produces the following result:

$ /tmp/ <./logile.txt 
You searched for a total of 4 words.
The word that was found in most files was: user
 which was found in 3 files.
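
As noted at the top of this answer, the same statistics can probably be produced in awk alone. The following is only a minimal sketch of that idea, not a tested replacement for the script above. The function name count_paths is hypothetical, and the sketch assumes, like the bash version, that paths contain neither spaces nor colons and that every label has the form pathname followed by digits.

```shell
#!/bin/bash
# count_paths (hypothetical name): reads the log from stdin or from the
# files given as arguments and prints the same three lines as the script above.
# Assumption: paths contain no spaces and no colons.
count_paths() {
    awk '
    BEGIN { max = -1 }
    {
        # turn every colon into a space; awk then re-splits the fields
        gsub(/:/, " ")
    }
    $1 == "string" {
        # second field is the quoted search word; strip the quotes
        name = $2
        gsub(/"/, "", name)
        # every remaining field that is not a pathname<i> label is a path
        n = 0
        for (i = 3; i <= NF; i++)
            if ($i !~ /^pathname[0-9]*$/)
                n++
        words++
        # remember the word with the highest path count seen so far
        if (n > max) { max = n; best = name }
    }
    END {
        printf "You searched for a total of %d words.\n", words
        printf "The word that was found in most files was: %s\n", best
        printf " which was found in %d files.\n", max
    }' "$@"
}

# usage: count_paths logile.txt    or    count_paths < logile.txt
```

With the sample input shown above, this should report 4 words and "user" with 3 files. On a tie, the sketch keeps the first word it saw, which may differ from the sort-based choice in the bash script.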
