
Using awk to count the number of occurrences of patterns from another file

I am trying to take a file containing a list and count how many times each item in that list occurs in a target file. Something like:

list.txt
blonde
red
black

target.txt
bob blonde male
sam blonde female

desired_output.txt
blonde 2
red 0
black 0

I have adapted the following code to count the values that are present in target.txt:

awk '{count[$2]++} END {for (word in count) print word, count[word]}' target.txt

But the output does not include the items that are in list.txt but not in target.txt:

current_output.txt
blonde 2

I have tried a few things to get this working including:

awk '{word[$1]++;next;count[$2]++} END {for (word in count) print word, count[word]}' list.txt target.txt

However, I have had no success.

Could anyone help me make this awk statement read the list.txt file as well? Any explanation of the code would also be much appreciated. Thanks!


Answer

awk '
  NR==FNR{a[$0]; next}
  {
    for(i=1; i<=NF; i++){
      if ($i in a){ a[$i]++ }
    }
  }
  END{
    for(key in a){ printf "%s %d\n", key, a[key] }
  }
' list.txt target.txt
  • NR==FNR{a[$0]; next} The condition NR==FNR is true only while the first file is being read, so every line of list.txt becomes a key of array a (with no value yet), and next skips the rest of the program for that line.

  • for(i=1; i<=NF; i++) For each line of the second file, this loops over all of its fields.

    • if ($i in a){ a[$i]++ } This checks whether the field $i is present as a key in array a. If so, the value associated with that key (initially unset, which counts as zero) is incremented.
  • At the END, we print each key followed by its number of occurrences a[key] and a newline (\n).

Output:

blonde 2
red 0
black 0
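
One caveat about the output above: awk does not specify the order in which for(key in a) visits the keys, so the lines may come out in a different order than in list.txt. If the order matters, a minimal sketch along these lines might help. It assumes the same sample files as in the question; the array order, the counter n and the file name counts.txt are names introduced here purely for illustration.

```shell
# Recreate the sample files from the question.
printf '%s\n' blonde red black > list.txt
printf '%s\n' 'bob blonde male' 'sam blonde female' > target.txt

awk '
  # First file: register each key and remember the position it appeared at.
  NR==FNR { a[$0]; order[++n] = $0; next }
  # Second file: count every field that is a registered key.
  { for (i = 1; i <= NF; i++) if ($i in a) a[$i]++ }
  # Walk the keys in list.txt order instead of relying on for-in order.
  END { for (j = 1; j <= n; j++) printf "%s %d\n", order[j], a[order[j]] }
' list.txt target.txt > counts.txt

cat counts.txt
```

The counting logic is unchanged from the answer; only the END block differs, indexing the keys by their line number in list.txt.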

Notes:

  1. Because of %d, the printf statement forces the conversion of a[key] to an integer in case it is still unset. The whole statement could be replaced by a simpler print key, a[key]+0. I missed that when writing the answer, but now you know two ways of doing the same thing. 😉

  2. In your attempt you were, for some reason, only addressing field 2 ($2), ignoring other columns.
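
For reference, the print variant mentioned in note 1 could look roughly like the sketch below, again using the sample files from the question. The pipe through sort and the file name counts_sorted.txt are additions made here only to get a stable, checkable order; they are not part of the original answer.

```shell
# Recreate the sample files from the question.
printf '%s\n' blonde red black > list.txt
printf '%s\n' 'bob blonde male' 'sam blonde female' > target.txt

awk '
  # First file: register each line as a key with no value yet.
  NR==FNR { a[$0]; next }
  # Second file: count every field that is a registered key.
  { for (i = 1; i <= NF; i++) if ($i in a) a[$i]++ }
  # Adding 0 forces a number, so keys that never matched print 0, not an empty string.
  END { for (key in a) print key, a[key]+0 }
' list.txt target.txt | sort > counts_sorted.txt

cat counts_sorted.txt
```

Without the +0, a key that never matched would print as the key followed by an empty value, because an unset awk variable is an empty string in string context.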

User contributions licensed under: CC BY-SA