
Search for a mention of each filename contained in a directory, in every file in the directory

I’m trying to perform a search in my directory, to count the total number of times each individual file is referenced within the contents of all the files in the directory.

Essentially, I'm trying to do more efficiently what I'm currently doing by hand: copying and pasting each filename into a 'search in this folder' box, which is slow with around 400 files. As output, I think the most useful format would be a list of each search term (filename) together with the number of unique files it occurs in. I'm most interested in the files with no occurrences, as those are likely redundant and can probably be deleted.

My current thinking is to save the list of filenames to a file called searchterms and use grep -r -f searchterms to find all the occurrences. I haven't had much luck with this so far, though: adding -c just lists each file with a count, not the search term that matched.
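
The sort of command I've been trying looks roughly like this (just a sketch; with -c, grep reports the number of matching lines per searched file, so it never says which search term was found):

ls > searchterms                 # one filename per line, ~400 entries
grep -r -c -f searchterms .      # output is file:count, with no per-term breakdown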

Thanks in advance!

Example of usage (each filename is followed by that file's contents):

file1
include file3
include file3

file2
content

file3
content

file4
include file3

Search terms would be: file1, file2, file3, file4.

Returned output (in some similar form):
file1: occurs in 0 files
file2: occurs in 0 files
file3: occurs in 2 files
file4: occurs in 0 files


Answer

for f1 in *; do cnt=0; for f2 in *; do grep -qw "$f1" "$f2" && ((++cnt)); done; echo "$cnt $f1"; done 
1 abc-file
0 abc.lst
1 abc0-file
1 abc_-file
0 def-file
0 fixedlen
0 num1000000
0 num128
0 num30000
0 num8
0 num_%header
0 par-test.sh
0 tsv-file.tsv

Human readable:

for f1 in *                                  # every filename is also a search term
do
    cnt=0
    for f2 in *                              # check every file in the directory
    do
       grep -qw "$f1" "$f2" && ((++cnt))     # -q: quiet, exit on first hit; -w: whole-word match
    done
    echo "$cnt $f1"                          # hit count first, then the search term
done

Putting the hit counter first in the output makes for a simpler sort -n command. For high numbers of matches (>9), printf would help provide a clear tabular format.
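
For example, an illustrative variation (not part of the test run above) that uses printf for alignment and pipes the whole loop into sort -n:

for f1 in *
do
    cnt=0
    for f2 in *
    do
       grep -qw "$f1" "$f2" && ((++cnt))
    done
    printf '%4d %s\n' "$cnt" "$f1"           # fixed-width counter keeps columns aligned past 9 hits
done | sort -n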

grep's -m 1 stops searching after the first hit, but that is already implied by -q. To avoid matching file31 when looking for file3, -w is used. For all the misses, the files get searched from beginning to end over and over again; depending on the number of files, this can take a significant amount of time, making better optimization necessary.
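
One possible way to reduce the overhead, sketched here as an assumption rather than a benchmarked solution, is to run a single grep -l per search term over all files at once and count the file names it prints, instead of one grep per file pair:

for f1 in *
do
    # -l prints each matching file name at most once, so wc -l gives the
    # number of files containing a whole-word match for $f1
    cnt=$(grep -lw -- "$f1" * | wc -l)
    echo "$cnt $f1"
done | sort -n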
