I have a directory with many subdirectories and about 7,000 files in total. I need to find all duplicates of all files. For any given file, its duplicates might be scattered around various subdirectories and may or may not have the same file name. A duplicate is a file for which the diff command returns exit code 0.
The simplest thing to do is to run a double loop over all the files in the directory tree. But that’s 7000^2 sequential diffs and not very efficient:
for f in `find /path/to/root/folder -type f`
do
    for g in `find /path/to/root/folder -type f`
    do
        # don't compare a file with itself
        if [ "$f" = "$g" ]
        then
            continue
        fi
        diff "$f" "$g" > /dev/null
        if [ $? -eq 0 ]
        then
            echo "$f" MATCHES "$g"
        fi
    done
done
Is there a more efficient way to do it?
Answer
On Debian 11:
% mkdir files; (cd files; echo "one" > 1; echo "two" > 2a; cp 2a 2b)
% find files/ -type f -print0 | xargs -0 md5sum | tee listing.txt |
awk '{print $1}' | sort | uniq -c | awk '$1>1 {print $2}' > dups.txt
% grep -f dups.txt listing.txt
c193497a1a06b2c72230e6146ff47080 files/2a
c193497a1a06b2c72230e6146ff47080 files/2b
- Find and print all files null-terminated (-print0).
- Use xargs to md5sum them.
- Save a copy of the sums and filenames in the "listing.txt" file.
- Grab the sum from each line and pass it to sort, then to uniq -c to count occurrences.
- Use awk to keep only the sums that appear more than once, saving them into the "dups.txt" file.
- Finally, grep for those sums in "listing.txt" to recover the filenames.
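If you would rather see each set of duplicates printed together as a group, here is a minimal follow-up sketch, assuming GNU uniq from coreutils (as shipped on Debian 11) and the listing.txt produced above; 32 is the length of an MD5 hex digest:
% sort listing.txt | uniq -w32 --all-repeated=separate
c193497a1a06b2c72230e6146ff47080 files/2a
c193497a1a06b2c72230e6146ff47080 files/2b
Here uniq -w32 compares only the first 32 characters of each line (the checksum), and --all-repeated=separate prints every member of each duplicate group with a blank line between groups. Since the question defines a duplicate by diff's exit code, a strict final pass could run cmp -s (or diff with output discarded) on each candidate pair to confirm byte-for-byte equality, guarding against the unlikely case of an MD5 collision.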