
Linux: what’s a fast way to find all duplicate files in a directory?

I have a directory with many subdirectories and over 7,000 files in total. I need to find all duplicates of all files. For any given file, its duplicates might be scattered around various subdirectories and may or may not have the same file name. A duplicate is any file for which the diff command returns exit code 0 when it is compared against another file.

The simplest thing to do is to run a double loop over all the files in the directory tree. But that's roughly 7000^2 sequential diffs, which is not very efficient.

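Something like this rough sketch (it assumes bash 4.4+ for mapfile -d ''; diff -q exits 0 for identical files, matching the definition above):

    # Compare every pair of files with diff: O(n^2) comparisons.
    mapfile -t -d '' files < <(find . -type f -print0)
    n=${#files[@]}
    for ((i = 0; i < n; i++)); do
        for ((j = i + 1; j < n; j++)); do
            # diff -q exits 0 when the two files are identical
            if diff -q -- "${files[i]}" "${files[j]}" >/dev/null; then
                echo "duplicate: ${files[i]} and ${files[j]}"
            fi
        done
    done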

Is there a more efficient way to do it?


Answer

On Debian 11:

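A minimal sketch of that pipeline (listing.txt and dups.txt are scratch files in the current directory; the steps are broken out below):

    # 1. Hash every file, null-safe; listing.txt gets one "sum  filename" line per file.
    find . -type f -print0 | xargs -0 md5sum > listing.txt

    # 2. Count how many times each sum occurs.
    awk '{print $1}' listing.txt | sort | uniq -c > dups.txt

    # 3. For every sum seen more than once, print its sum-and-filename lines.
    awk '$1 > 1 {print $2}' dups.txt | while read -r sum; do
        grep "^$sum" listing.txt
    done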
  • Find and print all files, null-terminated (-print0).
  • Use xargs to md5sum them.
  • Save a copy of the sums and filenames in the “listing.txt” file.
  • Grab the sum, pass it to sort and then to uniq -c to count occurrences, saving the result in the “dups.txt” file.
  • Use awk to list the sums that occur more than once, then grep to pull each sum and filename back out of the listing.
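Because a duplicate is defined here by diff's exit status, and MD5 collisions are at least theoretically possible, any suspected pair can be confirmed byte for byte (the paths below are hypothetical):

    # diff exits 0 and prints nothing when the files are identical
    diff -q ./subdir1/fileA ./subdir2/fileB && echo "confirmed duplicates"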
User contributions licensed under: CC BY-SA