Counter of unique files in a directory

I ran a program many times with output that was (slightly) non-deterministic. Each time, I printed the output to a file. I now have a directory of many text files (95,034), which probably have something like 4 different unique outputs. I would love to see the output in a format like this:

 A (50,000)
 B (30,000)
 C (10,000)
 D  (5,034)

But even just seeing the contents of A, B, C, and D (the four different possible outputs) would be great. I don’t have time to manually dedupe 90,000 files. So how can I count or list the unique text files in a directory? Thanks!

Asked By: jxmorris12


Use a hash map to gather the unique files. The key is a hash of each file’s contents, so only files with unique contents get their own entry in the hash map.

declare -A unique_files
for file in *; do 
    unique_files["$(md5sum "$file" | cut -d ' ' -f 1)"]="$file"
done
echo "${unique_files[@]}"
Answered By: Hielke Walinga

You can also use sort and uniq for this. From within the folder where the files reside, enter:

find . -type f | awk '{ print "tr \"\\n\" @ < \"" $0 "\"; echo" }' | sh | sort | uniq --count

(Replace uniq --count by uniq -c if not using uniq from GNU coreutils.)

That should give you the results in one go. For simplicity and speed (avoiding hashes) we translate each file’s newlines to @, so each file becomes a single line that sort and uniq can compare; the @ could be any single character that does not appear in the original files.

(This assumes that files in sub-folders, if any, are to be included, and that no file contains the @ character. If that is not the case, please comment and I’ll adjust the command accordingly.)
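
If the file names might contain spaces or other characters special to the shell, an equivalent form that skips the generated-shell step is possible. A sketch of my own variation using find -exec (same idea otherwise):

find . -type f -exec sh -c 'for f; do tr "\n" @ < "$f"; echo; done' sh {} + | sort | uniq -c

The {} + form passes many file names to each sh invocation instead of spawning one process per file, which helps with 95,034 files.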

Answered By: Ned64

Expanding slightly on @Isaac’s solution …

Assuming bash syntax, and given:

$ find test -type f
test/AA
test/A
test/C
test/CC
test/B
test/D

where files A and AA are identical, as are C and CC.

Here is an incrementally more effective command pipeline:

$ find test -maxdepth 1 -type f -exec bash -c "md5sum < {}" \; |
    sort -k1,1 |
    uniq --count
      2 102f2ac1c3266e03728476a790bd9c11  -
      1 4c33d7f68620b7b137c0ca3385cb6597  -
      1 88178a003e2305475e754a7ec21d137d  -
      2 c7a739d5538cf472c8e87310922fc86c  -

The remaining problem now is that the md5 hashes don’t tell you which files are A, B, C or D. That can be solved, although it’s slightly fiddly.

First, move your files into a subdirectory, or move your PWD up one directory if that’s more convenient. In my example, I’m working in . and the files are in test/.

I’ll propose that you identify one file of each of the four output types and copy them to files A, B, C and D (and beyond if you need to, up to Z):

$ cp -p test/file1002 ./A
...
$ cp -p test/file93002 ./N

etc. We can now build a hash table that maps the md5 hash of each unique output file A-Z to its name:

$ for file in [A-Z]; do
      printf "s/%s/%s/\n" "$(md5sum < "$file")" "$file"
done
s/102f2ac1c3266e03728476a790bd9c11  -/A/
s/4c33d7f68620b7b137c0ca3385cb6597  -/B/
s/c7a739d5538cf472c8e87310922fc86c  -/C/
s/88178a003e2305475e754a7ec21d137d  -/D/

Notice that the hash table looks like sed syntax. Here’s why:

Let’s run the same find ... md5sum pipeline above:

$ find test -maxdepth 1 -type f -exec bash -c "md5sum < {}" \; |
    sort -k1,1 |
    uniq --count

… and pipe it through a sed process that uses the hash table above to replace the hash values with the prototype file names. The sed command on its own would be:

sed -f <(
    for file in [A-Z]; do 
        printf "s/%s/%s/n" "$(md5sum < "$file")" "$file"; 
    done
)

So to connect it all together:

$ find test -maxdepth 1 -type f -exec bash -c "md5sum < {}" \; |
    sort -k1,1 |
    uniq --count |
    sed -f <(
        for file in [A-Z]; do 
            printf "s/%s/%s/n" "$(md5sum < "$file")" "$file"; 
        done
    )
  2 A
  1 B
  1 D
  2 C

If you see output like this:

  2 A
  1 B
  1 5efa8621f70e1cad6aba9f8f4246b383  -
  1 D
  2 C

That means there is a file in test/ whose MD5 value doesn’t match any of your files A-D. In other words, there is an E output file format out there somewhere. Once you find it (md5sum test/* | grep 5efa8621f70e1cad6aba9f8f4246b383), you can copy it to E and re-run:

$ cp -p test/file09876 ./E
$ find test -maxdepth 1 -type f -exec bash -c "md5sum < {}" \; |
    sort -k1,1 |
    uniq --count |
    sed -f <(
        for file in [A-Z]; do 
            printf "s/%s/%s/n" "$(md5sum < "$file")" "$file"; 
        done
    )
  2 A
  1 B
  1 E
  1 D
  2 C
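
As an aside, the manual “copy one of each” step above could be automated by picking one representative file per distinct hash. A sketch, assuming bash, GNU coreutils, and at most 26 distinct outputs (the letters array is my own invention):

letters=( {A..Z} )
i=0
# sort -u on the hash field keeps one line per distinct content hash
md5sum test/* | sort -k1,1 -u |
while read -r hash name; do
    cp -p "$name" "./${letters[i]}"    # copy that file to A, B, C, ... in turn
    i=$((i + 1))
done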
Answered By: Jim L.

I’m a big fan of GNU datamash (https://www.gnu.org/software/datamash/). Here’s a sample output from a mocked-up set of files I created and ran this command on:

$ md5sum * | datamash -W -s -g 1 count 2 -f
5591dadf0051bee654ea41d962bc1af0    junk1   27
9c08c31b951a1a1e0c3a38effaca5863    junk2   17
f1e5cbfade7063a0c4fa5083fd36bf1a    junk3   7

There are 27 files with the hash 5591…, and one of them is “junk1”. (Similarly, there are 17 files identical to “junk2”, and 7 identical to “junk3”.)

The -W says to use whitespace as the field delimiter. The -s -g 1 says to sort and group by field 1 (the hash). The count operand could reference either field 1 or 2; it doesn’t matter, since it just counts the lines in each group.

The -f says “print the entire input line”. This has a quirk: when printing aggregated results, it only prints the full line for the first line in each group. In this case that works out fine, because it gives us one of the filenames involved in each dup-set instead of all of them.
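
To get closer to the format asked for in the question (name followed by count, largest group first), that output could be piped through sort and awk. A sketch, assuming the three-column hash / filename / count output shown above:

md5sum * | datamash -W -s -g 1 count 2 -f | sort -k3,3nr |
    awk '{ printf "%s (%d)\n", $2, $3 }'

which would print something like “junk1 (27)” on each line.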

Answered By: user339730