Find duplicate files

Is it possible to find duplicate files on my disk which are bit to bit identical but have different file-names?

Asked By: student


fdupes can do this. From man fdupes:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.

To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.

As asked in the comments, you can get the largest duplicates by doing the following:

fdupes -r . | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"
    done
} | sort -n

This will break if your filenames contain newlines.

Answered By: Chris Down

Short answer: yes.

Longer version: have a look at the Wikipedia fdupes entry; it sports quite a nice list of ready-made solutions. Of course you can write your own, it’s not that difficult: standard tools like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.
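
For instance, a rough one-liner along those lines might look like the following (a sketch only, assuming GNU coreutils; it checksums every file under the current directory, then prints groups of lines sharing the same 64-character sha256sum, separated by blank lines):

find . -type f -exec sha256sum {} + | sort | uniq -w64 --all-repeated=separate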

Answered By: peterph

Another good tool is fslint:

fslint is a toolset to find various problems with filesystems,
including duplicate files and problematic filenames
etc.

Individual command line tools are available in addition to the GUI; to access them, one can change to, or add to
$PATH, the /usr/share/fslint/fslint directory on a standard install. Each of the commands in that directory has a
--help option which further details its parameters.

   findup - find DUPlicate files
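
For example, once fslint is installed, a duplicate scan of a directory tree might look like this (a sketch; /path/to/search is a placeholder for the directory you want to check):

/usr/share/fslint/fslint/findup /path/to/search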

On Debian-based systems, you can install it with:

sudo apt-get install fslint

You can also do this manually if you don’t want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:

find / -type f -exec md5sum {} \; > md5sums
awk '{print $1}' md5sums | sort | uniq -d > dupes
while read -r d; do echo "---"; grep -- "$d" md5sums | cut -d ' ' -f 2-; done < dupes 

Sample output (the file names in this example are the same, but it will also work when they are different):

$ while read -r d; do echo "---"; grep -- "$d" md5sums | cut -d ' ' -f 2-; done < dupes 
---
 /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
 /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
---
 /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
 /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
---
 /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
 /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
---

This will be much slower than the dedicated tools already mentioned, but it will work.

Answered By: terdon

If you believe a hash function (here MD5) is collision-free on your domain:

find $target -type f -exec md5sum '{}' + | sort | uniq --all-repeated --check-chars=32 \
 | cut --characters=35-

Want identical file names grouped? Write a simple script not_uniq.sh to format output:

#!/bin/bash

last_checksum=0
while read -r line; do
    checksum=${line:0:32}
    filename=${line:34}
    if [ "$checksum" == "$last_checksum" ]; then
        if [ "${last_filename:-0}" != '0' ]; then
            echo "$last_filename"
            unset last_filename
        fi
        echo "$filename"
    else
        if [ "${last_filename:-0}" == '0' ]; then
            echo "======="
        fi
        last_filename=$filename
    fi

    last_checksum=$checksum
done

Then change the find command to use your script:

chmod +x not_uniq.sh
find $target -type f -exec md5sum '{}' + | sort | not_uniq.sh

This is the basic idea. You will probably need to adjust the find command if your file names contain certain characters (e.g. spaces).
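
As a sketch of that adjustment (assuming GNU md5sum, whose default output is the 32-character hash, two spaces, then the file name), the grouping can also be done directly in awk so that names containing spaces pass through untouched; names containing newlines are still not handled:

find "$target" -type f -exec md5sum '{}' + | sort | awk '
    {
        hash = $1               # first field: the 32-character checksum
        name = substr($0, 35)   # everything after "hash  ": the file name, spaces included
    }
    hash == prev_hash {         # same checksum as the previous (sorted) line: a duplicate group
        if (!printed) { print "======="; print prev_name; printed = 1 }
        print name
        next
    }
    { prev_hash = hash; prev_name = name; printed = 0 }
'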

Answered By: reith

Wikipedia once had an article with a list of available open source software for this task, but it’s now been deleted.

I will add that the GUI version of fslint is very interesting, allowing you to use a mask to select which files to delete – very useful for cleaning up duplicated photos.

On Linux you can use:

- FSLint: http://www.pixelbeat.org/fslint/

- FDupes: https://en.wikipedia.org/wiki/Fdupes

- DupeGuru: https://www.hardcoded.net/dupeguru/

- Czkawka: https://qarmin.github.io/czkawka/

FDupes and DupeGuru work on many systems (Windows, Mac and Linux). I’ve not checked FSLint or Czkawka.

Answered By: MordicusEtCubitus

Here’s my take on that:

find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
  echo -n '.'
  if grep -q "$i" md5-partial.txt; then echo -e "n$i  ---- Already counted, skipping."; continue; fi
  MD5=`dd bs=1M count=1 if="$i" status=noxfer | md5sum`
  MD5=`echo $MD5 | cut -d' ' -f1`
  if grep "$MD5" md5-partial.txt; then echo "n$i  ----   Possible duplicate"; fi
  echo $MD5 $i >> md5-partial.txt
done

It’s different in that it only hashes up to the first 1 MB of each file.
This has a few issues / features:

  • There might be a difference after the first 1 MB, so the result is rather a candidate to check. I might fix that later.
  • Checking by file size first could speed this up (see the sketch below).
  • It only takes files larger than 3 MB.

I use it to compare video clips so this is enough for me.
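
A minimal sketch of the size-prefilter idea from the list above (assuming GNU find, awk, sort and uniq, and file names without tabs or newlines; md5-sizes.txt is just a hypothetical scratch file): only files that share their exact size with at least one other file get their first 1 MB hashed.

# Pass 1: record size and path of every candidate file.
find . -type f -size +3M -printf '%s\t%p\n' > md5-sizes.txt

# Pass 2: hash the first 1 MB of files whose size occurs more than once,
# then group identical partial hashes.
awk -F'\t' 'NR==FNR { n[$1]++; next } n[$1] > 1 { print $2 }' md5-sizes.txt md5-sizes.txt |
while IFS= read -r f; do
    printf '%s  %s\n' "$(dd if="$f" bs=1M count=1 2>/dev/null | md5sum | cut -d' ' -f1)" "$f"
done | sort | uniq -w32 --all-repeated=separate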

Answered By: Ondra Žižka

I thought I would add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature-rich than fdupes (e.g. a size filter):

jdupes . -rS -X size-:50m > myjdups.txt

This will recursively find duplicate files bigger than 50 MB in the current directory and output the resulting list to myjdups.txt.

Note that the output is not sorted by size and, since this appears not to be built in, I have adapted @Chris_Down's answer above to achieve this:

jdupes -r . -X size-:50m | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"
    done
} | sort -n > myjdups_sorted.txt
Answered By: Sebastian Müller

I realize this is necro, but it is highly relevant. I had asked a similar question on Find duplicate files based on first few characters of filename, and what was presented was a solution using an awk script.

I use it for mod conflict cleanup, which is useful in Forge packs 1.14.4+ because Forge now disables mods that are older instead of FATAL-crashing and letting you know about the duplicate.

#!/bin/bash

declare -a names

xIFS="${IFS}"
IFS="^M"

while true; do
awk -F'[-_ ]' '
    NR==FNR {seen[tolower($1)]++; next}
    seen[tolower($1)] > 1
' <(printf "%s\n" *.jar) <(printf "%s\n" *.jar) > tmp.dat

        IDX=0
        names=()


        readarray names < tmp.dat

        size=${#names[@]}

        clear
        printf '\nPossible Dupes\n'

        for (( i=0; i<${size}; i++)); do
                printf '%s\t%s' "${i}" "${names[i]}"
        done

        printf '\nWhich dupe would you like to delete?\nEnter # to delete or q to quit\n'
        read n

        if [ "$n" == 'q' ]; then
                exit
        fi

        if [ "$n" -lt 0 ] || [ "$n" -ge "$size" ]; then
                read -p "Invalid Option: press [ENTER] to try again" dummyvar
                continue
        fi

        #clean the carriage return and trailing newline from the name
        IFS=$'\r'
        read -ra TARGET <<< "${names[$n]}"
        unset IFS

        #now remove the first element from the filesystem
        rm "${TARGET[0]}" 
        echo "removed ${TARGET[0]}" >> rm.log
done

IFS="${xIFS}"

I recommend saving it as "dupes.sh" to your personal bin or /usr/var/bin
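
For example (assuming the script is saved on your PATH as dupes.sh and the mods live in a hypothetical ~/minecraft/mods directory), run it from inside the directory whose .jar files you want to check:

cd ~/minecraft/mods
dupes.sh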

Answered By: Kreezxil

I had a situation where I was working in an environment where I couldn’t install new software, and had to scan >380 GB of JPG and MOV files for duplicates. I developed the following POSIX awk script to process all of the data in 72 seconds (as opposed to the find -exec md5sum approach, which took over 90 minutes to run):

https://github.com/taltman/scripts/blob/master/unix_utils/find-dupes.awk

You call it as follows:

ls -lTR | awk -f find-dupes.awk

It was developed on a FreeBSD shell environment, so might need some tweaks to work optimally in a GNU/Linux shell environment.

Answered By: taltman

You can find duplicate with this command:

find . ! -empty -type f -print0 | xargs -0 -P"$(nproc)" -I{} md5sum "{}" | sort | uniq -w32 -dD
Answered By: Bensuperpc

On Linux, you can use the following tools to find duplicate files.

https://github.com/adrianlopezroche/fdupes
https://github.com/arsenetar/dupeguru
https://github.com/jbruchon/jdupes
https://github.com/pauldreik/rdfind
https://github.com/pixelb/fslint (Python 2.x)
https://github.com/sahib/rmlint
https://github.com/qarmin/czkawka (Rust)
Answered By: Akhil

For free open-source Linux duplicate file finders, there is a new kid on the block and it’s even in a trendy language, Rust.

It is called Czkawka (which apparently means hiccup).

So it does have an unpronounceable name unless you speak Polish.

It is based very much on some of the ideas in FSlint (which can now be difficult to make work as it is no longer maintained and uses the now deprecated Python 2.x).

Czkawka has both GUI and CLI versions and is reported to be faster than FSlint and Fdupes.
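
If you want to try the command-line version, an invocation might look roughly like the following (a sketch only; the czkawka_cli binary name and the dup subcommand with its --directories option are assumptions, so check czkawka_cli --help for the exact syntax of your version):

czkawka_cli dup --directories /home/user/Pictures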

There is also a GitHub repo for those who want to fork it just to change the name.

Answered By: Jay M

With POSIX find and awk, plus GNU stat and md5sum:

Advantages:

  1. It handles filenames with whitespace or any other special characters.
  2. It only calls the external md5sum command for files that have the same size, so...
  3. ...it reports the duplicated files based on their md5sum checksums at the end, and finally...
  4. ...it generates NUL-delimited output of each duplicate's size in bytes, checksum and path, so the result can easily be post-processed if needed (see the note after the script).
  5. It should be reasonably fast.

find . -type f -exec stat --printf='%s/%n\0' {} + |
awk '
BEGIN{
        FS = "/"
        RS = ORS = ""
        q = "47"
        md5_cmd = "md5sum"
}

{
    #get the file path from the size/path record reported by the stat command,
    #delimited by the first slash character.
    filePath = substr($0, index($0, "/") +1)

    #record and group filePaths having the same fileSize, NUL-delimited
    sizes[$1] = ($1 in sizes? sizes[$1] : "") filePath ORS
}

END {
    for (size in sizes) {

        #split the filePaths of each same-size group in order to
        #calculate their checksums as the final confirmation of whether
        #there are any duplicate files among the same-sized files
        filesNr = split(sizes[size], filesName, ORS)

        #call md5sum only if the group contains at least two files of the same size
        #(the split above leaves a trailing empty element, hence filesNr > 2).
        if (filesNr > 2) {
            for (i = 1; i < filesNr; i++) {
                if ((md5_cmd " " q filesName[i] q) | getline md5 >0) {
                    
                    #split to extract the hash of a file
                    split(md5, hash, " ")

                    #remove the leading back-slash that md5sum prepends to the hash when a
                    #fileName contains a back-slash char. see https://unix.stackexchange.com/q/424628/72456
                    sub(/\\/, "", hash[1])

                    #records all the same sized filesPath along with their hash, again NULL delimited
                    hashes[hash[1]] = (hash[1] in hashes? hashes[hash[1]] : "") filesName[i] ORS

                    #record also the size of files with hash used as key mapping
                    fileSize[hash[1]] = size
                }
            }
        }
    }
    for (fileName in hashes) {

        #process the hashes of the same-sized filePaths to verify whether any
        #hash is shared by more than one file.
        #here the hash is the key and the filePaths are the values of the hashes[] array.
        filesNr = split(hashes[fileName], filesName, ORS)

        #OK, if a hash is shared by at least two files, then we found duplicates: print the size, hash and paths.
        if (filesNr > 2) {
            print  fileSize[fileName] " bytes, MD5: " fileName
            for(i=1; i < filesNr; i++)
                print filesName[i]
        }
    }
}'
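
Since the output records are NUL-delimited, one simple way to read them in a terminal (assuming GNU tr) is to append the following to the end of the pipeline above, converting the NUL delimiters to newlines:

 | tr '\0' '\n'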
Answered By: αғsнιη