Is there an easy way to replace duplicate files with hardlinks?

I’m looking for an easy way (a command or series of commands, probably involving find) to find duplicate files in two directories, and replace the files in one directory with hardlinks of the files in the other directory.

Here’s the situation: This is a file server which multiple people store audio files on, each user having their own folder. Sometimes multiple people have copies of the exact same audio files. Right now, these are duplicates. I’d like to make it so they’re hardlinks, to save hard drive space.

Asked By: Josh


To find duplicate files you can use duff.

Duff is a Unix command-line utility for quickly finding duplicates in a given set of files.

Simply run:

duff -r target-folder

To create hardlinks to those files automatically, you will need to parse the output of duff with bash or some other scripting language.
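For example, a minimal bash sketch could keep the first file of each cluster and hardlink the rest to it. This assumes duff's default output, where each cluster of duplicates is introduced by a header line containing "files in cluster" followed by the member file names; verify the header format against your duff version.

    #!/bin/bash
    # Sketch only: parses duff's cluster output and hardlinks duplicates.
    first=""
    duff -r target-folder | while IFS= read -r line; do
        case "$line" in
            *"files in cluster"*) first="" ;;   # header line: a new group starts
            *) if [ -z "$first" ]; then
                   first="$line"                # keep the first copy of the group
               else
                   ln -f "$first" "$line"       # replace the rest with hardlinks
               fi ;;
        esac
    done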

Answered By: Stefan

Use the fdupes tool:

fdupes -r /path/to/folder gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:


filename1
filename2

filename3
filename4
filename5


with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.
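If your fdupes build has no hardlinking option of its own, a small bash sketch along these lines (my own, based on the blank-line-separated output shown above) can link every file in a group to the group's first file:

    #!/bin/bash
    first=""
    fdupes -r /path/to/folder | while IFS= read -r file; do
        if [ -z "$file" ]; then
            first=""                  # blank line: next group of duplicates
        elif [ -z "$first" ]; then
            first="$file"             # first file of the group is kept
        else
            ln -f "$first" "$file"    # remaining files become hardlinks to it
        fi
    done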

Answered By: tante

There is a Perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:

Traverse all directories named on the command line, compute MD5 checksums and find files with identical MD5. If they are equal, do a real comparison; if they are really equal, replace the second of two files with a hard link to the first one.
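The "real comparison" step after the MD5 match is the part most quick scripts skip; in shell it amounts to something like this sketch, where $first and $candidate are hypothetical variables holding the two matching paths:

    # Only hardlink if the two files are byte-for-byte identical,
    # not merely sharing an MD5 checksum
    if cmp -s "$first" "$candidate"; then
        ln -f "$first" "$candidate"
    fi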

Answered By: fschmitt

Since your main goal is to save disk space, there is another solution: de-duplication (and probably compression) at the file-system level. Compared with the hard-link solution, it does not have the problem of inadvertently affecting other linked files.

ZFS has had dedup (block-level, not file-level) since pool version 23, and compression for much longer.
If you are using Linux, you can try zfs-fuse; if you use BSD, it is natively supported.
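For example (a sketch assuming a dataset named tank/audio; adjust the pool and dataset names to your setup):

    # Enable block-level deduplication and compression on an existing dataset
    zfs set dedup=on tank/audio
    zfs set compression=on tank/audio

    # The pool's dedup ratio shows how much space this is saving
    zpool list tank

Note that only data written after enabling dedup is deduplicated, so existing duplicates need to be rewritten to benefit.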

Answered By: Wei-Yin

This is one of the functions provided by “fslint” —
http://en.flossmanuals.net/FSlint/Introduction

Click the “Merge” button:

Screenshot

Answered By: LJ Wobker

I made a Perl script that does something similar to what you’re talking about:

http://pastebin.com/U7mFHZU7

Basically, it just traverses a directory, computing the SHA-1 sum of each file, storing the sums in a hash, and linking matches together. It’s come in handy on many, many occasions.
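In case the pastebin link ever goes away: the same idea can be sketched in a few lines of bash (my approximation, not the linked script; assumes bash 4+ for associative arrays and GNU coreutils):

    #!/bin/bash
    # Index files by SHA-1 sum; hardlink any file whose sum was already seen
    # to the first file that had that sum.
    declare -A seen
    while IFS= read -r -d '' f; do
        sum=$(sha1sum "$f" | cut -d' ' -f1)
        if [ -n "${seen[$sum]}" ]; then
            ln -f "${seen[$sum]}" "$f"
        else
            seen[$sum]="$f"
        fi
    done < <(find /path/to/dir -type f -print0)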

Answered By: amphetamachine

I’ve used many of the hardlinking tools for Linux mentioned here.
I too am stuck with an ext4 filesystem on Ubuntu, and have been using cp -l and cp -s for hard/soft linking. But lately I noticed the lightweight copy option in the cp man page, which implies sparing the redundant disk space until one side gets modified:

   --reflink[=WHEN]
          control clone/CoW copies. See below

   When --reflink[=always] is specified, perform a lightweight copy, where
   the data blocks are copied only when modified.  If this is not possible
   the copy fails, or if --reflink=auto is specified, fall back to a
   standard copy.
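In practice that is just the following (file names are placeholders; note that this needs a filesystem with reflink support, such as btrfs or XFS):

    # Lightweight copy: the two names share data blocks until one is modified
    cp --reflink=auto original.flac copy.flac
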
Answered By: Marcos

Since I’m not a fan of Perl, here’s a bash version:

#!/bin/bash

DIR="/path/to/big/files"

# Checksum every file and sort, so identical files end up on adjacent lines
find "$DIR" -type f -exec md5sum {} \; | sort > /tmp/sums-sorted.txt

OLDSUM=""
IFS=$'\n'
for i in `cat /tmp/sums-sorted.txt`; do
 NEWSUM=`echo "$i" | sed 's/ .*//'`          # checksum field
 NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`    # file name field
 if [ "$OLDSUM" == "$NEWSUM" ]; then
  # Same checksum as the previous file: print the ln command
  # (remove the echo to actually create the hardlink)
  echo ln -f "$OLDFILE" "$NEWFILE"
 else
  OLDSUM="$NEWSUM"
  OLDFILE="$NEWFILE"
 fi
done

This finds all files with the same checksum (whether they’re big, small, or already hardlinks) and prints the ln commands that would hardlink them together; remove the echo once you’re happy with the output.

This can be greatly optimized for repeated runs with additional find flags (e.g. size) and a file cache (so you don’t have to redo the checksums each time). If anyone’s interested in the smarter, longer version, I can post it.

NOTE: As has been mentioned before, hardlinks only work as long as the files never need modification and never need to be moved across filesystems.

Answered By: seren

It seems to me that checking the filename first could speed things up. If two files don’t have the same filename, then in many cases I would not consider them to be duplicates. It seems that the quickest method would be to compare, in order:

  • filename
  • size
  • md5 checksum
  • byte contents

Do any methods do this? Look at duff, fdupes, rmlint, fslint, etc.

The following method was top-voted on commandlinefu.com: Find Duplicate Files (based on size first, then MD5 hash)

Can filename comparison be added as a first step, size as a second step?

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
  xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum |
  sort | uniq -w32 --all-repeated=separate
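As a rough sketch of the filename-first idea (an assumption on my part, using GNU find's %f directive to print only the basename), you could first collect the basenames that occur more than once and restrict the size/MD5 pipeline above to those files:

    # Step 0: basenames that occur more than once are the only candidates
    find . -type f -printf "%f\n" | sort | uniq -d > /tmp/dup-names.txt
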
Answered By: johny why

If you want to replace duplicates with hard links on Mac or any UNIX-based system, you can try SmartDupe (http://sourceforge.net/projects/smartdupe/); I am developing it.

Answered By: islam

rdfind does exactly what you ask for (and in the order johny why lists). It can delete duplicates or replace them with either soft or hard links; if you choose symlinks, you can also make them absolute or relative. You can even pick the checksum algorithm (sha256, md5, or sha1).

Since it is compiled, it is faster than most scripted solutions: timing it on a 15 GiB folder with 2600 files on my 2009 Mac Mini returns this

9.99s user 3.61s system 66% cpu 20.543 total

(using md5).

It is available in most package managers (e.g. MacPorts for Mac OS X).


Edit: I can add that rdfind is really easy to use and very pedagogic. Just use the -dryrun true flag and it will be very intuitive, not scary (which, IMO, tools that delete files usually are).
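Typical invocations look something like this (the paths are placeholders):

    # Preview what rdfind would do, without touching anything
    rdfind -dryrun true /path/to/user1 /path/to/user2

    # Replace duplicates with hardlinks; files in the first-listed
    # directory are treated as the originals
    rdfind -makehardlinks true /path/to/user1 /path/to/user2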

Answered By: d-b
apt show hardlink

Description: Hardlinks multiple copies of the same file
Hardlink is a tool which detects multiple copies of the same file and
replaces them with hardlinks.
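Usage is as simple as pointing it at a directory (the path is a placeholder):

    # Scan the tree and replace duplicate files with hardlinks
    hardlink /path/to/folder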

I also used jdupes recently with success.

Answered By: Julien Palard

On modern Linux these days there’s https://github.com/g2p/bedup, which de-duplicates on a btrfs filesystem. Compared to hardlinking, 1) there isn’t as much scan overhead, and 2) files can easily diverge again afterwards.

Answered By: Matthew Bloch

If you are going to use hardlinks, pay attention to the permissions on the files. Note that the owner, group, mode, extended attributes, timestamps and ACLs (if you use them) are stored in the inode. Only the file names differ, because they are stored in the directory structure and point to the inode’s properties. As a consequence, all file names linked to the same inode have the same access rights. You should prevent modification of that file, because any user can damage the file for everyone else: it is enough for any user to write other content to a file with the same name. The inode number is then kept, and the original file content is destroyed (replaced) for all hardlinked names.

A better way is deduplication at the filesystem layer. You can use BTRFS (very popular lately), OCFS or something similar. Look at this page: https://en.wikipedia.org/wiki/Comparison_of_file_systems , especially at the Features table and the data deduplication column. You can click it and sort 🙂

Look especially at the ZFS filesystem. It is available as FUSE, but that way it is very slow. If you want native support, look at http://zfsonlinux.org/ . You then have to patch the kernel and install the zfs tools for management. I don’t understand why Linux doesn’t support it as an in-tree driver; many other operating systems / kernels do.

Filesystems support deduplication in two ways: deduplicating files, or blocks. ZFS deduplicates blocks, which means the same content that repeats within a single file can also be deduplicated. The other distinction is when the data is deduplicated: this can be online (ZFS) or offline (BTRFS).

Note that deduplication consumes RAM. This is why writing files to a ZFS volume mounted with FUSE causes dramatically slow performance; it is described in the documentation. But you can turn deduplication on and off on a volume online. If you see that some data should be deduplicated, you simply turn deduplication on, rewrite the files to some temporary location and finally replace the originals. Afterwards you can turn deduplication off and restore full performance. Of course, you can also add cache disks to the storage: these can be very fast rotating disks or SSDs, and of course they can be quite small. In real work this is a replacement for RAM 🙂

Under Linux you should be careful with ZFS, because not everything works as it should, especially when you manage the filesystem, take snapshots, etc., but if you set up your configuration and don’t change it, everything works properly. Otherwise you should switch from Linux to OpenSolaris, which supports ZFS natively 🙂 What is very nice about ZFS is that it works both as a filesystem and as a volume manager similar to LVM, so you do not need LVM when you use ZFS. See the documentation if you want to know more.

Note the difference between ZFS and BTRFS. ZFS is older and more mature, but unfortunately only under Solaris and OpenSolaris (unfortunately strangled by Oracle). BTRFS is younger, but lately it has been very well supported; I recommend a fresh kernel. ZFS has online deduplication, which slows down writes because everything is calculated online. BTRFS supports offline deduplication, which preserves performance: when the host has nothing to do, you periodically run a tool to perform the deduplication. BTRFS was also created natively under Linux. Maybe this is the better FS for you 🙂

Answered By: Znik

The application FSlint (http://www.pixelbeat.org/fslint/) can find all identical files in any folder (comparing by content) and create hardlinks. Give it a try!

Jorge Sampaio

Answered By: Jorge H B Sampaio Jr

Hard links might not be the best idea; if one user changes the file, it affects both. However, deleting a hard link doesn’t delete both files. Also, I am not entirely sure whether hard links take up the same amount of space (on the hard disk, not what the OS reports) as multiple copies of the same file; according to Windows (with the Link Shell Extension), they do. Granted, that’s Windows, not Unix…

My solution would be to create a “common” file in a hidden folder and replace the actual duplicates with symbolic links… then the symbolic links would carry metadata or alternate file streams that record only how the two “files” differ from each other, for instance if one person wants to change the filename or add custom album art or something else like that. It might even be useful outside of database applications, like having multiple versions of the same game or software installed and testing them independently, even with the smallest differences.

Answered By: Amaroq Starwind

The easiest way is to use the dedicated program dupeGuru.

dupeGuru Preferences Screenshot

As the documentation says:

Deletion Options

These options affect how duplicate deletion takes place.
Most of the time, you don’t need to enable any of them.

Link deleted files:

The deleted files are replaced by a link to the reference file.
You have a choice of replacing it either with a symlink or a hardlink.

a symlink is a shortcut to the file’s path.
If the original file is deleted or moved, the link is broken.
A hardlink is a link to the file itself.
That link is as good as a “real” file.
Only when all hardlinks to a file are deleted is the file itself deleted.

On OSX and Linux, this feature is supported fully,
but under Windows, it’s a bit complicated.
Windows XP doesn’t support it, but Vista and up support it.
However, for the feature to work,
dupeGuru has to run with administrative privileges.

jdupes has been mentioned in a comment but deserves its own answer, since it is probably available in most distributions and runs pretty fast (it just freed 2.7 GB of a 98% full 158 GB partition (SSD drive) in about one minute):

jdupes -rL /foo/bar

There is a new file-level deduplication tool:
https://gitlab.com/lasthere/dedup

On BTRFS and XFS it can directly "reflink" identical files.
For file systems that don't support reflinks, one first needs
to produce a shell script via the option "--print-only sh" and
modify it so that it replaces the found identical files with a hardlink (cp --link).

It is especially useful if it is run regularly and file modifications (new dedupe candidates) are located in a separate place, so that only that directory needs a deeper check, while all other locations are only scanned as sources for deduplication.

(I’m author of dedup)

Answered By: lasthere