Are there any deduplication scripts that use btrfs CoW as dedup?

There are plenty of deduplication tools for Linux; see e.g. this wiki page.

Almost all of these tools do either detection only, printing the duplicate file names, or removal of duplicate files by hardlinking them to a single copy.

With the rise of btrfs there would be another option: creating a CoW (copy-on-write) copy of a file (like cp --reflink=always). I have not found any tool that does this; is anyone aware of one?

Asked By: Peter Smit


I wrote bedup for this purpose. It combines incremental btree scanning with CoW-deduplication. Best used with Linux 3.6, where you can run:

sudo bedup dedup
Answered By: Gabriel

I tried bedup. While it is good (and has some useful differentiated features that may make it the best choice for many), it seems to scan the entirety of every target file for checksums, which is painfully slow.

Other programs on the other hand, such as rdfind and rmlint, scan differently.

rdfind has an “experimental” feature for using btrfs reflink. (And “solid” options for hardlinks, symlinks, etc.)

rmlint has “solid” options for btrfs clone, reflink, regular hardlinks, symlinks, delete, and your own custom commands.

But more importantly, rdfind and rmlint are significantly faster. As in, orders of magnitude. Rather than scanning all target files for checksums, they do approximately this:

  • Scan the whole target filesystem, gathering just paths and file sizes.
  • Remove from consideration files with unique file sizes. This alone saves scads of time and disk activity. (“Scads” is some inverse exponential function or something.)
  • Of the remaining candidates, scan the first N bytes. Remove from consideration those with the same file sizes but different first N bytes.
  • Do the same for the last N bytes.
  • Only for the (usually tiny) fraction remaining, scan full checksums.
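The staged filtering above can be sketched in a few lines of Python. This is a rough illustration of the approach, not rdfind's or rmlint's actual code; `duplicate_groups` and the chunk size `N` are my own names:

```python
import hashlib
import os
from collections import defaultdict

N = 4096  # bytes compared at each end; the real tools choose their own sizes

def _group(paths, key):
    """Bucket paths by key; files with a unique key cannot have a duplicate."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

def _chunk(path, from_end=False):
    """Read the first (or last) N bytes of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if from_end and size > N:
            f.seek(-N, os.SEEK_END)
        return f.read(N)

def _full_hash(path):
    """Checksum the whole file -- only reached by the few survivors."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicate_groups(paths):
    """Apply each cheap filter before the expensive full checksum."""
    groups = [list(paths)]
    for key in (os.path.getsize,
                _chunk,
                lambda p: _chunk(p, from_end=True),
                _full_hash):
        groups = [sub for g in groups for sub in _group(g, key)]
    return groups
```

Each stage only reads the files that survived the previous one, which is why unique-sized files cost nothing beyond a stat call.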

Other advantages of rmlint I’m aware of:

  • You can specify the checksum algorithm. md5 too scary? Try sha256. Or sha512. Or bit-for-bit comparison. Or your own hashing function.
  • It gives you the option of btrfs “clone” as well as “reflink”, rather than just reflink. “cp --reflink=always” is just a bit risky, in that it’s not atomic, it’s not aware of what else is going on with that file in the kernel, and it doesn’t always preserve metadata. “Clone”, OTOH (a shorthand term for the kernel’s FICLONE ioctl), is a kernel-level call that is atomic and preserves metadata. The two almost always produce the same result, but clone is a tad more robust and safe. (Though most programs are smart enough not to delete the duplicate file if they can’t first successfully make a temp reflink to the other.)
  • It has a ton of options for many use-cases (which is also a drawback).
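The kernel-level clone mentioned above can be invoked directly. Below is a minimal sketch, assuming Linux and a reflink-capable filesystem such as btrfs or XFS; the FICLONE ioctl number is the standard Linux value, and `reflink` is a hypothetical helper name:

```python
import fcntl

# FICLONE = _IOW(0x94, 9, int) on Linux; originally BTRFS_IOC_CLONE.
FICLONE = 0x40049409

def reflink(src_path, dst_path):
    """Make dst_path a CoW clone of src_path, atomically, in the kernel.

    Raises OSError (EOPNOTSUPP) on filesystems without reflink support.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
```

Because the kernel performs the whole clone in one call, there is no window where the destination is half-copied, which is the atomicity the answer refers to.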

I compared rmlint with duperemove, which also blindly scans all of every target file for checksums. duperemove took several days on my volume to complete (four, I think), going full-tilt. rmlint took a few hours to identify duplicates, then less than a day to dedup them with btrfs clone.

(That said, anyone making the effort to write and support quality, robust software and give it away for free, deserves major kudos!)

Btw: you should avoid deduping with regular hardlinks as a “general” dedup solution, at all costs.

While hardlinks can be extremely handy in certain targeted use cases (e.g. for individual files, with a tool that can scan for specific file types exceeding some minimum size, or as part of many free and commercial backup/snapshot solutions), they can be disastrous for “deduplication” on a large general-use filesystem. The reason is that most users have thousands of files on their filesystem that are binary-identical but functionally completely different.

For example, many programs generate template and/or hidden settings files (sometimes in every single folder they can see) that are initially identical, and most remain so until you, the user, need them not to be.

As a specific illustration: photo thumbnail cache files, which countless programs generate in the folder containing the photos (and for good reason: portability), can take hours or days to generate, but then make using a photo app a breeze. If those initial cache files are all hardlinked together, and you later open the app on one directory and it builds a large cache, then guess what: now every folder with a previously hardlinked cache has the wrong cache. Potentially with disastrous results that may include accidental data destruction, and potentially in a way that explodes a backup solution that isn’t hardlink-aware.
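The hazard is easy to demonstrate. A small sketch (the cache file names are hypothetical) showing that rewriting one hardlinked “cache” silently rewrites the other, because both names point at the same inode:

```python
import os
import tempfile

d = tempfile.mkdtemp()
cache_a = os.path.join(d, "album_a_thumbs.db")
cache_b = os.path.join(d, "album_b_thumbs.db")

# Two initially identical caches, "deduped" with a hardlink.
with open(cache_a, "wb") as f:
    f.write(b"empty cache")
os.link(cache_a, cache_b)

# The photo app later rebuilds album B's cache in place.
with open(cache_b, "wb") as f:
    f.write(b"thumbnails for album B")

# Album A's cache is now wrong too: both names share one inode.
print(open(cache_a, "rb").read())  # b'thumbnails for album B'
```

A reflink copy would not behave this way: the first write to either copy triggers CoW, and the two files diverge safely.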

Furthermore, it can ruin entire snapshots. The whole point of snapshots is so that the “live” version can continue to change, with the ability to roll back to a previous state. If everything is hardlinked together though…you “roll back” to the same thing.

The good news, though, is that deduping with btrfs clone/reflink can undo that damage. (I think so, anyway: during the scan it should see hardlinked files as identical, unless it has logic to skip hardlinks. It probably depends on the specific utility doing the deduping.)

Answered By: Jim

Eleven years on: I’d suggest fclones. It does exactly this with its dedupe subcommand.

It’s an excellent tool, speedy (written in Rust), and it has just served me well as a great tool to dedupe and rationalise all of my backups.

Answered By: dsz