Use rsync's checksums to deduplicate across backups

Wed Nov 2 19:55:58 MDT 2011

On 2011-11-03, Alex Waite <alexqw85 at gmail.com> wrote:
>     I apologize if this has already been discussed before, but as of
> yet I have been unable to find any info on the topic.
>     I have a very simple (and common) disk based backup system using
> rsync, hard links, and a little bit of perl to glue it together.
> Remote machines are backed up regularly using hardlinks across each
> snapshot to reduce disk usage.
>     Recently I learned that rsync does a checksum of every file
> transferred.  I thought it might be interesting to record the path and
> checksum of each file in a table.  On future backups, the checksum of
> a file being backed up could be looked up in the table.  If there's a
> matching checksum, a hard link will be created to the match instead of
> storing a new copy.  This means that the use of hard links won't be
> limited to just the immediately preceding snapshot (as is the case
> with my current setup).  Instead a hard link could be created to an
> identical file located in a different machine's snapshot.
>     My initial concerns were that doing the checksums would be too CPU
> expensive, but if rsync is already doing them then that isn't a
> concern.  My next thought was that the checksums would be susceptible
> to collisions, thus leading to potential data loss by linking to a
> non-identical file.  However, from what I've read on Wikipedia, rsync
> does both an MD5 and a rolling checksum.  These two together make it
> /very/ unlikely to have a collision, thus accidentally linking to a
> non-identical file is unlikely.
>     Is this approach even possible, or am I missing something?  I know
> my labs have a lot of duplicate data across many machines, so this
> could save me hundreds of GiBs, maybe even a TiB or two.
>     If this is possible, how can I save the resulting checksum of a
> file from rsync?
>    Thank you for your time.  I look forward to hearing your thoughts.
>
> ---Alex

Not a direct answer, but this may do what you want:

  http://gitweb.samba.org/?p=rsync-patches.git;a=blob;f=link-by-hash.diff

  This patch adds the --link-by-hash=DIR option, which hard links received
  files in a link farm arranged by MD4 file hash.  The result is that the system
  will only store one copy of the unique contents of each file, regardless of
  the file's name.
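
For illustration only, here is a rough Python sketch of the link-farm idea,
run as a post-processing step over a snapshot rather than inside rsync.  It
is not the patch's code: the function names, the two-character bucket
layout, and the ".lbh-tmp" suffix are made up for this example, and SHA-256
stands in for MD4 because MD4 is often unavailable in current hashlib
builds.  It also compares bytes before linking, which covers the collision
worry raised above.

```python
import filecmp
import hashlib
import os


def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def link_by_hash(path, farm_dir):
    """Deduplicate `path` against a link farm keyed by content hash.

    Returns True if `path` shares an inode with an already-known farm
    entry when the call returns, and False if `path` was newly registered
    in the farm or was left alone because the contents differed.
    Hypothetical helper: the farm layout (farm_dir/<2 hex chars>/<digest>)
    is an assumption of this sketch.
    """
    digest = file_digest(path)
    bucket = os.path.join(farm_dir, digest[:2])
    os.makedirs(bucket, exist_ok=True)
    farm_path = os.path.join(bucket, digest)

    if not os.path.exists(farm_path):
        # First time this content is seen: register it in the farm.
        os.link(path, farm_path)
        return False

    if os.path.samefile(path, farm_path):
        # Already hard linked to the farm entry; nothing to do.
        return True

    # Guard against hash collisions: only link byte-identical files.
    if not filecmp.cmp(path, farm_path, shallow=False):
        return False

    # Swap `path` for a hard link to the farm entry.  Linking to a
    # temporary name and renaming over `path` keeps the swap atomic.
    tmp = path + ".lbh-tmp"
    os.link(farm_path, tmp)
    os.replace(tmp, path)
    return True
```

Calling link_by_hash() on every regular file in a new snapshot would link
duplicates across machines and across older snapshots, provided the farm
and the snapshots live on the same filesystem.  Note that hard linked files
share ownership, permissions, and timestamps, so this only fits when that
metadata may be shared too.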

Cheers,

Chris