Use rsync's checksums to deduplicate across backups

Thu Nov 3 08:35:25 MDT 2011

Alex Waite (alexqw85 at gmail.com) wrote on 2 November 2011 20:09:
 >    Recently I learned that rsync does a checksum of every file
 >transferred.  I thought it might be interesting to record the path and
 >checksum of each file in a table.  On future backups, the checksum of
 >a file being backed up could be looked up in the table.  If there's a
 >matching checksum, a hard link will be created to the match instead of
 >storing a new copy.  This means that the use of hard links won't be
 >limited to just the immediately preceding snapshot (as is the case
 >with my current setup).  Instead a hard link could be created to an
 >identical file located in a different machine's snapshot.
...
 >    Is this approach even possible, or am I missing something?  I know
 >my labs have a lot of duplicate data across many machines, so this
 >could save me hundreds of GiBs, maybe even a TiB or two.

It is, but managing it all is up to you; it's not rsync's job.

 >    If this is possible, how can I save the resulting checksum of a
 >file from rsync?

You'll need at least rsync v3 on the source machines and v3.1 on the
backup machine. Put %C in --out-format to get the MD5 in the log. Note
that rsync only logs the MD5 when it actually transfers the file (or
when you use -c); if it creates a hard link itself, the MD5 is not
computed, so it does not appear in the log.
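
As a rough illustration of the bookkeeping that is left to you, here is
a minimal Python sketch. It assumes rsync was run with something like
--out-format='%C %n', so each transferred file produces a line of the
form "<32 hex digits> <relative path>"; that format string, the names
parse_log and dedupe, and the in-memory table are all hypothetical
choices, not anything rsync provides. It also assumes every snapshot
lives on the same filesystem, since hard links cannot cross filesystems.

```python
import os
import re

# Assumed log line shape: "<md5 as 32 hex digits> <path relative to snapshot>".
# This matches a hypothetical --out-format='%C %n'; adjust to your own format.
LINE_RE = re.compile(r"^([0-9a-f]{32}) (.+)$")


def parse_log(lines):
    """Yield (md5, relative_path) pairs; lines without a checksum are skipped."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            yield match.group(1), match.group(2)


def dedupe(entries, snapshot_root, table):
    """Hard-link files whose checksum is already in `table`.

    `table` maps md5 -> path of the first copy seen and is updated in place.
    Returns the number of files replaced by hard links.
    """
    linked = 0
    for md5, rel_path in entries:
        path = os.path.join(snapshot_root, rel_path)
        # Only regular files are candidates; skip symlinks and directories.
        if os.path.islink(path) or not os.path.isfile(path):
            continue
        known = table.get(md5)
        if known is None or not os.path.exists(known):
            table[md5] = path  # first time this content is seen
            continue
        if os.path.samefile(known, path):
            continue  # already the same inode
        # Cheap sanity check; a matching MD5 alone is not proof of identity.
        if os.path.getsize(known) != os.path.getsize(path):
            continue
        # Link under a temporary name, then rename over the duplicate so
        # the path never disappears if something fails halfway.
        tmp = path + ".dedup-tmp"
        os.link(known, tmp)
        os.replace(tmp, path)
        linked += 1
    return linked
```

You would keep the table somewhere persistent (a flat file or a small
database) and feed it the log of each backup run. Keep in mind that
hard-linked files share owner, permissions and mtime, so a real tool
would probably want to compare those as well before linking.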