Use rsync's checksums to deduplicate across backups

Johannes Totz jtotz at imperial.ac.uk
Thu Nov 3 07:19:34 MDT 2011


On 03/11/2011 01:09, Alex Waite wrote:
>     I apologize if this has already been discussed before, but as of
> yet I have been unable to find any info on the topic.
>     I have a very simple (and common) disk based backup system using
> rsync, hard links, and a little bit of perl to glue it together.
> Remote machines are backed up regularly using hardlinks across each
> snapshot to reduce disk usage.
>     Recently I learned that rsync does a checksum of every file
> transferred.  I thought it might be interesting to record the path and
> checksum of each file in a table.  On future backups, the checksum of
> a file being backed up could be looked up in the table.  If there's a
> matching checksum, a hard link will be created to the match instead of
> storing a new copy.  This means that the use of hard links won't be
> limited to just the immediately preceding snapshot (as is the case
> with my current setup).  Instead a hard link could be created to an
> identical file located in a different machine's snapshot.
>     My initial concerns were that doing the checksums would be too CPU
> expensive, but if rsync is already doing them then that isn't a
> concern.  My next thought was that the checksums would be susceptible
> to collisions, thus leading to potential data loss by linking to a
> non-identical file.  However, from what I've read on Wikipedia, rsync
> does both an MD5 and a rolling checksum.  These two together make it
> /very/ unlikely to have a collision, thus accidentally linking to a
> non-identical file is unlikely.
>     Is this approach even possible, or am I missing something?  I know
> my labs have a lot of duplicate data across many machines, so this
> could save me hundreds of GiBs, maybe even a TiB or two.
>     If this is possible, how can I save the resulting checksum of a
> file from rsync?
>    Thank you for your time.  I look forward to hearing your thoughts.

Check out BackupPC (http://backuppc.sourceforge.net/). It's a Perl-based
backup tool that uses rsync and does exactly what you ask for.
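
For illustration, the checksum-table idea from the quoted message could
be sketched roughly as below. This is a minimal sketch under stated
assumptions, not how rsync or BackupPC implements it: it computes its
own SHA-256 digest instead of reusing rsync's checksum, the table is a
plain in-memory dict, and the names file_digest and dedup_tree are
hypothetical. A byte-for-byte comparison is done before linking, so a
digest collision cannot cause a link to a non-identical file.

```python
import filecmp
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedup_tree(root, table):
    """Hard-link duplicate regular files under root.

    table maps digest -> path of the first file seen with that digest
    and may be reused across calls (e.g. across snapshots).
    Returns the number of bytes freed by linking.
    """
    saved = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Only plain files are candidates; skip symlinks and specials.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = file_digest(path)
            known = table.get(digest)
            if known is None or not os.path.exists(known):
                table[digest] = path
                continue
            if os.path.samefile(known, path):
                continue  # already the same inode
            # Guard against digest collisions with a full comparison.
            if not filecmp.cmp(known, path, shallow=False):
                continue
            size = os.path.getsize(path)
            # Link under a temporary name, then rename over the
            # duplicate so the path never disappears.
            tmp = path + ".dedup-tmp"
            os.link(known, tmp)
            os.replace(tmp, path)
            saved += size
    return saved
```

Usage would be something like dedup_tree("/backups", {}) over a
directory holding all snapshots. Note that hard links only work within
one filesystem, and linked files share a single inode, so owner,
permissions and timestamps become identical for all linked copies.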



