Use rsync's checksums to deduplicate across backups
alexqw85 at gmail.com
Wed Nov 2 19:09:48 MDT 2011
I apologize if this has already been discussed before, but so far
I have been unable to find any information on the topic.
I have a very simple (and common) disk based backup system using
rsync, hard links, and a little bit of perl to glue it together.
Remote machines are backed up regularly using hardlinks across each
snapshot to reduce disk usage.
Recently I learned that rsync computes a checksum of every file
it transfers. I thought it might be interesting to record the path and
checksum of each file in a table. On future backups, the checksum of
a file being backed up could be looked up in the table; if there's a
matching checksum, a hard link would be created to the match instead of
storing a new copy. This means hard links would no longer be limited
to just the immediately preceding snapshot (as is the case with my
current setup); instead, a hard link could point to an identical file
located in a different machine's snapshot.
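To make the idea concrete, here is a minimal sketch of the table-plus-hardlink
scheme in Python (the function names and the in-memory dict are my own
illustration, not part of my actual setup; it recomputes MD5 itself rather
than reusing rsync's checksum):

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 of a file, reading in chunks to bound memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedup(path, table):
    """If an identical file was recorded earlier, replace `path` with a
    hard link to it; otherwise record it for future backups."""
    digest = file_md5(path)
    match = table.get(digest)
    if match is not None and os.path.exists(match):
        os.unlink(path)       # drop the freshly stored copy...
        os.link(match, path)  # ...and hard-link the earlier identical one
    else:
        table[digest] = path
```

In practice the table would have to live on disk (a dbm file, SQLite, etc.)
so it survives between backup runs, and linking across machines' snapshots
only works while they share one filesystem.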
My initial concern was that computing the checksums would be too
CPU-expensive, but if rsync is already doing that work then it costs
nothing extra. My next thought was that the checksums would be
susceptible to collisions, leading to potential data loss by linking to
a non-identical file. However, from what I've read on Wikipedia, rsync
uses both an MD5 checksum and a rolling checksum. Together these make a
collision /very/ unlikely, so accidentally linking to a non-identical
file shouldn't be a real risk.
Is this approach even possible, or am I missing something? I know
my labs have a lot of duplicate data across many machines, so this
could save me hundreds of GiBs, maybe even a TiB or two.
If this is possible, how can I save the resulting checksum of a
file from rsync?
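(In case it helps: I'm not aware of a way to capture rsync's internal
checksums on the version I'm running; I believe newer releases add a `%C`
escape to `--out-format` that prints the file's checksum when `--checksum`
is in effect. As a fallback, I could recompute the MD5s myself after each
run and persist them, something like the sketch below, where the on-disk
table layout is just my guess at what would work:)

```python
import dbm
import hashlib
import os


def record_snapshot(snapshot_dir, table_path):
    """After rsync finishes, walk the new snapshot and record each
    file's MD5 -> path mapping in a small on-disk dbm table."""
    with dbm.open(table_path, "c") as table:
        for root, _dirs, files in os.walk(snapshot_dir):
            for name in files:
                path = os.path.join(root, name)
                h = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                table[h.hexdigest()] = path
```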
Thank you for your time. I look forward to hearing your thoughts.