Use rsync's checksums to deduplicate across backups
Carlos Carvalho
carlos at fisica.ufpr.br
Thu Nov 3 08:35:25 MDT 2011
Alex Waite (alexqw85 at gmail.com) wrote on 2 November 2011 20:09:
> Recently I learned that rsync does a checksum of every file
>transferred. I thought it might be interesting to record the path and
>checksum of each file in a table. On future backups, the checksum of
>a file being backed up could be looked up in the table. If there's a
>matching checksum, a hard link will be created to the match instead of
>storing a new copy. This means that the use of hard links won't be
>limited to just the immediately preceding snapshot (as is the case
>with my current setup). Instead a hard link could be created to an
>identical file located in a different machine's snapshot.
...
> Is this approach even possible, or am I missing something? I know
>my labs have a lot of duplicate data across many machines, so this
>could save me hundreds of GiBs, maybe even a TiB or two.
It is possible, but managing it all is up to you; it's not rsync's job.
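
As a rough illustration of what that management could look like, here
is a minimal Python sketch. It is not something rsync provides. The
function name link_or_record, the in-memory dict used as the checksum
table, and the ".dedup-tmp" suffix are all assumptions made for the
example. It also assumes every snapshot lives on the same filesystem,
since hard links cannot cross filesystems.

```python
import filecmp
import os


def link_or_record(index, checksum, path):
    """Deduplicate `path` against files already seen with the same checksum.

    `index` is a plain dict mapping checksum -> path of the first file seen
    with that checksum (a hypothetical stand-in for the "table").
    Returns True if `path` was replaced by a hard link, False otherwise.
    """
    existing = index.get(checksum)
    if existing is None or not os.path.exists(existing):
        # First time we see this checksum: remember where the file lives.
        index[checksum] = path
        return False
    if os.path.samefile(existing, path):
        # Already the same inode, nothing to do.
        return False
    if not filecmp.cmp(existing, path, shallow=False):
        # Same checksum but different bytes: leave both files alone.
        return False
    # Create the link under a temporary name, then rename it over `path`,
    # so `path` never disappears if something fails halfway.
    tmp = path + ".dedup-tmp"
    os.link(existing, tmp)
    os.replace(tmp, path)
    return True
```

Keep in mind that hard-linked names share one inode, so owner,
permissions and mtime become common to all copies; files that differ
only in metadata may not be good candidates for linking.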
> If this is possible, how can I save the resulting checksum of a
>file from rsync?
You'll need at least rsync v3 on the source machines and v3.1 on the
backup machine. Put %C in --out-format to get the md5 in the log. Note
that rsync only logs the md5 when it actually transfers the file (or
when you use -c); if it creates the hard link itself, the md5 is not
computed, so it doesn't appear in the log.
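
To make this concrete, here is a small Python sketch that reads such a
log. It assumes rsync was run with --out-format='%C %n', so each useful
line is the md5, one space, then the file name; that layout is a choice
made for this example, not a requirement. The function names are
hypothetical. Lines that do not start with a 32-digit hex checksum
(status messages, or files for which no md5 was computed) are skipped,
and the sketch does not try to undo any escaping rsync may apply to
unusual file names.

```python
import re
from collections import defaultdict

# Assumed log line layout: 32 lowercase hex digits (md5), a space, the name.
LINE_RE = re.compile(r"^([0-9a-f]{32}) (.+)$")


def parse_checksum_log(lines):
    """Yield (md5, name) pairs from lines in the assumed '%C %n' layout."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            yield match.group(1), match.group(2)


def group_by_checksum(pairs):
    """Build a table mapping each md5 to the list of names that carry it."""
    table = defaultdict(list)
    for checksum, name in pairs:
        table[checksum].append(name)
    return dict(table)
```

Any checksum that ends up with more than one name in the table is a
candidate for hard linking. Comparing the file contents before linking
is a cheap safeguard.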