Use rsync's checksums to deduplicate across backups

Dan Stromberg drsalists at gmail.com
Sun Nov 6 15:29:44 MST 2011


On Thu, Nov 3, 2011 at 7:22 AM, Alex Waite <alexqw85 at gmail.com> wrote:

> >
> > Check out http://backuppc.sourceforge.net/, it's perl-based backup tool,
> > using rsync and doing exactly what you ask for.
> >
>     I have looked at BackupPC before (and it is a nice piece of
> software), and it does hardlink across all backups, but I believe it
> does its own checksum on top of what rsync already does.  I imagine
> this would make performance noticeably worse than what I currently
> have, though I could be wrong.
>

An additional checksum (digest) shouldn't change the performance equation
much.  Doesn't rsync already compute two: one weak and rolling, one
stronger and not rolling?  I could be mistaken.
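
For concreteness, here's a rough Python sketch of that two-level idea -
a cheap rolling checksum that slides a byte at a time, plus a strong
digest computed only when the weak one matches.  This is just an
illustration of the concept, not rsync's actual code (rsync is C, and
its real checksums differ in detail):

    import hashlib

    M = 1 << 16

    def weak_checksum(block):
        # Adler-32-style weak checksum over one block of bytes.
        a = sum(block) % M
        b = sum((len(block) - i) * byte
                for i, byte in enumerate(block)) % M
        return a, b

    def roll(a, b, old_byte, new_byte, block_len):
        # Slide the window one byte to the right in O(1).
        a = (a - old_byte + new_byte) % M
        b = (b - block_len * old_byte + a) % M
        return a, b

    def strong_checksum(block):
        # Only computed when the weak checksum already matched.
        return hashlib.md5(block).digest()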

I've been putting together a project along the lines you mention.  It
doesn't use rsync at all, but it's inspired by rsync --link-dest:
http://stromberg.dnsalias.org/~strombrg/backshift/
http://stromberg.dnsalias.org/~strombrg/backshift/documentation/comparison/index.html

It tries to do one thing, well: It deduplicates variable-length,
content-based blocks, and compresses those blocks with xz or bzip2.
Deduplication is done intra- and inter-host.
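
To make the chunking concrete, here's a rough Python sketch of
variable-length, content-defined blocking with dedup and compression.
It is not backshift's actual chunker or on-disk layout - the boundary
mask, the size limits and the store() layout below are just
illustrative:

    import bz2
    import hashlib
    import os

    MIN_CHUNK = 2048          # don't cut blocks smaller than this
    MAX_CHUNK = 1 << 20       # hard cap so blocks can't grow unbounded
    BOUNDARY_MASK = 0xFFF     # ~4 KiB average block size

    def chunks(data):
        """Yield blocks whose boundaries depend on content, so an insertion
        early in a file only disturbs the blocks near it."""
        start = 0
        rolling = 0
        for i, byte in enumerate(data):
            # Shift-and-add rolling hash: a byte's influence ages out after
            # about 32 bytes, so boundaries depend only on local content.
            rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
            length = i - start + 1
            if length >= MIN_CHUNK and (
                    (rolling & BOUNDARY_MASK) == 0 or length >= MAX_CHUNK):
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def store(data, repo_dir):
        """Deduplicate blocks by digest; compress only the new ones."""
        digests = []
        for block in chunks(data):
            digest = hashlib.sha256(block).hexdigest()
            path = os.path.join(repo_dir, digest)
            if not os.path.exists(path):      # dedup: skip known blocks
                with open(path, 'wb') as f:
                    f.write(bz2.compress(block))
            digests.append(digest)
        return digests                        # the file's "recipe"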

The variable-length, content-based blocking is nice for torrents, large
logfiles, and other large, slow-growing files - without it, repeated
backups of such a file would need space roughly proportional to the
square of its final length, because each backup would store the whole
file again rather than just the new blocks.
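
As a rough illustration (my numbers, not from backshift's docs): a
logfile that grows by 1 GB a day, backed up daily for 100 days, costs
about 1 + 2 + ... + 100 GB, roughly 5 TB, if every backup stores the
whole file - but only about 100 GB plus metadata if each backup adds
just the new blocks.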

It's good at backing up hardlinks (up to 1 million distinct
inode-number + device-number pairs for now, due to a bloom filter -
that limit is mostly easy to adjust in the source, again just for now),
and it doesn't create millions of hardlinks itself.  IOW, upgrading
your backup server need not be a headache.  It also doesn't treat one
hardlinked subtree as nothing but links - that is, it detects hardlinks
itself instead of relying on st_nlink from stat(2).  I've not tried
this, but it should even be practical to back up one repo to another.
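
The hardlink handling boils down to remembering (device, inode) pairs.
Here's a rough Python sketch of the idea - backshift uses a bloom
filter to bound memory, where this uses an exact dict for clarity:

    import os
    import stat

    def classify_hardlinks(paths):
        """Yield (path, first_path_or_None): None means this inode has not
        been seen before, so save its content; otherwise the second element
        is the path it should be recorded as a hardlink to."""
        seen = {}                            # (st_dev, st_ino) -> first path
        for path in paths:
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                yield path, None             # only track regular files
                continue
            key = (st.st_dev, st.st_ino)
            if key in seen:
                yield path, seen[key]        # link to a file already saved
            else:
                seen[key] = path
                yield path, None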

For now, doing a save is a matter of piping find(1) into it (similar to
cpio(1)), and doing a restore is a matter of asking for a tar archive to
be assembled from the compressed pieces and sent to stdout - naturally,
that stream is often piped into ssh(1) or tar(1).  That is, it leverages
preexisting tools pretty well.
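
To illustrate the restore side, here's a rough Python sketch of
assembling a tar stream on stdout from compressed blocks, using the
standard tarfile module.  The repo layout and the (name, mtime,
digests) "recipe" tuples are invented for illustration; backshift's
real formats differ:

    import bz2
    import io
    import os
    import sys
    import tarfile

    def restore_to_stdout(recipes, repo_dir):
        """recipes: iterable of (name, mtime, [block_digest, ...])."""
        with tarfile.open(fileobj=sys.stdout.buffer, mode='w|') as tar:
            for name, mtime, digests in recipes:
                # Reassemble each file from its deduplicated, compressed
                # blocks, then emit it as one tar member.
                data = b''.join(
                    bz2.decompress(
                        open(os.path.join(repo_dir, d), 'rb').read())
                    for d in digests)
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                info.mtime = mtime
                tar.addfile(info, io.BytesIO(data))

Writing the archive in streaming mode ('w|') means the sketch never
needs to seek, so its output can go straight down a pipe.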

This means you don't need a copy of the program on the client to do a
restore - just ssh and tar.  Ditto for backup verification, since GNU tar
has a nifty --diff option.

Despite the use of the tar format for restores, just getting a table of
contents (ToC) doesn't require assembling an entire tar archive, so
that's accelerated quite a bit.  The ToC output looks just like that of
tar -tvf, but no tar archive is built behind the scenes.
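
A ToC like that can come straight out of the saved metadata.  A rough
sketch of the formatting (the metadata fields here are assumed, not
backshift's actual schema):

    import stat
    import time

    def print_toc(entries):
        """entries: iterable of dicts with mode, uid, gid, size, mtime and
        name - prints lines resembling tar -tvf output while touching no
        file data at all."""
        for e in entries:
            print('%s %s/%s %10d %s %s' % (
                stat.filemode(e['mode']),
                e['uid'], e['gid'],
                e['size'],
                time.strftime('%Y-%m-%d %H:%M',
                              time.localtime(e['mtime'])),
                e['name']))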

Its main selling points are probably the tininess of the repo and the
non-abuse of hardlinks.  Perhaps I should also mention the comprehensive
automated testing and documentation, though...

HTH.