Use rsync's checksums to deduplicate across backups
drsalists at gmail.com
Sun Nov 6 15:29:44 MST 2011
On Thu, Nov 3, 2011 at 7:22 AM, Alex Waite <alexqw85 at gmail.com> wrote:
> > Check out http://backuppc.sourceforge.net/, it's a Perl-based backup
> > tool, using rsync and doing exactly what you ask for.
> I have looked at BackupPC before (and it is a nice piece of
> software), and it does hardlink across all backups, but I believe it
> does its own checksumming on top of what rsync already does. I imagine
> this would make performance noticeably worse than what I currently
> have, though I could be wrong.
An additional checksum (digest) shouldn't change the performance equation
much. Doesn't rsync already compute two: a weak, rolling checksum and a
stronger, non-rolling digest? I could be mistaken.
I've been putting together a project along the lines you mention. It
doesn't use rsync at all, but it's inspired by rsync --link-dest:
It tries to do one thing, well: It deduplicates variable-length,
content-based blocks, and compresses those blocks with xz or bzip2.
Deduplication is done intra- and inter-host.
The variable-length, content-based blocking is nice for torrents, large
logfiles, and other large, slow-growing files - otherwise each backup would
store the whole file over again, and you'd end up requiring space
proportional to the square of the final file length.
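To illustrate the blocking (a sketch only - this is my reading of the
general technique, not my tool's actual algorithm, and every constant below
is made up): a rolling hash over a small window picks chunk boundaries from
the content itself, so an insertion near the front of a file only disturbs
chunking locally and later chunks still deduplicate:

```python
# Content-defined chunking sketch: a Rabin-style rolling hash over the
# last WINDOW bytes; a boundary is declared wherever the hash hits a
# fixed pattern, giving chunks of roughly AVG bytes on average.
BASE = 257
MASK = (1 << 32) - 1
WINDOW = 48
AVG = 512            # illustrative target average chunk size

def chunk(data):
    """Split data into content-defined chunks (a list of bytes objects)."""
    pw = pow(BASE, WINDOW, 1 << 32)  # coefficient of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) & MASK
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pw) & MASK
        if h % AVG == AVG - 1:       # content-determined boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because a boundary depends only on the last WINDOW bytes, inserting data
changes chunking only until the hash window re-synchronizes; every later
chunk is byte-identical and dedupes. Fixed-offset blocks would instead
shift everything after the insertion point, which is where the quadratic
space blowup for slowly growing files comes from.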
It's good at backing up hardlinks (up to 1 million distinct
inodeno+deviceno pairs for now, due to a bloom filter; that limit is
easy to adjust in the source, again just for now), and it doesn't create
millions of hardlinks itself. IOW, upgrading your backup server need not
be a headache. It also doesn't treat one hardlink subtree as standing for
all the links - that is, it detects hardlinks itself instead of relying on
st_nlink from stat(2). I've not tried this, but it should even be practical
to back up one repo to another.
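The hardlink detection amounts to remembering (st_dev, st_ino) pairs as
you walk the tree. A toy sketch of that idea (my tool's actual data
structures differ; this just shows why the pair count, not st_nlink, is
the limit that matters):

```python
# Sketch: detect hardlinks by (st_dev, st_ino), not st_nlink. A Bloom
# filter bounds memory but can yield false positives ("seen" when it
# wasn't), so a real tool needs an exact structure alongside it; the
# capacity limit mentioned above comes from sizing such a filter.
import hashlib
import os

class BloomFilter:
    def __init__(self, bits=1 << 20, nhashes=4):
        self.bits, self.nhashes = bits, nhashes
        self.bitmap = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.bitmap[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bitmap[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))

def inode_key(path):
    """Stable key identifying an inode, independent of st_nlink."""
    st = os.lstat(path)
    return f"{st.st_dev}:{st.st_ino}".encode()
```

Walking the tree, a path whose inode_key is already in the filter is
(probably) another link to a file already stored, regardless of what
st_nlink claims.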
Doing a save is a matter of piping find(1) into it (similar to cpio(1)) for
now, and doing a restore is a matter of asking for a tar archive to be
assembled from the compressed pieces and sent to stdout - naturally this is
often piped into ssh(1) or tar(1). That is, it leverages preexisting tools.
This means you don't need a copy of the program on the client to do a
restore - just ssh and tar. Ditto for backup verification, since GNU tar
has a nifty --diff option.
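As a sketch of what that verification amounts to (Python's tarfile standing
in for GNU tar here; the real --diff also checks mtime, mode, ownership,
and file contents, which this toy version skips):

```python
# Rough analogue of `tar --diff`: walk an archive's members and compare
# them against the filesystem. Only existence and regular-file size are
# checked here; GNU tar's actual --diff is far more thorough.
import os
import tarfile

def verify_against_fs(archive_path, root="."):
    """Return a list of mismatch descriptions; empty means verified."""
    problems = []
    with tarfile.open(archive_path) as tf:
        for member in tf.getmembers():
            path = os.path.join(root, member.name)
            if not os.path.lexists(path):
                problems.append(f"{member.name}: missing")
            elif member.isreg() and os.path.getsize(path) != member.size:
                problems.append(f"{member.name}: size differs")
    return problems
```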
Despite the use of the tar format for some things, getting just a table of
contents (ToC) doesn't require assembling an entire tar archive, which
speeds it up quite a bit. The ToC output looks just like that of tar -tvf,
but no tar archive is built behind the scenes.
Its main selling points are probably the tininess of the repo and the
non-abuse of hardlinks. Perhaps I should mention the comprehensive
automated testing and documentation, though...