superlifter design notes and a new proposal

Mon Aug 5 22:19:01 EST 2002

> One interesting idea, further down the track, would be to introduce a
> virtual filesystem layer into rsync 3, similar to that in Samba, so
> that all disk IO goes through a function-table layer.
> 
> You could then write a filesystem layer that, rather than talking to
> the native filesystem, talks to some kind of database.  This would be
> very nice for a backup server for a couple of reasons.  (It could be a
> Sleepycat DB or SQL or whatever.)
> 
> Firstly, the database could implement arbitrarily rich filesystem
> semantics without being limited by the native filesystem.  For
> example, you couldn't back up OS X resource forks to a Solaris
> machine, but you can easily store them inside a database.  Similarly
> for storing extended attributes, or storing NT security descriptors on
> Unix (or vice versa).
> 
> Secondly, the database could be tuned for the particular case of
> storing incremental backups: between one version of a file and the
> next, you could just store an xdelta diff, and identical files could
> be replaced by something similar to hardlinks.  The whole thing could
> be compressed.  Basically you would be tuning for the case of an
> append-only filesystem.

This would be great.  <shameless_plug> You should check out BackupPC
at http://backuppc.sourceforge.net </shameless_plug>. This implements
backup storage by hardlinking identical files and storing metadata
such as attributes separately.  Files are also compressed.  Some rough
numbers: backing up about 1.2TB of total data (3 fulls + 6 incrs) from
100 desktops requires only about 150GB of storage (YMMV).  Most
of the benefit is from storing repeated files only once.  Compression
provides roughly a 40% benefit.
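
The pooling idea can be sketched roughly as follows.  This is only an
illustration, not BackupPC's actual on-disk format: the function names
(store_file, read_file), the use of SHA-256 as the pool key, and the
directory layout are all assumptions made for the example.

```python
import hashlib
import os
import zlib


def store_file(pool_dir, backup_dir, rel_path, data):
    """Store data at backup_dir/rel_path, sharing identical content via a pool.

    Hypothetical helper: the content digest names the pool entry, so a
    second file with the same bytes costs one hardlink, not a second copy.
    """
    digest = hashlib.sha256(data).hexdigest()
    pool_path = os.path.join(pool_dir, digest)
    if not os.path.exists(pool_path):
        # First time this content is seen: write it once, compressed.
        os.makedirs(pool_dir, exist_ok=True)
        with open(pool_path, "wb") as f:
            f.write(zlib.compress(data))
    dest = os.path.join(backup_dir, rel_path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # The backup tree entry is just another name for the pool file.
    os.link(pool_path, dest)
    return digest


def read_file(backup_dir, rel_path):
    """Return the original bytes of a stored file."""
    with open(os.path.join(backup_dir, rel_path), "rb") as f:
        return zlib.decompress(f.read())
```

Storing the same content under two backup trees (say full1/etc/motd and
incr1/etc/motd) then leaves one compressed file on disk with two extra
names.  The pool and the backup trees have to live on the same
filesystem for the hardlink to work.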

By running some tests on this data set I estimated that storing
reverse deltas as well would save an additional 15% or so.  Not
a huge improvement and from my perspective not worth the effort.
But if the data being backed up involves lots of small changes
to large files then reverse deltas would help more.
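
For reference, a reverse delta keeps the newest version whole and
describes each older version as a set of edits against it.  The sketch
below is a deliberately simple block-aligned scheme, not xdelta and not
rsync's rolling-checksum algorithm; make_delta, apply_delta and the op
format are made up for the example.

```python
def make_delta(target, base, block=4096):
    """Describe target as a list of ops against base.

    Each op is ("copy", offset, length), meaning take bytes from base,
    or ("data", literal_bytes).  Only block-aligned matches are found.
    """
    index = {}
    for off in range(0, len(base), block):
        # Remember the first offset at which each base block occurs.
        index.setdefault(base[off:off + block], off)
    ops = []
    for off in range(0, len(target), block):
        chunk = target[off:off + block]
        if chunk in index:
            ops.append(("copy", index[chunk], len(chunk)))
        else:
            ops.append(("data", chunk))
    return ops


def apply_delta(base, ops):
    """Rebuild the target from base plus the ops produced by make_delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

To store a reverse delta, call make_delta(target=old, base=new), keep
new in full, and keep only the ops for old.  The ops are small when few
blocks changed, which is the "small changes to large files" case; an
insertion that shifts every later block defeats this aligned scheme.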

Anyhow, abstracting the file system interface in rsync would be a huge
win for me.  I'm currently implementing one side of rsync (in Perl) so
I can support my compressed, hardlinked backup file system structure
and still talk to a vanilla rsync on the client.  This will be a big
improvement over tar and smbclient, which are currently used to extract
the backup data.  I will also cache the block and file checksums, so
backups can involve almost no server disk reads for unchanged files.
If rsync had a clear file system API then my job would be much
easier!
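
To make the idea concrete, here is a rough sketch of what such an
interface might look like.  Nothing here is rsync's real code or API:
FileStore, CompressedStore and needs_transfer are hypothetical names,
the store is an in-memory dict rather than a disk layout, and SHA-256
stands in for whatever checksum the protocol actually uses.

```python
import hashlib
import zlib
from abc import ABC, abstractmethod


class FileStore(ABC):
    """Hypothetical storage interface; transfer logic calls only these."""

    @abstractmethod
    def read(self, path): ...

    @abstractmethod
    def write(self, path, data): ...

    @abstractmethod
    def checksum(self, path): ...


class CompressedStore(FileStore):
    """Keeps file bodies compressed and caches a whole-file checksum."""

    def __init__(self):
        self._blobs = {}     # path -> compressed bytes
        self._sums = {}      # path -> cached hex digest
        self.body_reads = 0  # counts how often a file body is read back

    def write(self, path, data):
        self._blobs[path] = zlib.compress(data)
        # Compute the checksum once, while the data is already in hand.
        self._sums[path] = hashlib.sha256(data).hexdigest()

    def read(self, path):
        self.body_reads += 1
        return zlib.decompress(self._blobs[path])

    def checksum(self, path):
        # Answered from the cache; the stored body is never touched.
        return self._sums[path]


def needs_transfer(store, path, client_checksum):
    """True if the stored copy is missing or differs from the client's."""
    try:
        return store.checksum(path) != client_checksum
    except KeyError:
        return True
```

With this shape, checking an unchanged file costs one cache lookup and
no read of the stored body, and a native-filesystem backend or a
database backend could be swapped in behind the same three methods.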

Craig