Adding support for versioned files in rsync

jw schultz jw at pegasys.ws
Tue Oct 14 03:44:52 EST 2003


On Mon, Oct 13, 2003 at 10:57:36AM -0400, Jason M. Felice wrote:
> Hi!
> 
> Below is a link to a proposal I'm writing for two clients of ours who want an
> Internet-based backup solution.  I propose eleven "objectives" in it,
> most of which are modifications to rsync.  I'd like to contribute
> these changes back where possible, and so I'm posting this here for review.
> 
> The nuts and bolts of it is the ability to keep multiple copies of files
> (think VAX file system, or maybe CVS) and the ability to restore from a
> particular date or version.
> 
> Feel free to rip it to shreds.  I know there are a couple of minor
> issues with it already.  I've poked around in the source to research
> this, but if I'm missing any major sticking-points that someone knows
> about off the top of their head, that's something I'd like to know, too.
> 
> All feedback is appreciated,
> -Jay 'Eraserhead' Felice
> 
> 
>                              SafeSync Backup System
> 
>                                Project Definition
> 
>    Prepared By: Jason M. Felice
> 
>    Date: October 11, 2003
> 
>                                 Project overview
> 
>    Inline Technologies, Inc and Gem have approached Cronosys, LLC to discuss
>    the implementation of an easy-to-use backup mechanism which would meet the
>    following criteria:
> 
>     1. Would keep multiple versions of backed-up files and allow several
>        different versions to be restored (think of Netware's "salvage"
>        command).
> 
>     2. Would use as little bandwidth as possible.
> 
>     3. Would back up to a central server.
> 
>     4. Would run unattended and grab new revisions of files.

Try dirvish or one of the other backup systems already out
there.

>    Cronosys researched several possibilities for implementing this system.
>    The bandwidth requirements suggest the rsync algorithm, so all
>    possibilities investigated would use the rsync algorithm for transferring
>    files. The methods investigated were:
> 
>      * Writing a new protocol which supported the additional features
>        required.
> 
>        Some general designs were scratched out, the rsync source code was
>        inspected for possible traps, and a general design of a new protocol
>        was worked out. Evaluation of it indicated that this option would be
>        fairly labor-intensive.

I don't say the rsync protocol is a good one, but you don't
say why you need your own protocol.  Of course, if you wish
to pay developers for the next three years...

>      * Using the rsync algorithm on top of another protocol.
> 
>        XML-RPC over HTTP was investigated as a possible transport for the
>        system. Several fatal flaws were discovered with this. First, XML's
>        language encoding prevents sending binary data directly, and XML-RPC's
>        base64 encoding for binary data inflates the size of the data
>        unacceptably. Second, file transfers to store to the server would have
>        to be made in two requests (one to retrieve the old file information
>        and one to post the reconstruction instructions), which becomes more
>        complicated since XML-RPC is essentially stateless. Third, this
>        introduces a number of software dependencies which would have to be
>        managed (web server, HTTP client library, XML-RPC library, SSL
>        library).
> 
>      * Extending the existing rsync with these capabilities.
> 
>        This was decided as the best method for implementation. There are
>        drawbacks, most notably that rsync was not designed to handle multiple
>        versions and there will be some difficulties in retrofitting the
>        internal structures and the protocol to handle versions.
> 
>        On the other hand, this has several benefits:
> 
>           * The code base is already very well tested, and any code which is
>             integrated back into the project will be very well tested.
> 
>           * Contributing back to this project is in line with our corporate
>             philosophy.

We will be appreciative of bug fixes and enhancements that
are of general value.  We do not care for enhancements whose
sole value is supporting a commercial application.

>           * The possibility of collaboration and support from developers
>             who require similar features.

s/possibility/requirement/  if you want the patches
accepted.
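
On the base64 point in the XML-RPC option above: the overhead
is easy to put a number on.  A minimal Python sketch (nothing
here comes from the proposal; the 3000-byte payload is an
arbitrary stand-in for a block of file data):

```python
import base64
import os


def base64_overhead(payload: bytes) -> float:
    """Return encoded size divided by raw size for standard base64."""
    encoded = base64.b64encode(payload)
    return len(encoded) / len(payload)


# Arbitrary stand-in for binary file data; any non-empty payload works.
raw = os.urandom(3000)
encoded_size = len(base64.b64encode(raw))
ratio = base64_overhead(raw)
print(f"{len(raw)} raw bytes -> {encoded_size} encoded bytes ({ratio:.2f}x)")
```

Base64 emits 4 output bytes for every 3 input bytes, so the
payload grows by about a third before any XML framing is
added on top.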

> 
>                                Project objectives
> 
>    The SafeSync Backup System project will meet the following objectives:
> 
>     1. Modify the rsync client and server to support SSL.
> 
>        The modified client will use SSL for encryption of the protocol
>        stream. Existing clients can use an external shell program such as SSH
>        to provide encryption, but this is not portable and it is difficult to
>        manage.
> 
>        An "--ssl" option will be added to the rsync program to enable this
>        feature. This option will be accepted in both client and daemon mode.

Good idea, see the archives for earlier attempts.  If you
can do it without thrashing the codebase we'd love to have
it.

>     2. Write a Windows backup service.
> 
>        This service will be a wrapper for the rsync program. It will read its
>        configuration from the registry, then loop forever attempting to back
>        up configured directories to the server then sleeping a configurable
>        interval.
> 
>        The backup service will be aware of changes to its configuration and
>        adjust its operation appropriately.
> 
>     3. Write a configuration GUI for the Windows backup service.
> 
>        This will be a simple program which maintains the registry settings.
>        It can be used to set the server, authentication information, and
>        select which directories should be backed up to the server.

Windows is not a priority (speaking for myself).  It is a
legacy system.  <troll, troll, troll your boat...>

By the time you have rebuilt rsync for SSL and other stuff
perhaps a windows native port won't seem so extreme.  For
now cygwin is sufficient.

The next generation of rsync will likely be built with
consideration for more easily supporting Windows.
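
The service loop in objective 2 is not much code whatever the
platform.  A rough Python sketch, assuming the configuration
is re-read on every pass so that changes are picked up; the
names backup_loop, load_directories and backup_one are
invented here and are not part of rsync or the proposal:

```python
import time
from typing import Callable, Iterable, Optional


def backup_loop(
    load_directories: Callable[[], Iterable[str]],
    backup_one: Callable[[str], None],
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: Optional[int] = None,
) -> int:
    """Back up every configured directory, sleep, repeat.

    load_directories is called at the start of each cycle, so a changed
    configuration takes effect on the next pass.  Returns the number of
    failed backups (only reachable when max_cycles is given).
    """
    failures = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for directory in load_directories():
            try:
                backup_one(directory)
            except Exception:
                # A real service would log this; here we just count it
                # and carry on with the remaining directories.
                failures += 1
        cycles += 1
        sleep(interval_seconds)
    return failures
```

Passing sleep and max_cycles in as parameters is only there
so the loop can be exercised without waiting; a service would
leave both at their defaults and run until stopped.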

>     4. Make rsync's "storage backend" pluggable.
> 
>        The rsync program currently only supports storing files to the
>        filesystem. The goal of this objective is to prepare rsync for a
>        second storage backend by modularizing the current one. This includes
>        creating a driver-type interface for the backend, isolating all
>        filesystem-access functions and packaging them together as the
>        "filesystem" driver. The driver interface will support file
>        versioning, although the filesystem driver can only behave as though
>        one version is stored and the protocol will not yet support
>        referencing versions other than the most recent.
> 
>        Several rsyncd.conf configuration parameters will be added to specify
>        the storage driver and configuration information specific to the
>        storage driver unique to each rsync module.

Ever hear of VFS?  I have considered pluggability and might
consider it for the next version of rsync (3.x or perhaps 2.6).
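
To make the quoted "driver-type interface" concrete, here is
a rough sketch in Python (for brevity; rsync itself is C).
StorageDriver, FilesystemDriver and their method names are
invented for illustration and do not exist in rsync:

```python
import os
from abc import ABC, abstractmethod
from typing import List, Optional


class StorageDriver(ABC):
    """Hypothetical backend interface that is version-aware."""

    @abstractmethod
    def versions(self, path: str) -> List[int]:
        """Return the version numbers stored for path, oldest first."""

    @abstractmethod
    def read(self, path: str, version: Optional[int] = None) -> bytes:
        """Return file contents; version=None means the latest."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> int:
        """Store data and return the version number it was stored as."""


class FilesystemDriver(StorageDriver):
    """Plain files under a root; behaves as if exactly one version exists."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _full(self, path: str) -> str:
        return os.path.join(self.root, path)

    def versions(self, path: str) -> List[int]:
        return [1] if os.path.isfile(self._full(path)) else []

    def read(self, path: str, version: Optional[int] = None) -> bytes:
        if version not in (None, 1):
            raise LookupError(f"{path}: only version 1 is available")
        with open(self._full(path), "rb") as f:
            return f.read()

    def write(self, path: str, data: bytes) -> int:
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)
        return 1  # overwriting in place: there is never a version 2
```

The filesystem driver always reports a single version, which
matches the proposal's statement that it "can only behave as
though one version is stored".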

>     5. Write a "versioned" storage backend.
> 
>        Cronosys will write a new storage backend which is capable of storing
>        multiple versions of files. This will store file contents in a
>        directory tree indexed by the file's SHA1 hash, size, and a counter
>        (in case of SHA1 collisions). File metadata including versioning
>        information and reference counts for the file contents will be stored
>        in Berkeley DB hashes.

VFS can already provide this.
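
For reference, the layout described in objective 5 (contents
keyed by SHA1 hash, size and a collision counter, with
reference counts kept on the side) might look roughly like
this.  It is a Python sketch under my own assumptions: the
BlobStore name and the directory layout are invented, and a
plain dict stands in for the Berkeley DB hash:

```python
import hashlib
import os
from typing import Dict, Tuple

BlobKey = Tuple[str, int, int]  # (sha1 hex digest, size in bytes, counter)


class BlobStore:
    """Hypothetical content store laid out as root/ab/cdef.../size.counter."""

    def __init__(self, root: str) -> None:
        self.root = root
        # Stand-in for the Berkeley DB hash holding reference counts.
        self.refcounts: Dict[BlobKey, int] = {}

    def _path(self, key: BlobKey) -> str:
        digest, size, counter = key
        return os.path.join(self.root, digest[:2], digest[2:], f"{size}.{counter}")

    def put(self, data: bytes) -> BlobKey:
        """Store data once per distinct content and bump its refcount."""
        digest = hashlib.sha1(data).hexdigest()
        counter = 0
        while True:
            key = (digest, len(data), counter)
            path = self._path(key)
            if not os.path.exists(path):
                os.makedirs(os.path.dirname(path), exist_ok=True)
                with open(path, "wb") as f:
                    f.write(data)
                break
            with open(path, "rb") as f:
                if f.read() == data:
                    break  # identical content is already stored
            counter += 1  # same hash and size, different bytes: a collision
        self.refcounts[key] = self.refcounts.get(key, 0) + 1
        return key

    def get(self, key: BlobKey) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()

    def release(self, key: BlobKey) -> None:
        """Drop one reference; delete the blob when none remain."""
        self.refcounts[key] -= 1
        if self.refcounts[key] == 0:
            del self.refcounts[key]
            os.remove(self._path(key))
```

Two versions of a file with identical contents share one
blob; the file on disk goes away only when the last reference
is released.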

>     6. Modify the rsync protocol to be version-aware.
> 
>        There are several aspects to this:
> 
>          1. Modify data structures to carry version information.
> 
>             Only the "file_struct" structure should contain this information.
> 
>          2. Modify protocol to transmit and receive version number when
>             listing files.
> 
>             All modifications here would be made to flist.c.
> 
>          3. Resolve internal issues with duplicate filename checking.
> 
>             The design of rsync is such that there are bound to be
>             complications. Several areas of the system check for duplicate
>             filenames.
> 
>     7. Add ";version" parsing to command-line.
> 
>        File references which do not have a version specified in them assume
>        the latest version of the file (when retrieving) and one greater than
>        the latest version of the file (when sending). Additionally, an
>        integer file version can be appended to the file name after a
>        semicolon to explicitly specify a version. Note that this semicolon
>        will have to be quoted to be accepted in most shells.
> 
>        Wildcards will be implemented for specifying versions as well. If ";*"
>        is found at the end of a filename, it refers to all versions of that
>        file.
> 
>     8. Add "--with-versions" command-line option.
> 
>        Without this option, only the most recent version of files would be
>        displayed or transferred, except where versions are explicitly
>        provided. With this option, all versions of files would be displayed
>        or transferred, except where versions are explicitly specified or when
>        the target storage driver does not support versions. This can be used
>        to sync entire repositories with all version information.
> 
>     9. Add "--as-of" command-line option.
> 
>        This option will take a timestamp argument. If provided, it changes
>        the behavior of the logic which determines the most recent version of
>        a file so that it will never determine a version created after the "as
>        of" timestamp. This will allow "time travel," so that complete sets of
>        historic files can be checked out.

Forget it.  Version awareness will just bog things down.
They did it on VMS and it has mainly served to be a
full-employment feature for admins.  If you want to use
version-aware filesystems, fine; rsync doesn't need to be
aware of that.  At most, rsync might want to be able to
detect renames or, perhaps through a plugin, be able to
select whether to see all versions with versioning disabled
or just the latest.
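
To show how much logic objectives 7 and 9 imply, here is a
sketch of the ";version" parsing and the "--as-of" lookup as
the proposal describes them.  It is Python written for
illustration only; split_version, resolve_version and
next_version are invented names, and the history mapping
(version number to creation timestamp) is my assumption about
what the metadata would hold:

```python
from typing import Dict, Optional, Tuple, Union

ALL_VERSIONS = "*"  # marker returned for a trailing ";*"


def split_version(arg: str) -> Tuple[str, Union[int, str, None]]:
    """Split 'name;N' or 'name;*' into (name, version).

    A version of None means none was given.  A semicolon that is not
    followed by an integer or '*' is treated as part of the file name.
    """
    name, sep, suffix = arg.rpartition(";")
    if not sep:
        return arg, None
    if suffix == "*":
        return name, ALL_VERSIONS
    if suffix.isascii() and suffix.isdigit():
        return name, int(suffix)
    return arg, None


def resolve_version(history: Dict[int, float],
                    as_of: Optional[float] = None) -> Optional[int]:
    """Return the newest version, ignoring any created after as_of."""
    candidates = [v for v, created in history.items()
                  if as_of is None or created <= as_of]
    return max(candidates) if candidates else None


def next_version(history: Dict[int, float]) -> int:
    """Version to use when sending: one greater than the latest."""
    return (resolve_version(history) or 0) + 1
```

Even this small amount has an ambiguity built in: a file whose
real name ends in ";3" cannot be told apart from version 3 of
a shorter name.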

>    10. Write a restore GUI for Windows.
> 
>        This program will be a graphical wrapper for the rsync command. It
>        will read the backup service's configuration to find the server and
>        get authentication information, then it will display backed-up
>        versions of files in a file-browser type interface. It will allow the
>        user to select files and versions to restore.
> 
>    11. Create Windows installer.
> 
>        Cronosys will create the files required to build a self-installing
>        executable for Windows which contains all of the programs, utilities,
>        and data files required to back up to a server.

If you are going to create a slick shrink-wrap doohickey for
windows, those will be needed.

Forgive me if I have abused and mocked your proposal.  You
have outlined an ambitious project.  Some of the ideas have
merit.  One or two would be nice to see in rsync.  The rsync
team are a few unpaid volunteers.  It sounds to me like you
propose creating a monstrosity out of rsync for the benefit
of one piece of vaporware.  Doing so on the backs of unpaid
volunteers rankles, as would hijacking rsync.

I can see three ways to go that relate to rsync.

1. Wrap rsync, which communicates very nicely over pipes, and
contribute a few general purpose enhancements.

2. Use librsync or the theoretical work in Tridge's thesis
to build something that uses the "rsync algorithm" but is
unrelated to rsync the utility.

3. Coordinate with the rsync team to produce the next
generation of rsync.  This would be a leaner, meaner and
more modular rsync that would be much easier to interface
with.  We have had numerous in-depth design discussions and
there have even been some prototypes built to test ideas.
You might even consider funding some of our work towards that
goal.
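
As an illustration of option 1, a wrapper can drive stock
rsync over a pipe and get dated, hard-linked snapshots from
the existing --link-dest option with no protocol changes at
all.  A minimal Python sketch; rsync_argv and run_rsync are
invented helper names and the paths in the note below are
placeholders:

```python
import subprocess
from typing import List, Optional, Tuple


def rsync_argv(source: str, dest: str,
               link_dest: Optional[str] = None,
               rsync: str = "rsync") -> List[str]:
    """Build an rsync command line for one snapshot run.

    With link_dest set, files unchanged since that earlier snapshot are
    hard-linked into dest instead of being stored again.
    """
    argv = [rsync, "-a", "--delete"]
    if link_dest is not None:
        argv.append(f"--link-dest={link_dest}")
    argv += [source, dest]
    return argv


def run_rsync(argv: List[str]) -> Tuple[int, List[str]]:
    """Run the command, reading its output over a pipe.

    Returns (exit status, output lines); stderr is merged into stdout.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    assert proc.stdout is not None
    lines = [line.rstrip("\n") for line in proc.stdout]
    return proc.wait(), lines
```

A nightly job would then call something like
run_rsync(rsync_argv("/home/", "/backup/2003-10-14/",
link_dest="../2003-10-13")), keeping one directory per day.
A relative --link-dest path is interpreted relative to the
destination directory.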

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt


