Adding support for versioned files in rsync

Jason M. Felice jfelice at cronosys.com
Tue Oct 14 00:57:36 EST 2003


Hi!

Below is a link to a proposal I'm writing for two clients of ours who want an
Internet-based backup solution.  I propose eleven "objectives" in it,
most of which are modifications to rsync.  I'd like to contribute
these changes back where possible, and so I'm posting this here for review.

The nuts and bolts of it is the ability to keep multiple copies of files
(think VAX file system, or maybe CVS) and the ability to restore from a
particular date or version.

Feel free to rip it to shreds.  I know there are a couple of minor
issues with it already.  I've poked around in the source to research
this, but if I'm missing any major sticking-points that someone knows
about off the top of their head, that's something I'd like to know, too.

All feedback is appreciated,
-Jay 'Eraserhead' Felice


                             SafeSync Backup System

                               Project Definition

   Prepared By: Jason M. Felice

   Date: October 11, 2003

                                Project overview

   Inline Technologies, Inc and Gem have approached Cronosys, LLC to discuss
   the implementation of an easy-to-use backup mechanism which would meet the
   following criteria:

    1. Would keep multiple versions of backed-up files and allow several
       different versions to be restored (think of Netware's "salvage"
       command).

    2. Would use as little bandwidth as possible.

    3. Would back up to a central server.

    4. Would run unattended and grab new revisions of files.

   Cronosys researched several possibilities for implementing this system.
   The bandwidth requirements suggest the rsync algorithm, so all
   possibilities investigated would use the rsync algorithm for transferring
   files. The methods investigated were:

     * Writing a new protocol which supported the additional features
       required.

       Some general designs were scratched out, the rsync source code was
       inspected for possible traps, and a general design of a new protocol
       was worked out. Evaluation of it indicated that this option would be
       fairly labor-intensive.

     * Using the rsync algorithm on top of another protocol.

       XML-RPC over HTTP was investigated as a possible transport for the
       system. Several fatal flaws were discovered with this. First, XML's
       language encoding prevents sending binary data directly, and XML-RPC's
       base64 encoding for binary data inflates the size of the data
       unacceptably. Second, storing a file on the server would take two
       requests (one to retrieve the old file information and one to post
       the reconstruction instructions), which is harder to coordinate
       because XML-RPC is essentially stateless. Third, this
       introduces a number of software dependencies which would have to be
       managed (web server, HTTP client library, XML-RPC library, SSL
       library).
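
       As a rough illustration of the base64 overhead mentioned above, the
       following Python sketch measures the encoded size of random binary
       data. The function name is made up for this example, and the figure
       ignores the XML markup and HTTP headers that XML-RPC would add on top.

```python
import base64
import os


def base64_overhead(n_bytes: int) -> float:
    """Return encoded size divided by raw size for n_bytes of random data."""
    raw = os.urandom(n_bytes)
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


# base64 emits 4 output characters for every 3 input bytes.
print(f"{base64_overhead(3_000_000):.3f}")  # prints 1.333
```

       In other words, every transfer would grow by about a third before
       any protocol framing is counted.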

     * Extending the existing rsync with these capabilities.

       This was judged the best method for implementation. There are
       drawbacks, most notably that rsync was not designed to handle multiple
       versions and there will be some difficulties in retrofitting the
       internal structures and the protocol to handle versions.

       On the other hand, this has several benefits:

          * The code base is already very well tested, and any code which is
            integrated back into the project will be very well tested.

          * Contributing back to this project is in line with our corporate
            philosophy.

          * The possibility of collaboration and support from developers
            who require similar features.

                               Project objectives

   The SafeSync Backup System project will meet the following objectives:

    1. Modify the rsync client and server to support SSL.

       The modified client will use SSL for encryption of the protocol
       stream. Existing clients can use an external shell program such as SSH
       to provide encryption, but this is not portable and it is difficult to
       manage.

       An "--ssl" option will be added to the rsync program to enable this
       feature. This option will be accepted in both client and daemon mode.

    2. Write a Windows backup service.

       This service will be a wrapper for the rsync program. It will read its
       configuration from the registry, then loop forever: attempt to back up
       the configured directories to the server, sleep for a configurable
       interval, and repeat.

       The backup service will be aware of changes to its configuration and
       adjust its operation appropriately.

    3. Write a configuration GUI for the Windows backup service.

       This will be a simple program which maintains the registry settings.
       It can be used to set the server, authentication information, and
       select which directories should be backed up to the server.

    4. Make rsync's "storage backend" pluggable.

       The rsync program currently only supports storing files to the
       filesystem. The goal of this objective is to prepare rsync for a
       second storage backend by modularizing the current one. This includes
       creating a driver-type interface for the backend, isolating all
       filesystem-access functions and packaging them together as the
       "filesystem" driver. The driver interface will support file
       versioning, although the filesystem driver can only behave as though
       one version is stored and the protocol will not yet support
       referencing versions other than the most recent.

       Several rsyncd.conf configuration parameters will be added so that
       each rsync module can specify its own storage driver, along with any
       configuration information specific to that driver.

    5. Write a "versioned" storage backend.

       Cronosys will write a new storage backend which is capable of storing
       multiple versions of files. This will store file contents in a
       directory tree indexed by the file's SHA1 hash, size, and a counter
       (in case of SHA1 collisions). File metadata including versioning
       information and reference counts for the file contents will be stored
       in Berkeley DB hashes.
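
       One possible way to derive a storage location from the SHA1 hash,
       size, and counter is sketched below in Python. The two-level
       directory fan-out and the "hash-size-counter" file name are
       assumptions made for this example; the proposal does not fix a
       layout.

```python
import hashlib
from pathlib import PurePosixPath


def content_path(data: bytes, counter: int = 0) -> PurePosixPath:
    """Build a relative path for file contents from SHA1, size, and counter.

    The counter distinguishes different contents that share a SHA1 hash
    and size; the caller is assumed to pick the next free value.
    """
    digest = hashlib.sha1(data).hexdigest()
    # Assumed layout: fan out on the first two byte pairs of the hash so
    # that no single directory grows too large.
    leaf = f"{digest}-{len(data)}-{counter}"
    return PurePosixPath(digest[:2], digest[2:4], leaf)


print(content_path(b"example contents"))
```

       Because the path depends only on the contents, identical files
       backed up from different clients would land in the same place, which
       is where the reference counts come in.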

    6. Modify the rsync protocol to be version-aware.

       There are several aspects to this:

         1. Modify data structures to carry version information.

            Only the "file_struct" structure should contain this information.

         2. Modify protocol to transmit and receive version number when
            listing files.

            All modifications here would be made to flist.c.

         3. Resolve internal issues with duplicate filename checking.

            The design of rsync is such that there are bound to be
            complications. Several areas of the system check for duplicate
            filenames.

    7. Add ";version" parsing to command-line.

       File references which do not have a version specified in them assume
       the latest version of the file (when retrieving) and one greater than
       the latest version of the file (when sending). Additionally, an
       integer file version can be appended to the file name after a
       semicolon to explicitly specify a version. Note that this semicolon
       will have to be quoted to be accepted in most shells.

       Wildcards will be implemented for specifying versions as well. If ";*"
       is found at the end of a filename, it refers to all versions of that
       file.
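
       The parsing rule could look roughly like the following Python
       sketch. The function name and return convention are hypothetical,
       and the handling of a non-numeric suffix (treating the semicolon as
       part of the file name) is an assumption the proposal does not
       address.

```python
from typing import Optional, Tuple, Union

ALL_VERSIONS = "*"  # marker returned for the ";*" wildcard


def split_version(ref: str) -> Tuple[str, Union[int, str, None]]:
    """Split "name;version" into (name, version).

    version is an int, ALL_VERSIONS, or None when no version was given.
    """
    name, sep, suffix = ref.rpartition(";")
    if not sep or not name:
        return ref, None
    if suffix == "*":
        return name, ALL_VERSIONS
    if suffix.isascii() and suffix.isdigit():
        return name, int(suffix)
    # Assumption: any other suffix means the ";" belongs to the file name.
    return ref, None


print(split_version("report.doc;3"))  # prints ('report.doc', 3)
```

       A caller receiving None would then apply the defaults described
       above: the latest version when retrieving, and one greater than the
       latest when sending.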

    8. Add "--with-versions" command-line option.

       Without this option, only the most recent version of files would be
       displayed or transferred, except where versions are explicitly
       provided. With this option, all versions of files would be displayed
       or transferred, except where versions are explicitly specified or when
       the target storage driver does not support versions. This can be used
       to sync entire repositories with all version information.

    9. Add "--as-of" command-line option.

       This option will take a timestamp argument. If provided, it changes
       the behavior of the logic which determines the most recent version of
       a file so that it will never select a version created after the "as
       of" timestamp. This will allow "time travel," so that complete sets of
       historic files can be checked out.
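
       The selection rule can be sketched in Python as follows. The
       function name and the (version, creation time) pairs are assumptions
       for illustration; a version created exactly at the "as of" timestamp
       is treated as eligible here.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def latest_as_of(
    versions: List[Tuple[int, datetime]],
    as_of: Optional[datetime] = None,
) -> Optional[int]:
    """Return the highest version number created at or before as_of.

    With as_of=None every version is eligible. Returns None when no
    version qualifies (the file did not exist yet at that time).
    """
    eligible = (
        number
        for number, created in versions
        if as_of is None or created <= as_of
    )
    return max(eligible, default=None)


history = [
    (1, datetime(2003, 10, 1)),
    (2, datetime(2003, 10, 10)),
    (3, datetime(2003, 10, 14)),
]
print(latest_as_of(history, datetime(2003, 10, 11)))  # prints 2
```

       Applying the same rule to every file in a module yields the
       complete historic set as it stood at that moment.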

   10. Write a restore GUI for Windows.

       This program will be a graphical wrapper for the rsync command. It
       will read the backup service's configuration to find the server and
       get authentication information, then it will display backed-up
       versions of files in a file-browser type interface. It will allow the
       user to select files and versions to restore.

   11. Create Windows installer.

       Cronosys will create the files required to build a self-installing
       executable for Windows which contains all of the programs, utilities,
       and data files required to back up to a server.

-- 
 Jason M. Felice
 Cronosys, LLC <http://www.cronosys.com/>
 216.221.4600 x302


