Adding support for versioned files in rsync
Jason M. Felice
jfelice at cronosys.com
Tue Oct 14 00:57:36 EST 2003
Hi!
Below is a link to a proposal I'm writing for two clients of ours who want an
Internet-based backup solution. I propose eleven "objectives" in it,
most of which are modifications to rsync. I'd like to contribute
these changes back where possible, and so I'm posting this here for review.
The nuts and bolts of it is the ability to keep multiple copies of files
(think VAX file system, or maybe CVS) and the ability to restore from a
particular date or version.
Feel free to rip it to shreds. I know there are a couple of minor
issues with it already. I've poked around in the source to research
this, but if I'm missing any major sticking-points that someone knows
about off the top of their head, that's something I'd like to know, too.
All feedback is appreciated,
-Jay 'Eraserhead' Felice
SafeSync Backup System
Project Definition
Prepared By: Jason M. Felice
Date: October 11, 2003
Project overview
Inline Technologies, Inc and Gem have approached Cronosys, LLC to discuss
the implementation of an easy-to-use backup mechanism which would meet the
following criteria:
1. Would keep multiple versions of backed-up files and allow several
different versions to be restored (think of Netware's "salvage"
command).
2. Would use as little bandwidth as possible.
3. Would back up to a central server.
4. Would run unattended and grab new revisions of files.
Cronosys researched several possibilities for implementing this system.
The bandwidth requirements suggest the rsync algorithm, so all
possibilities investigated would use the rsync algorithm for transferring
files. The methods investigated were:
* Writing a new protocol which supported the additional features
required.
Some general designs were scratched out, the rsync source code was
inspected for possible traps, and a general design of a new protocol
was worked out. Evaluation of it indicated that this option would be
fairly labor-intensive.
* Using the rsync algorithm on top of another protocol.
XML-RPC over HTTP was investigated as a possible transport for the
system. Several fatal flaws were discovered with this. First, XML's
language encoding prevents sending binary data directly, and XML-RPC's
base64 encoding for binary data inflates the size of the data
unacceptably. Second, file transfers to store to the server would have
to be made in two requests (one to retrieve the old file information
and one to post the reconstruction instructions), which becomes more
complicated since XML-RPC is essentially stateless. Third, this
introduces a number of software dependencies which would have to be
managed (web server, HTTP client library, XML-RPC library, SSL
library).
* Extending the existing rsync with these capabilities.
This was judged the best method for implementation. There are
drawbacks, most notably that rsync was not designed to handle multiple
versions and there will be some difficulties in retrofitting the
internal structures and the protocol to handle versions.
On the other hand, this has several benefits:
* The code base is already very well tested, and any code which is
integrated back into the project will be very well tested.
* Contributing back to this project is in line with our corporate
philosophy.
* It opens the possibility of collaboration and support from developers
who require similar features.
Project objectives
The SafeSync Backup System project will meet the following objectives:
1. Modify the rsync client and server to support SSL.
The modified client will use SSL for encryption of the protocol
stream. Existing clients can use an external shell program such as SSH
to provide encryption, but this is not portable and it is difficult to
manage.
An "--ssl" option will be added to the rsync program to enable this
feature. This option will be accepted in both client and daemon mode.
2. Write a Windows backup service.
This service will be a wrapper for the rsync program. It will read its
configuration from the registry, then loop forever: back up the
configured directories to the server, sleep for a configurable
interval, and repeat.
The backup service will be aware of changes to its configuration and
adjust its operation appropriately.
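A minimal sketch of that loop, in Python for brevity. Everything named here is hypothetical: `load_config` stands in for reading the registry, `backup` stands in for invoking rsync, and the configuration keys are invented for illustration. Re-reading the configuration on every cycle is just one simple way the service could notice changes.

```python
import time
from typing import Callable, Dict, Optional


def run_backup_loop(
    load_config: Callable[[], Dict],
    backup: Callable[[str, str], None],
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: Optional[int] = None,
) -> None:
    """Back up every configured directory, sleep, and repeat.

    max_cycles exists only so the sketch can be exercised in a test;
    a real service would leave it as None and loop forever.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        # Assumption: re-read the configuration each cycle so that
        # edits made by the configuration GUI take effect.
        config = load_config()
        for directory in config["directories"]:
            try:
                backup(directory, config["server"])
            except Exception as exc:
                # One failed directory should not stop the others.
                print(f"backup of {directory} failed: {exc}")
        sleep(config["interval_seconds"])
        cycles += 1
```

Passing the sleep function in as a parameter keeps the loop testable without actually waiting.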
3. Write a configuration GUI for the Windows backup service.
This will be a simple program which maintains the registry settings.
It can be used to set the server, authentication information, and
select which directories should be backed up to the server.
4. Make rsync's "storage backend" pluggable.
The rsync program currently only supports storing files to the
filesystem. The goal of this objective is to prepare rsync for a
second storage backend by modularizing the current one. This includes
creating a driver-type interface for the backend, isolating all
filesystem-access functions and packaging them together as the
"filesystem" driver. The driver interface will support file
versioning, although the filesystem driver can only behave as though
one version is stored and the protocol will not yet support
referencing versions other than the most recent.
Several rsyncd.conf configuration parameters will be added so that each
rsync module can specify its storage driver and any configuration
specific to that driver.
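To make the shape of such a driver interface concrete, here is a rough sketch in Python (the real work would be in rsync's C code). The class and method names are invented for illustration and are not part of rsync. The single-version driver mirrors the constraint above: the interface speaks in versions, but this driver behaves as though exactly one is stored. An in-memory dict replaces real filesystem calls to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class StorageDriver(ABC):
    """Hypothetical driver interface; names are illustrative only."""

    @abstractmethod
    def list_versions(self, path: str) -> List[int]:
        """Return the stored version numbers for path, oldest first."""

    @abstractmethod
    def read(self, path: str, version: int) -> bytes:
        """Return the contents of one version of path."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> int:
        """Store data as a new version of path and return its number."""


class SingleVersionDriver(StorageDriver):
    """Stand-in for the "filesystem" driver: only one version exists."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def list_versions(self, path: str) -> List[int]:
        return [1] if path in self._files else []

    def read(self, path: str, version: int) -> bytes:
        if version != 1 or path not in self._files:
            raise KeyError(f"{path};{version}")
        return self._files[path]

    def write(self, path: str, data: bytes) -> int:
        self._files[path] = data  # overwrite: no history is kept
        return 1
```

A versioned driver would implement the same three methods but return a new, higher number from each `write`.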
5. Write a "versioned" storage backend.
Cronosys will write a new storage backend which is capable of storing
multiple versions of files. This will store file contents in a
directory tree indexed by the file's SHA1 hash, size, and a counter
(in case of SHA1 collisions). File metadata including versioning
information and reference counts for the file contents will be stored
in Berkeley DB hashes.
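One possible reading of the content-addressing scheme, sketched in Python. The exact directory layout is an assumption (a two-character fan-out directory and a "hash-size-counter" file name are not specified in the proposal), and plain dicts stand in for both the on-disk tree and the Berkeley DB reference counts.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Dict


def content_path(data: bytes, counter: int = 0) -> PurePosixPath:
    """Build a storage path from the SHA1 hash, the size, and a counter."""
    digest = hashlib.sha1(data).hexdigest()
    # Assumption: fan out on the first two hex characters so that no
    # single directory grows too large.
    return PurePosixPath(digest[:2]) / f"{digest}-{len(data)}-{counter}"


class ContentIndex:
    """Toy content store with reference counts (hypothetical)."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}    # stands in for files on disk
        self.refcounts: Dict[str, int] = {}  # stands in for a Berkeley DB hash

    def store(self, data: bytes) -> str:
        counter = 0
        while True:
            key = str(content_path(data, counter))
            existing = self.blobs.get(key)
            if existing is None:
                self.blobs[key] = data
                self.refcounts[key] = 1
                return key
            if existing == data:
                # Same content already stored: share it.
                self.refcounts[key] += 1
                return key
            # Same SHA1 and size but different bytes: try the next slot.
            counter += 1
```

Storing identical contents twice yields the same key with a reference count of 2, which is what lets unchanged files share storage across versions.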
6. Modify the rsync protocol to be version-aware.
There are several aspects to this:
1. Modify data structures to carry version information.
Only the "file_struct" structure should contain this information.
2. Modify protocol to transmit and receive version number when
listing files.
All modifications here would be made to flist.c.
3. Resolve internal issues with duplicate filename checking.
The design of rsync is such that there are bound to be
complications. Several areas of the system check for duplicate
filenames.
7. Add ";version" parsing to command-line.
File references which do not have a version specified in them assume
the latest version of the file (when retrieving) and one greater than
the latest version of the file (when sending). Additionally, an
integer file version can be appended to the file name after a
semicolon to explicitly specify a version. Note that this semicolon
will have to be quoted to be accepted in most shells.
Wildcards will be implemented for specifying versions as well. If ";*"
is found at the end of a filename, it refers to all versions of that
file.
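The parsing rules above could be sketched roughly as follows, in Python for brevity. The function names and the choice to treat a non-numeric suffix as part of the file name are assumptions, not something the proposal settles.

```python
from typing import List, Optional, Tuple, Union

ALL_VERSIONS = "*"


def split_version(ref: str) -> Tuple[str, Union[int, str, None]]:
    """Split "name;N" or "name;*" into (name, version).

    Returns (ref, None) when no explicit version is present.
    """
    name, sep, suffix = ref.rpartition(";")
    if not sep or not name:
        return ref, None
    if suffix == "*":
        return name, ALL_VERSIONS
    if suffix.isascii() and suffix.isdigit():
        return name, int(suffix)
    # Assumption: any other suffix means the semicolon is simply part
    # of the file name.
    return ref, None


def default_version(existing: List[int], sending: bool) -> Optional[int]:
    """Pick the version used when none was given explicitly."""
    latest = max(existing, default=0)
    if sending:
        return latest + 1       # one greater than the latest version
    return latest or None       # latest version, or None if none exist
```

For example, `split_version("report.doc;3")` gives `("report.doc", 3)`, while `default_version([1, 2], sending=True)` gives `3`.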
8. Add "--with-versions" command-line option.
Without this option, only the most recent version of files would be
displayed or transferred, except where versions are explicitly
provided. With this option, all versions of files would be displayed
or transferred, except where versions are explicitly specified or when
the target storage driver does not support versions. This can be used
to sync entire repositories with all version information.
9. Add "--as-of" command-line option.
This option will take a timestamp argument. If provided, it changes
the behavior of the logic which determines the most recent version of
a file so that it will never select a version created after the "as
of" timestamp. This will allow "time travel," so that complete sets of
historic files can be checked out.
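A small sketch of the "as of" selection rule, in Python. It assumes that each version has a recorded creation time and that the highest-numbered eligible version is the most recent one; the function name and the dict layout are invented for illustration.

```python
from datetime import datetime
from typing import Dict, Optional


def latest_version(
    created: Dict[int, datetime],
    as_of: Optional[datetime] = None,
) -> Optional[int]:
    """Return the newest version, ignoring any created after as_of.

    created maps version number to creation time. Returns None when
    no version existed at the requested time.
    """
    candidates = [
        version
        for version, timestamp in created.items()
        if as_of is None or timestamp <= as_of
    ]
    return max(candidates, default=None)
```

With versions created on October 1, 5, and 12, an "as of" timestamp of October 11 would select version 2, and a timestamp before October 1 would select nothing.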
10. Write a restore GUI for Windows.
This program will be a graphical wrapper for the rsync command. It
will read the backup service's configuration to find the server and
get authentication information, then it will display backed-up
versions of files in a file-browser type interface. It will allow the
user to select files and versions to restore.
11. Create Windows installer.
Cronosys will create the files required to build a self-installing
executable for Windows which contains all of the programs, utilities,
and data files required to back up to a server.
--
Jason M. Felice
Cronosys, LLC <http://www.cronosys.com/>
216.221.4600 x302