hashproduct+rsync at gmail.com
Tue Oct 2 02:34:43 GMT 2007
On 9/30/07, Stephen Zemlicka <stevezemlicka at gmail.com> wrote:
> OK, let's say this is the first sync and every file is being transferred.
> The checksum for each of the files is cached on the local drive. Then, the
> next time you sync, it checks the checksum from the cache against the file
> to be copied. If it matches, it skips it. If it doesn't match, it
> transfers just the difference. It then replaces the checksum of that
> transferred file in the cache. That way one could have a remote data store
> and not have to run rsync on the remote system. I.e., you could have a mapped
> drive or FTP folder or S3 storage area that would all be rsyncable.
That's a very clever idea, but I'd like to point out two caveats:
(1) You're assuming nobody else modifies the files on the mapped
drive. To remove that assumption, the checksum cache for each remote
file could store the mtime of the revision of the file for which the
cache is valid. Then, a destination file whose checksum cache is
invalid could be identified and updated with a whole-file transfer.
Optionally, you could store the caches on the mapped drive instead of
the client, allowing anyone to push efficiently to the drive.
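As a rough sketch of caveat (1), the cache entry could carry the
destination's mtime next to the checksum, and a mismatch would force a
whole-file transfer. Everything here is an assumption for illustration:
the names plan_transfer and record_transfer, the dict-based cache, and
the use of a single whole-file MD5 (an actual delta would need cached
per-block checksums). None of this is rsync code.

```python
import hashlib
import os


def file_checksum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_transfer(src_path, dst_path, cache):
    """Decide how to push src_path to dst_path on the mapped drive.

    `cache` (hypothetical layout) maps a destination path to
    {"mtime": <st_mtime_ns>, "checksum": <hex digest>}.
    Returns "skip", "delta", or "whole".
    """
    entry = cache.get(dst_path)
    if entry is None or not os.path.exists(dst_path):
        return "whole"  # nothing cached, or nothing there yet
    if os.stat(dst_path).st_mtime_ns != entry["mtime"]:
        return "whole"  # someone else touched the file; cache is invalid
    if file_checksum(src_path) == entry["checksum"]:
        return "skip"   # source is identical to what we last pushed
    return "delta"      # cache is valid, so a same-offset delta is safe


def record_transfer(src_path, dst_path, cache):
    """After a successful push, remember the destination's state."""
    cache[dst_path] = {
        "mtime": os.stat(dst_path).st_mtime_ns,
        "checksum": file_checksum(src_path),
    }
```

Storing the same dict on the mapped drive rather than on the client
would be the "anyone can push" variant mentioned above.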
(2) --inplace must be used. Furthermore, you save bandwidth only when
a block of the destination file matches the source file *at the same
offset*. If the offsets differ, a real delta transfer can just
instruct the receiver to move the data, but in your case the data has
to be written over again at the new offset. Thus, your scheme will
give almost as much benefit as a real delta transfer for a
database-style file that is modified in place, but if a single byte is
inserted or deleted at the beginning of the source file, your scheme
has to rewrite the entire destination file. You could overcome this
by uploading a delta instead of updating the file itself, but that
complicates matters for readers, who then have to pull the file and
If the remote filesystem supports efficient copying of a range of data
from one offset to another, then #2 is moot and a smart client can do
both pushes and pulls efficiently using your scheme and zsync's
"reverse" delta-transfer algorithm, respectively. S3 doesn't appear
to support any kind of range manipulation; perhaps Amazon could be
convinced to add the necessary support.
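To illustrate caveat (2), here is a minimal sketch of same-offset block
matching, which is all the cached-checksum scheme can do without help
from the remote side. The 4-byte block size, the function names, and the
use of MD5 are assumptions chosen to keep the example small; this is not
rsync's rolling-checksum algorithm.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, purely for illustration


def block_checksums(data, block_size=BLOCK_SIZE):
    """Checksum each fixed-size block of `data`, in order."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def blocks_to_rewrite(src, cached_sums, block_size=BLOCK_SIZE):
    """Indices of source blocks that must be written to the destination.

    A block is reusable only if the cached checksum at the *same index*
    matches; there is no search for the block at other offsets.
    """
    src_sums = block_checksums(src, block_size)
    return [
        i for i, s in enumerate(src_sums)
        if i >= len(cached_sums) or s != cached_sums[i]
    ]


old = b"AAAABBBBCCCCDDDD"
cached = block_checksums(old)

# In-place modification: only the touched block is rewritten.
print(blocks_to_rewrite(b"AAAAXBBBCCCCDDDD", cached))   # [1]

# One byte inserted at the front: every block shifts and mismatches.
print(blocks_to_rewrite(b"XAAAABBBBCCCCDDDD", cached))  # [0, 1, 2, 3, 4]
```

The second call shows why a single inserted byte at the beginning costs
a rewrite of the entire destination file under this scheme.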