Future RSYNC enhancement/improvement suggestions

Fri Apr 19 10:42:02 EST 2002

Jan Rafaj [rafaj at cedric.vabo.cz] writes:

>   How about adding a feature to keep the checksums in a berkeley-style
>   database somewhere on the HDD separately, and with subsequent
>   mirroring attempts, look to it just for the checksums, so that
>   the rsync does not need to do checksumming of whole target
>   (already mirrored) file tree ?

There's a chicken and egg issue with this - how do you know that the
separately stored checksum accurately reflects the file which it
represents?  Once they are stored separately they can get out of sync.
The natural way to verify the checksum would be to recompute it, but
then you're sort of back to square one.  I know there have been
discussions about this sort of thing on the list in the past.
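
To make the staleness problem concrete, here is a minimal Python sketch
of a checksum cache that trusts a stored digest whenever a file's size
and mtime are unchanged.  Everything in it is an assumption for
illustration: the helper names are made up, SHA-256 stands in for
whatever checksum would really be stored, and the size+mtime test is
just one possible heuristic, not something rsync does with a database.

```python
import hashlib
import os

def file_digest(path):
    """Read the whole file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def cached_digest(path, cache):
    """Return the digest for path, reusing cache[path] when size and
    mtime still match what was recorded (a heuristic, not a proof)."""
    st = os.stat(path)
    key = (st.st_size, st.st_mtime_ns)
    entry = cache.get(path)
    if entry is not None and entry[0] == key:
        return entry[1]  # trusted without re-reading the file
    digest = file_digest(path)
    cache[path] = (key, digest)
    return digest
```

A file rewritten with the same size and a restored mtime slips past
this check, and the only way to catch that is to call file_digest()
again, which is the recomputation the cache was meant to avoid.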

For multiple similar distributions, the rsync+ work (recently
incorporated into the mainline rsync in experimental mode - the
write-batch and read-batch options) helps remove repeated computations
of the checksums and deltas, but it's not a generalized system for any
random transfer.

I've wanted similar benefits because we use dialup to remote locations
and for databases with hundreds of MB or 1-2 GB, we end up wasting a
bit of phone time when both sides are just computing checksums.  But
I'm not sure of a good generalized solution.  There may be platform
specific hacks (e.g., under NT, storing the computed checksum in a
separate stream in the file, so it's guaranteed to be associated with
the file), but I don't know of a portable way to link meta information
with filesystem files.

Note that, if you aren't already, be sure to raise the default block
size for large files.  That can cut down significantly on both the
checksum computation time and the metadata transferred over the
session, since there are fewer blocks that each need two checksums
(weak + MD4).
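
As a rough illustration of how the block size drives the checksum
overhead, the Python sketch below counts blocks and checksum bytes for
a 1 GB file.  The per-block sizes are assumptions for the arithmetic
only (a 4-byte weak checksum plus a full 16-byte MD4 digest), and the
block sizes tried are arbitrary; what actually goes over the wire
depends on the rsync version.

```python
WEAK_SUM_BYTES = 4     # assumed size of the rolling (weak) checksum
STRONG_SUM_BYTES = 16  # assumed size of a full MD4 digest

def checksum_overhead(file_size, block_size):
    """Return (block_count, checksum_bytes) for one file."""
    # Round up: a trailing short block still needs its own checksums.
    blocks = (file_size + block_size - 1) // block_size
    return blocks, blocks * (WEAK_SUM_BYTES + STRONG_SUM_BYTES)

if __name__ == "__main__":
    one_gb = 1024 ** 3
    for block_size in (700, 8192, 65536):
        blocks, meta = checksum_overhead(one_gb, block_size)
        print(f"block size {block_size:>6}: {blocks:>8} blocks, "
              f"{meta / 1024:.0f} KiB of checksums")
```

The block size itself is set with rsync's -B / --block-size option.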

> - make output of error & status messages from rsync uniform,
>   so that it could be easily parsed by scripts (it is not right
>   now - rsync 2.5.5)

I know Martin has expressed some interest to the list in having something
like this in the future as an option.

> - perhaps if the network connection between rsync client and server
>   stalls for some reason, implement something like 'tcp keepalive'
>   feature ?

I think rsync is pretty complicated at the network level already - it
seems reasonable to me that rsync ought to be able to assume that the
lowest level network protocol stack will get the data to the other end
and/or give an error if something goes wrong without needing a lot of
babysitting.

In all but the rsync server cases, rsync doesn't control the network
stream itself anyway (it just has a child process using ssh, rsh or
anything else), so it becomes a question for that particular utility
and not something rsync can do anything about.

In the rsync server case, it already sets the SO_KEEPALIVE option at
the socket level when it receives a connection.
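
For reference, turning on keepalive at the socket level (the
SO_KEEPALIVE option) is a single call.  The sketch below is Python
rather than rsync's own C, and the helper name is made up; it only
shows the option being set and read back on a TCP socket.

```python
import socket

def enable_keepalive(sock):
    """Ask the kernel to probe an otherwise idle TCP connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock

if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    # A non-zero value means the option is on.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

How often probes are sent, and how many failures drop the connection,
is governed by OS settings and not by this option alone.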

If your network transport between systems is problematic, there's a
limited amount of stuff rsync can do about it.  Oh and no, just being
idle on a session shouldn't terminate it, no matter how long rsync
takes to compute checksums.  So if that's happening to you, you might
want to investigate your network connectivity.  Or perhaps you're
going through a NAT or some sort of proxy box that places a timeout on
TCP sessions that you can increase?

Upon failures, if you use --partial and a separate destination
directory you can keep re-trying and slowly get the whole file across
(that's how we do our backups) but you do still need to recompute
checksums each time.  It might be nice to see if rsync itself could
have a retry mechanism that would re-use the existing checksum
information it had computed previously.  Given the structure of the
code at this point, though, I have a feeling that doing so would be
reasonably complicated.
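
The keep-retrying approach can be sketched as a small Python wrapper
around the rsync command line.  This is an assumption about how one
might script it, not our actual backup script: the function name, the
attempt count and the delay are made up, and only the --partial and
--compare-dest flags come from rsync itself.

```python
import subprocess
import time

def rsync_with_retries(src, dest, compare_dest=None, attempts=5,
                       delay=30, run=subprocess.call):
    """Re-run rsync --partial until it exits 0 or attempts run out.

    Returns the attempt number that succeeded, or None on failure.
    The run argument exists so the loop can be exercised without rsync.
    """
    cmd = ["rsync", "--partial"]
    if compare_dest is not None:
        cmd.append(f"--compare-dest={compare_dest}")
    cmd += [src, dest]
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay)  # give the link time to recover
    return None
```

Each pass still recomputes checksums from scratch; the loop only saves
the data already written to the partial file.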

The caveat to --partial is that once you have a partial file, even
with --compare-dest, that partial file is all rsync considers for the
remaining portion of the transfer.  So originally for our database
backups, I was removing any partial copy manually if it was less than
some fraction of the previous copy I already had, since I'd lose less
time rebuilding that fraction than losing access to the entire prior
file.
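
That manual cleanup amounts to a size comparison, roughly like the
Python sketch below.  The function name and the 0.5 default are
placeholders of mine; the right fraction depends on the data.

```python
import os

def discard_small_partial(partial_path, previous_path, min_fraction=0.5):
    """Remove partial_path if it is smaller than min_fraction of the
    previous full copy.  Returns True if the partial was removed."""
    if not os.path.exists(partial_path):
        return False
    partial_size = os.path.getsize(partial_path)
    previous_size = os.path.getsize(previous_path)
    if partial_size < min_fraction * previous_size:
        os.remove(partial_path)
        return True
    return False
```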

In response to that, there was another internal-use patch I made to
rsync to "--partial-pad" any partial file with data from the original
file on the destination system during an error.  No guarantees it
would work as well, since I just took data from the original file past
the size point of the partial copy, but in many cases (growing files)
it's a big win.  If anyone is interested, I could extract it and post
it.
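
The padding idea can be sketched outside rsync in a few lines of
Python.  This is not the patch itself, just an illustration of the
technique under the same assumption: everything in the original file
past the partial copy's size is appended to the partial copy.  The
function name and chunk size are made up.

```python
import os

def pad_partial(partial_path, original_path, chunk=1 << 20):
    """Append to partial_path the bytes of original_path that lie
    beyond the partial file's current size.  Returns bytes appended."""
    offset = os.path.getsize(partial_path)
    appended = 0
    with open(original_path, "rb") as src, open(partial_path, "ab") as dst:
        src.seek(offset)  # past EOF simply yields no data below
        while True:
            data = src.read(chunk)
            if not data:
                break
            dst.write(data)
            appended += len(data)
    return appended
```

If the partial copy is already as long as the original, nothing is
appended.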

-- David

/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: db3l at fitlinxx.com  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/