Rsyncing really large files

Sat Mar 5 17:18:56 GMT 2005

On Thu, Mar 03, 2005 at 10:18:01AM +0200, Shachar Shemesh wrote:
> And I'm suggesting making it static, by adjusting the hash table's
> size according to the number of blocks.

The block-size?  No thanks.  The smaller the block-size for a particular
file size, the more checksum data needs to be generated and sent, and
that would mean that the size would bloat too much for larger files if
it were constant.  It is better to tune it on a per-file basis.

Improving the hash-searching for checksums is certainly possible, and
may well reap some nice speed benefits.  For instance, I noticed the
other day in a project called zsync (see below for a quick summary)
that the developer, Colin Phipps, decided to create a negative hash
table with a higher bit resolution.  Since it only needs to note a
found/not-found state, the table can be a single bit per node, and a
19-bit lookup only needs 64k of memory.  This allows a rapid yes/no
pre-check for the weak value before we look-up the actual strong
checksum value in the hash table and should result in less searching
for values that aren't there.  It might be interesting to run some
tests and see how often that is a net win.  Obviously, if a file has
changed a lot, it will have a lot of "not there" sums, but a file that
is nearly identical will have very few non-matches, so it is hard to
guess how well that performs with typical file sets (zsync is primarily
concerned with larger files, such as iso files, so it's probably a good
choice for that).

I haven't yet looked at the zsync code, but the project looks to be
interesting.  It choose to reverse the flow of checksum data, allowing
it to be pre-computed for files (an idea that has been bandied about
before, but I haven't heard of any prior implementations).  This lets
someone use HTTP to serve up both the checksum data and the file itself,
lightening the CPU load on the server (since it doesn't need to do any
of the checksum computations), but increasing the amount of data that
needs to flow from the server to the client compared with a normal rsync
algorithm.  I found it here:  http://zsync.moria.org.uk/

..wayne..