efficient file appends

Thu Dec 13 11:02:44 EST 2001

On 12 Dec 2001, rsync at ka9q.net wrote:
> >While potentially a useful option, you wouldn't want the protocol to
> >automatically always check for it, since it would preclude rsync on
> 
> This extension need not break any existing mechanism; if the hash of
> the receiver's copy of the file doesn't match the start of the
> sender's file, the protocol would continue as before.

The way the protocol works is that there are three threads of
operation:

  the generator (running on the destination) transmits checksums for
  all files that are present

  the sender (running on the source) sends deltas 

  the receiver (running on the destination) applies the deltas to
  generate the new file
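
To make the three roles concrete, here is a toy sketch in Python. It is
not rsync's code: the function names are made up for illustration,
SHA-256 stands in for rsync's weak rolling checksum plus strong
checksum, and the sender rehashes every window instead of rolling.

```python
import hashlib

def generator(old: bytes, block_size: int) -> dict:
    """Destination: checksum every block of the file it already has."""
    sums = {}
    for off in range(0, len(old), block_size):
        block = old[off:off + block_size]
        # Key on (length, hash) so the short last block is kept distinct.
        sums.setdefault((len(block), hashlib.sha256(block).digest()), off)
    return sums

def sender(new: bytes, sums: dict) -> list:
    """Source: describe its file as copies of old blocks plus literal data."""
    lengths = sorted({length for length, _ in sums}, reverse=True)
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        for length in lengths:
            window = new[pos:pos + length]
            key = (len(window), hashlib.sha256(window).digest())
            if len(window) == length and key in sums:
                if literal:  # flush pending literal bytes first
                    delta.append(("data", bytes(literal)))
                    literal.clear()
                delta.append(("copy", sums[key], length))
                pos += length
                break
        else:
            # No block matches here: send this byte literally.
            literal.append(new[pos])
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta

def receiver(old: bytes, delta: list) -> bytes:
    """Destination: rebuild the new file from the old file and the delta."""
    out = bytearray()
    for token in delta:
        if token[0] == "copy":
            _, off, length = token
            out += old[off:off + length]
        else:
            out += token[1]
    return bytes(out)

old = b"abcdefghij"
new = b"abcdXXefghij"
delta = sender(new, generator(old, 4))
rebuilt = receiver(old, delta)
```

In this example only the two inserted bytes travel as literal data;
everything else is expressed as references to blocks the destination
already holds.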

I think what you're proposing is that the generator should first send
a checksum of the whole file, and wait to see if the sender thinks it
matches the start of its copy of the file.  If not, the generator
should send another set of checksums with a smaller block size.
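
As I understand the proposal, the check would look roughly like the
sketch below. This is only an illustration of the idea, not anything
in rsync today: receiver_summary and sender_reply are hypothetical
names, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
from typing import Optional, Tuple

def receiver_summary(old: bytes) -> Tuple[int, bytes]:
    """Destination: report the length and whole-file hash of its copy."""
    return len(old), hashlib.sha256(old).digest()

def sender_reply(new: bytes, length: int, digest: bytes) -> Optional[bytes]:
    """Source: return the appended tail if the destination's file is a
    prefix of ours, or None to fall back to ordinary block checksums."""
    if length <= len(new) and hashlib.sha256(new[:length]).digest() == digest:
        return new[length:]
    return None

old = b"line 1\nline 2\n"
new = old + b"line 3\n"
tail = sender_reply(new, *receiver_summary(old))
fallback = sender_reply(b"changed" + new, *receiver_summary(old))
```

A None result is the "go around again" case: the generator would have
to be told to send a second set of checksums.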

If you started implementing this, which you're very welcome to do,
then I think you would run into the problem that at the moment the
generator only writes to the network and never reads from it.  So
you'd need a way for the sender to say "those checksums are not good
enough, go around again."  This could probably be done.

Of course this will introduce more roundtrips, which is even worse on
cellular or long links.  We can pipeline across multiple files,
assuming there are multiple files, but it's not clear that it's
justified.

You can also imagine a new checksum format that can be subdivided to
find not only matching blocks, but also matching half blocks.  I think
that's an open research question.

> >Alternatively, even with rsync the way it is today, what I do is
> >manually bump up the blocksize to something large (say 16 or 32K).
> 
> This sounds like an excellent idea, and I'll give it a try. As the
> blocksize reaches the receiver's file size, the scheme essentially
> approaches my idea.

The last block transmitted is a short block, and rsync specially
allows for it to match other short regions inside the destination
file.  So if you set the block size much larger than the file in
question, rsync will search only for appended data.  (How cool!  I
hadn't really thought about it that way before.)

I just tested this, and it looks like it works.

Of course the drawback is that if any regions *have* been deleted or
moved, rsync will retransmit the whole file.  (You asked for it, you
got it.)  But for the common case of appended files or interrupted
transfers it's pretty good.
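
The trade-off can be put in numbers with a toy model.  This is a
simplification and not rsync's matching code: literal_bytes is a
made-up helper, the old file is treated as one single block, and
bytes.find stands in for the checksum search.

```python
def literal_bytes(old: bytes, new: bytes) -> int:
    """Bytes sent as literal data when `old` is offered as one block."""
    if old and new.find(old) >= 0:   # stand-in for the checksum search
        return len(new) - len(old)   # only the data outside the block
    return len(new)                  # no match: the whole file is sent

log = b"entry\n" * 1000
appended = literal_bytes(log, log + b"new entry\n")
edited = literal_bytes(log, b"X" + log[1:] + b"new entry\n")
```

Here the appended file costs 10 bytes of literal data, while changing
a single byte near the start costs the full 6010.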

I might add --block-size=max or --block-size=1M.

You should also check out --partial and -P.

-- 
Martin