proposal to speed rsync with lots of files
jamie at shareable.org
Fri Mar 6 00:58:02 GMT 2009
Peter Salameh wrote:
> proposal is to first send a checksum of the file list for each
> directory. If it is found to be identical to the same checksum on the
> remote side, then the list need not be sent for that directory!
> It might even be possible to use the rsync checksum algorithm on the
> directory lists themselves to determine which portion of the directory
> lists to send, in the case of directories which are nearly identical.
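The per-directory checksum idea can be sketched roughly as follows.
Everything here is hypothetical illustration, not rsync's actual
file-list format: entries are assumed to be (name, size, mtime)
tuples standing in for whichever attributes are being compared.

```python
import hashlib

def dir_list_digest(entries):
    """Hash a directory's file list (names plus compared attributes).

    `entries` is a list of (name, size, mtime) tuples -- a stand-in
    for whatever attributes rsync is actually comparing.  Sorting
    makes the digest independent of readdir() ordering.
    """
    h = hashlib.md5()
    for name, size, mtime in sorted(entries):
        h.update(f"{name}\0{size}\0{mtime}\0".encode())
    return h.digest()

# Sender and receiver each hash their own view of the directory;
# if the digests match, the full list need not be transmitted.
local  = [("a.txt", 10, 100), ("b.txt", 20, 200)]
remote = [("a.txt", 10, 100), ("b.txt", 20, 200)]
skip_transfer = dir_list_digest(local) == dir_list_digest(remote)
```

The cost is one extra digest per directory on each side, repaid
whenever a large directory turns out to be unchanged.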
Yes, these are both sensible improvements to the algorithm. Using the
rsync algorithm on the _whole_ file list (not just per directory)
would likely improve it in many cases. However, unlike with files, you
don't know the list's size in advance, so you'd have to pick a block size.
You might have to change the algorithm a little because you'd want to
buffer the checksummed list for transmitting blocks which don't match,
and you'd want to limit the buffer size.
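A minimal sketch of that block-matching step, under the assumptions
above: the block size is fixed up front, and plain per-block digests
stand in for rsync's rolling weak checksum (so an insertion that
shifts the stream would defeat this toy version, as the comment notes):

```python
import hashlib

BLOCK = 64  # block size must be chosen up front: unlike a file,
            # the list's total length is not known in advance

def block_sums(data, block=BLOCK):
    """Per-block digests of a serialized file list."""
    return [hashlib.md5(data[i:i + block]).digest()
            for i in range(0, len(data), block)]

def unmatched_blocks(local, remote_sums, block=BLOCK):
    """Indices of local blocks whose digest differs from the remote's.

    Only these blocks (held in a bounded buffer) would need sending.
    A real implementation would use rsync's rolling checksum so that
    an inserted entry doesn't shift every later block out of alignment.
    """
    sums = block_sums(local, block)
    return [i for i, s in enumerate(sums)
            if i >= len(remote_sums) or s != remote_sums[i]]

# Hypothetical serialized list of 100 entries; one entry's size changes.
old = b"\0".join(f"file{i:04d} 1234 1700000000".encode() for i in range(100))
new = old.replace(b"file0042 1234", b"file0042 9999")
diff = unmatched_blocks(new, block_sums(old))  # only one block differs
```

With one entry changed out of a hundred, only the block covering that
entry fails to match, which is the saving the proposal is after.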
When transmitting the list, it needs to be limited to just the
attributes being compared, and just the files which pass the filter.
If only a few files have changed in a very large list, then you get
into the interesting problem of "few small changes in very large
stream". There are tweaks to the rsync algorithm which handle that
better than rsync does. One of them "recursive rsync" is mentioned in
a paper on the rsync web page.
I've been working on a variation of the rsync algorithm to
delta-transfer arbitrary tree-like data structures at all scales.
That might be a better fit for this problem, but grafting it into
rsync might be a radical addition, and more than this problem needs.
> I would appreciate hearing from rsync developers if this feasible with
> the current implementation and if they think it would help.
I have no idea if it's feasible in the current implementation without
a lot of work. It's definitely feasible if you're willing to put the
work in :-)