too many connections on one module?

Thu Apr 27 21:11:41 GMT 2006

On Thu, 2006-04-27 at 12:13 -0700, kimor79 at sonic.net wrote:
> The second pass is the problem. On this pass the clients are syncing about
> 10 files from the previously excluded dir only. For some reason rsync
> sessions take an excessively long time to complete (hours as opposed to
> minutes). Eventually the connections pile up and load on the server has
> gotten as high as 900!

This previously excluded directory has a lot of big files, right?  In
rsync's incremental transfer protocol, the sender uses a considerable
amount of CPU time: it scans each file byte by byte, computing a rolling
checksum at every offset and looking for regions that match the block
hashes it got from the receiver.  This usually isn't a problem, but with
1000 clients at a time, I can see why the connections back up.
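
To make the cost of that scan concrete, here is a minimal sketch in
Python.  It is not rsync's code: the function names are made up for
illustration, the weak checksum is a simplified 16-bit rolling sum, and
MD5 stands in for whatever strong hash rsync actually uses.

```python
import hashlib

M = 1 << 16  # modulus for the simplified weak checksum (an assumption)

def weak_sum(block):
    """Weak checksum (a, b) of a block; cheap to roll forward one byte."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def block_sums(data, size):
    """Receiver side: index every full block by its weak checksum."""
    table = {}
    for idx in range(len(data) // size):
        block = data[idx * size:(idx + 1) * size]
        entry = (hashlib.md5(block).digest(), idx)
        table.setdefault(weak_sum(block), []).append(entry)
    return table

def scan(data, size, table):
    """Sender side: slide a window over data looking for known blocks.

    Returns (matches, examined): matches is a list of
    (offset, block_index) pairs, examined is the number of window
    positions the sender had to check.
    """
    matches, examined, pos, n = [], 0, 0, len(data)
    if n < size:
        return matches, examined
    a, b = weak_sum(data[:size])
    while True:
        examined += 1
        hit = None
        for strong, idx in table.get((a, b), ()):
            # Confirm a weak match with the strong hash.
            if hashlib.md5(data[pos:pos + size]).digest() == strong:
                hit = idx
                break
        if hit is not None:
            matches.append((pos, hit))
            pos += size                  # jump past the matched block
            if pos + size > n:
                break
            a, b = weak_sum(data[pos:pos + size])
        else:
            if pos + size >= n:
                break
            out, inn = data[pos], data[pos + size]
            a = (a - out + inn) % M      # roll the window one byte
            b = (b - size * out + a) % M
            pos += 1
    return matches, examined
```

When little or nothing matches, the window moves one byte at a time, so
the sender examines nearly every offset of the file.  Multiply that by
the number of simultaneous clients and the load adds up.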

Having the clients disable incremental transfer on the second pass by
passing --whole-file will remove that CPU load from the server, at the
cost of more network traffic: every file that differs is sent in full.

Notes for rsync developers:

This problem highlights the need for several changes to rsync's
incremental transfer protocol.  I believe the sender should compute and
send the block hashes.  Then the receiver should scan byte-by-byte,
transferring matched data to the temporary file and informing the sender
of what literal data it needs.  Then the sender should send the
necessary literal data.  The cost is a third stage to the pipeline:
"send sums, request blocks, send blocks" instead of "send sums, send
blocks and matching information".  There are several benefits:

(1) Less information about the soon-to-be-overwritten receiver version
of the file leaks from receiver to sender.  The sender only finds out
what parts of its file the receiver needs, not the hash of every block
the receiver currently has.  This is important when the receiver doesn't
trust the sender.

(2) The receiver does the hard work of the byte-by-byte scan.  This
distributes load when there is one sender and many receivers.  This case
is much more common than the other way around.

(3) The sender, if it wishes to further reduce its load, could compute
the block hashes of its files once and send them in bulk to receivers.
I suggest storing block hashes in cache files kept alongside the data
files or in a separate hierarchy.  To decide whether a block hash file
can still be trusted, the sender could store the size and mtime of the
original file inside it and apply the quick check: if both still match
the data file, the cached hashes are reused.
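
The cache idea in (3) could look roughly like the following Python
sketch.  Everything here is an assumption made for illustration: the
".sums" suffix, the JSON layout, the 4096-byte default block size and
the use of SHA-256 are not anything rsync does.

```python
import hashlib
import json
import os

def compute_sums(path, block):
    """Hash the file at path one block at a time."""
    sums = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            sums.append(hashlib.sha256(chunk).hexdigest())
    return sums

def cached_sums(path, block=4096, suffix=".sums"):
    """Return (sums, from_cache) for path.

    The cache file (hypothetical name: path + suffix) records the size
    and mtime of the data file.  The cached sums are trusted only if
    both still match, which is the same test as the quick check.
    """
    st = os.stat(path)
    cache_path = path + suffix
    try:
        with open(cache_path) as f:
            c = json.load(f)
        key = (c["size"], c["mtime_ns"], c["block"])
        if key == (st.st_size, st.st_mtime_ns, block):
            return c["sums"], True
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or unreadable cache: fall through and rebuild
    sums = compute_sums(path, block)
    with open(cache_path, "w") as f:
        json.dump({"size": st.st_size, "mtime_ns": st.st_mtime_ns,
                   "block": block, "sums": sums}, f)
    return sums, False
```

The first call for a file pays the hashing cost; later calls only stat
the file and read the cache until the size or mtime changes.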
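
For the developers, here is a rough Python sketch of the proposed
"send sums, request blocks, send blocks" exchange.  It is only an
illustration under loud assumptions: the function names and the tiny
block size are invented, SHA-256 is an arbitrary choice of hash, and
the receiver hashes every window naively where a real implementation
would use a rolling checksum.

```python
import hashlib

BLOCK = 4  # hypothetical block size, tiny so the example is readable

def h(data):
    return hashlib.sha256(data).digest()

def sender_sums(new):
    """Stage 1 (sender): hash each block of the new file."""
    return [h(new[i:i + BLOCK]) for i in range(0, len(new), BLOCK)]

def receiver_request(old, sums):
    """Stage 2 (receiver): scan the old file byte by byte.

    Returns (have, need): have maps block index -> data found locally,
    need lists the block indexes that must be sent as literal data.
    """
    seen = {}
    for off in range(len(old) - BLOCK + 1):
        window = old[off:off + BLOCK]
        seen.setdefault(h(window), window)
    have = {i: seen[s] for i, s in enumerate(sums) if s in seen}
    need = [i for i, s in enumerate(sums) if s not in seen]
    return have, need

def sender_literals(new, need):
    """Stage 3 (sender): send only the blocks the receiver asked for."""
    return {i: new[i * BLOCK:(i + 1) * BLOCK] for i in need}

def receiver_assemble(have, literals, nblocks):
    """Receiver: build the new file from local and literal blocks."""
    return b"".join(have[i] if i in have else literals[i]
                    for i in range(nblocks))

old = b"abcdefghijkl"
new = b"abcdXXXXijklmn"
sums = sender_sums(new)
have, need = receiver_request(old, sums)
literals = sender_literals(new, need)
result = receiver_assemble(have, literals, len(sums))
```

In this sketch the only thing that travels from receiver to sender is
the list "need", and all of the scanning happens in receiver_request,
which is where benefits (1) and (2) come from.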

-- 
Matt McCutchen
hashproduct at verizon.net
http://hashproduct.metaesthetics.net/