rsync and debian -- summary of issues

Jason Gunthorpe jgg at debian.org
Thu Apr 11 23:42:02 EST 2002


On Thu, 11 Apr 2002, Martin Pool wrote:

> I'd appreciate comments.

Hmm...

As you may know, I'm the APT author as well as the administrator of the
top-level Debian mirrors and the associated mirror network. So,

> 3.2 rsync is too hard on servers
> If it is, then I think we should fix the problems, rather than
> invent a new system from scratch. I think the scalability problems
> are accidents of the current codebase, rather than anything inherent
> in the design.

It's true, I'm afraid. Currently on ftp.d.o:

nobody    8835 25.7  0.3 22120 1740 ?        RN   Apr10 525:24 rsync --daemon
nobody   22896  5.0  0.3 22828 1992 ?        SN   Apr11  21:20 rsync --daemon
nobody    3907  7.3  0.5 22336 2820 ?        RN   Apr11  15:30 rsync --daemon
nobody   10729 13.7  4.0 22308 20904 ?       RN   Apr11  13:10 rsync --daemon

The load average is currently > 7, all due to rsync. I'm not sure what the
one that has sucked up 500 minutes is actually doing, but I've come to accept
that as 'normal'. I expect some client has asked it to recompute every
checksum for the entire 30G of data and it's just burning away processor
power <sigh>.
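
Just to give a feel for the kind of work involved -- a rough Python sketch,
not rsync's actual code (rsync really uses a rolling weak checksum plus MD4;
md5 appears below only because it is always available):

    import hashlib, os

    BLOCK = 700   # rsync picks a block size per file; 700 is only illustrative

    def file_checksums(path):
        """Weak + strong checksum for every block of one file."""
        sums = []
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLOCK)
                if not block:
                    break
                weak = sum(block) & 0xffff            # stand-in for the rolling checksum
                strong = hashlib.md5(block).digest()  # rsync uses MD4 here
                sums.append((weak, strong))
        return sums

    def tree_checksums(root):
        # For a 30G archive this reads and hashes every byte -- which is
        # roughly what a misbehaving client can make the daemon do.
        total = 0
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                total += len(file_checksums(os.path.join(dirpath, name)))
        return total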

We tend to allow only 10-15 simultaneous rsync connections because of this.
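
(That cap is just the usual rsyncd.conf setting -- the module name and path
here are invented, but 'max connections' is the standard parameter:

    [debian]
        comment = Debian archive (illustrative module)
        path = /srv/ftp.debian.org/ftp
        max connections = 15
        lock file = /var/run/rsyncd.lock
)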

Things are better now; in the past, with 2.2 kernels and somewhat slower
disks, rsync would not just suck up CPU power but would seriously hit the
drives as well. I think the improvements in inode/dentry caching in 2.4, and
our new archive structure, are largely responsible for making that less
noticeable.

IMHO, as long as rsync keeps its server-heavy design its ability to scale is
going to be quite poor. Right now there are 91 people connected to ftp/http
on ftp.d.o; if they were all using rsync, I'm sure the poor server would be
quite dead indeed.

> 3.1 Compressed files cannot be differenced

I recall seeing some work done to determine how much saving you could expect
from xdeltas of the uncompressed data. That is the best result you could hope
for from gzip --rsyncable. I recall the numbers were disappointing, << 50% on
average or some such. It would be nice if someone could find that email or
repeat the experiments.
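
As a toy illustration of why the uncompressed data is the interesting case: a
one-byte insertion leaves almost all of the uncompressed file intact, but
changes essentially all of the gzip output from that point on. This is only a
sketch with synthetic data -- real numbers should come from xdelta runs on
actual packages:

    import gzip

    def common_suffix(a, b):
        """How many trailing bytes two byte strings share."""
        n = 0
        while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
            n += 1
        return n

    old = b"Package: foo\nVersion: 1.0\nDescription: example text\n" * 5000
    new = old[:1000] + b"X" + old[1000:]    # one-byte insertion near the start

    print("uncompressed bytes unchanged after the edit:", common_suffix(old, new))
    print("gzip'd bytes unchanged after the edit:      ",
          common_suffix(gzip.compress(old), gzip.compress(new)))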

> 3.5 Goswin Brederlow's proposal to use the reverse rsync algorithm over
> HTTP Range requests

Several years ago I suggested this in a conversation with you on one of the
rsync lists; someone else then pulled a reference out of the IBM patent
database and claimed it was the particular patent that prohibits the
server-friendly reverse implementation.
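
The attraction of the reverse scheme is that the client does all the block
matching against its old local copy and then fetches only the byte ranges it
is missing with plain HTTP/1.1 Range requests, so the server never runs
anything but a static web server. A sketch of the client side (the URL, block
size and signature handling are all invented for illustration):

    import hashlib, urllib.request

    URL = "http://ftp.example.org/debian/pool/foo_2.0.deb"   # hypothetical
    BLOCK = 8192                                             # hypothetical

    def local_block_sums(path):
        """Checksum every block of the *old* local file."""
        sums = {}
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(BLOCK)
                if not block:
                    break
                sums[hashlib.md5(block).digest()] = offset
                offset += len(block)
        return sums

    def fetch_range(url, start, end):
        # Ordinary Range request -- nothing rsync-specific on the server.
        req = urllib.request.Request(
            url, headers={"Range": "bytes=%d-%d" % (start, end)})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    # A real client would first download a small, pre-generated signature
    # file (per-block checksums of the new version), compare it against
    # local_block_sums("foo_1.9.deb"), and call fetch_range() only for the
    # blocks that did not match.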

> 3.7 rsync uses too much memory

This only really seems to be true for tree mirroring, where the file lists
can be very big indeed.
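
To put a rough number on it (the per-entry figure is a guess, not something I
have measured): at ~100 bytes of in-memory file-list state per entry, a
client mirroring a tree of 500,000 files costs the server on the order of
100 * 500,000 = ~50MB for that single connection before any data moves.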

Jason
