how fast should rsync be?

Martin Pool mbp at samba.org
Fri Jan 11 17:14:06 EST 2002


People often report rsync taking several hours to transfer large
filesystems, and want to know whether there is anything they can do to
make it work faster.

In some cases, of course, there are straightforward known solutions,
sketched as commands after this list:

 - run rsync's own network protocol, rather than using an NFS-mounted
   filesystem

 - if the network bandwidth is large, don't use compression so as to
   reduce CPU overhead

 - if security considerations permit, use rsyncd rather than rsync/ssh

 - preserve timestamps with -a (which implies -t), so unchanged files
   can be skipped by the quick check
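
For concreteness, the commands below sketch what those options look
like.  The host name "remotehost", the paths, and the daemon module
name are all placeholders, not part of any real setup:

    # rsync's own protocol over ssh, instead of copying into an
    # NFS-mounted filesystem; -a preserves timestamps so unchanged
    # files can be skipped:
    rsync -a /src/tree/ remotehost:/dest/tree/

    # Add -z only on slow links; on a fast network it just burns CPU:
    rsync -az /src/tree/ remotehost:/dest/tree/

    # If security permits, talk to an rsync daemon directly (note the
    # double colon), avoiding ssh encryption overhead; this assumes a
    # module named "dest" in the remote rsyncd.conf:
    rsync -a /src/tree/ remotehost::dest/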

Anecdotally, rsync seems to scale poorly in time and memory for large
trees of files.  Since some of the possible solutions might be a bit
complicated, it would be nice to work out roughly, ahead of time, how
much potential for improvement exists.

In most of the situations where people are concerned about speed or
memory usage, they are not so much concerned with rsync's delta
algorithm as with copying file trees in which most of the files are
either already correct or do not exist at all.

So we can compare rsync's performance against simpler programs that
achieve the same results, for two boundary cases:

 (A) all files already exist and have the correct timestamp and size,
     so no data need be transferred
 
 (B) none of the destination files exist, so all the data must be
     transferred 

In case (A), rsync needs to traverse both directories recursively and
stat each file to get its size and mtime.  So this is very roughly
similar to doing "ls -lR" on *both* the source and destination
directories.  Perhaps, to be fair to rsync, we ought to pipe the
results from the two directory listings through diff, to allow for
the fact that at least one of the listings has to be sent from one
process to another, and that some kind of comparison must be done.
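
As a rough sketch, that baseline might be run like this; the directory
names and "remotehost" are placeholders, and the remote listing is
shipped through ssh to account for the cross-process transfer:

    # Baseline for case (A): list both trees, send one listing across
    # the network, and compare.  The diff output itself is
    # uninteresting; the timings are the point.
    time ls -lR /src > /tmp/src.list
    time ssh remotehost ls -lR /dest > /tmp/dest.list
    time diff /tmp/src.list /tmp/dest.list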

In case (B), rsync needs to traverse the source directory, read all
files and metadata, and write them out into the destination.  This is
similar to running tar on the source piped into tar on the
destination, with the destination initially empty.
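
The traditional pipeline for that baseline looks something like this
(directory names again placeholders; -C is GNU tar's change-directory
option):

    # Baseline for case (B): stream the whole tree through tar into an
    # initially empty destination.
    time sh -c 'tar cf - -C /src . | tar xf - -C /dest'

    # Or across the network:
    time sh -c 'tar cf - -C /src . | ssh remotehost tar xf - -C /dest'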

The most likely reason for rsync to be slower than the other programs
in both these cases is its approach of reading the entire directory
into memory at the beginning.  I think this should only really be a
problem if the machine is too low on memory to hold the whole file
list.  If that is the case we can either try to hold the file list
more efficiently (there is some scope for this), or change rsync to
not keep the file list, or do something else.

Hardlink handling is very inefficient at the moment, and I have the
start of a patch to address that.

In both cases we want to know about CPU and elapsed time, and perhaps
also VM usage. 
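
One way to capture those numbers, assuming GNU time is installed as
/usr/bin/time (the shell built-in "time" does not report memory):

    # -v makes GNU time print elapsed time, user and system CPU time,
    # and the maximum resident set size, a rough proxy for VM usage:
    /usr/bin/time -v rsync -a /src/ /dest/
    /usr/bin/time -v sh -c 'tar cf - -C /src . | tar xf - -C /dest'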

I'm going to run some tests along these lines on my machine, and am
interested in seeing other results, or comments on why this might not
be a good way to take these measurements.

-- 
Martin
