how fast should rsync be?

Sat Jan 12 08:49:20 EST 2002

> 
> Message: 5
> To: rsync at samba.org
> Subject: how fast should rsync be?
> Date: Fri, 11 Jan 2002 17:14:06 +1100 (EST)
> From: mbp at samba.org (Martin Pool)
> 
> People often report rsync taking several hours to transfer large
> filesystems, and want to know whether there is anything they can do to
> make it work faster.  
> 
> In some cases of course there are straightforward known solutions:
> 
>  - run rsync's own network protocol, rather than using an NFS-mounted
>    filesystem
> 
>  - if the network bandwidth is large, don't use compression so as to
>    reduce CPU overhead
> 
>  - if security considerations permit, use rsyncd rather than rsync/ssh
> 
>  - preserve timestamps with -a, so that unchanged files can be skipped
> 
> Anecdotally, rsync seems to be a little unscalable in terms of time
> and memory for large trees of files.  Since some of the solutions
> might be a bit complicated, it would be nice to work out roughly ahead
> of time how much potential for improvement exists.
> 
> In most of the situations where people are concerned about speed or
> memory usage they are not so much concerned with rsync's delta
> algorithm, but rather copying file trees in which most of the files
> are either already correct or do not exist at all.  
> 
> So we can compare rsync's performance to other simpler programs that
> achieve the same results to obtain comparisons for two boundary cases:
> 
>  (A) all files already exist and have the correct timestamp and size,
>      so no data need be transferred
>  
>  (B) none of the destination files exist, so all the data must be
>      transferred 
> 
> In case (A), rsync needs to traverse both directories recursively and
> stat each file to get the size and mtime on all files.  So this is
> very roughly similar to doing "ls -lR" on *both* source and
> destination directories.  Perhaps to be fair to rsync we ought to pipe
> the results from the two directory listings through diff to allow for
> the fact that at least one of the listings has to be sent from one
> process to another, and that some kind of comparison must be done.
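
For anyone who wants a baseline for case (A) that doesn't depend on
"ls" output formats, here is a rough Python sketch of the same idea:
walk both trees, stat every file, and compare size and mtime.  The
function names are my own, and this is only a stand-in for the
comparison, not what rsync does internally.

```python
import os

def snapshot(root):
    """Map each relative file path under root to (size, mtime in whole seconds)."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # stat only; file contents are never read
            entries[os.path.relpath(full, root)] = (st.st_size, int(st.st_mtime))
    return entries

def needs_transfer(src, dst):
    """Return sorted relative paths that are missing from dst or differ in size/mtime."""
    a, b = snapshot(src), snapshot(dst)
    return sorted(path for path in a if b.get(path) != a[path])
```

In case (A) needs_transfer() should come back empty, so timing it
gives a floor for the tree walk alone.
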
> 
> In case (B), rsync needs to traverse the source directory, read all
> files and metadata, and write them out into the destination.  This is
> similar to running tar on the source piped into tar on the
> destination, with the destination initially empty.
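
The tar-to-tar baseline for case (B) can also be sketched in Python
with the standard tarfile module, streaming through a pipe so the
archive never lands on disk.  copy_tree_via_tar is a name I made up,
and this is a single-machine approximation; a real test would put the
network between the two halves.

```python
import os
import tarfile
import threading

def copy_tree_via_tar(src, dst):
    """Stream src into dst through an in-process pipe, like "tar c | tar x"."""
    r_fd, w_fd = os.pipe()

    def writer():
        # Sending side: read every file and its metadata into a tar stream.
        with os.fdopen(w_fd, "wb") as w, tarfile.open(fileobj=w, mode="w|") as tar:
            tar.add(src, arcname=".")

    t = threading.Thread(target=writer)
    t.start()
    # Use the "data" extraction filter where this Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    # Receiving side: unpack the stream as it arrives.
    with os.fdopen(r_fd, "rb") as r, tarfile.open(fileobj=r, mode="r|") as tar:
        tar.extractall(dst, **kwargs)
    t.join()
```
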
> 
> The most likely reason for rsync to be slower than the other programs
> in both these cases is its approach of reading the entire directory
> into memory at the beginning.  I think this should only really be a
> problem if the machine is too low on memory to hold the whole file
> list.  

Probably so.  We usually trade memory for speed, which works when
memory speed >> disk speed.

>  If that is the case we can either try to hold the file list
> more efficiently (there is some scope for this), or change rsync to
> not keep the file list, or do something else.

I hope you're not considering the rdist approach, which involves
a packet exchange for each file.  This ends up a disaster on long-latency
connections.  Even on a high-bandwidth LAN, rdist is much slower in
the common case where most of a large number of files are unchanged.
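
To put a rough number on the per-file exchange cost: if every file
needs at least one round trip, elapsed time can't be less than the
file count times the round-trip time, however fast the link is.  The
figures below are hypothetical, picked only for the arithmetic; they
are not measurements from my systems.

```python
def per_file_exchange_floor(num_files, round_trip_seconds):
    """Lower bound on elapsed seconds when each file costs one round trip."""
    return num_files * round_trip_seconds

# Hypothetical example: 100,000 files over a link with a 50 ms round trip.
floor_seconds = per_file_exchange_floor(100_000, 0.050)
print(f"at least {floor_seconds / 60:.0f} minutes of pure latency")
```

That works out to about 83 minutes before a single byte of file data
moves.
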

In my application, the comparison is hours vs. minutes.  That is, rsync
mirrors the whole 32 GB in < 1 hr, while rdist took so long (something
in excess of 9 hrs, when the data was ~25 GB) that I mirrored each 4 GB
drive only when a change was noted on the master.

> 
> Hardlink handling is very inefficient at the moment, and I have the
> start of a patch to address that.

I'd appreciate this.  Some of my filesystems are heavily hard-linked,
though that's not the major driver since they have far fewer files.

> 
> In both cases we want to know about CPU and elapsed time, and perhaps
> also VM usage. 
> 
> I'm going to run some tests along these lines on my machine, and am
> interested in seeing other results or comments on why they're not a
> good way to do these measurements.
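
In case it helps anyone collect comparable numbers, here is a small
Python sketch that runs a command and reports elapsed time, CPU time,
and peak resident memory for it.  It is Unix-only (it relies on the
resource module), measure() is my own helper name, and peak RSS is only
a stand-in for VM usage.

```python
import resource
import subprocess
import time

def measure(argv):
    """Run argv to completion and return (elapsed_s, cpu_s, peak_rss)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User plus system CPU time charged to child processes during this run.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    # ru_maxrss is the largest peak among all children waited for so far
    # (kilobytes on Linux, bytes on macOS), so it is not a per-run delta.
    return elapsed, cpu, after.ru_maxrss
```

For example, measure(["rsync", "-a", "src/", "dst/"]) with placeholder
paths; run each command under test from a fresh Python process so the
peak memory figure isn't carried over from an earlier child.
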

Thanks for keeping a valuable tool up-to-date.

Gordon Guthrie
gordon_guthrie at agilent.com
IT specialist