Speed problem

Donovan Baarda abo at minkirri.apana.org.au
Thu Nov 14 10:45:01 EST 2002


On Wed, Nov 13, 2002 at 02:17:28AM -0800, jw schultz wrote:
> On Wed, Nov 13, 2002 at 10:02:34AM +0100, uwp at dicke-aersche.de wrote:
> > On Tue, 12 Nov 2002, Wayne Davison wrote:
> > 
> > > On Tue, Nov 12, 2002 at 11:30:28PM +0100, uwp at dicke-aersche.de wrote:
> > > > And why it tries to get 100% CPU even though there's nothing to do ?
> > >
> > > What do you mean "nothing to do"?  Rsync is creating the new version of
> > > a changed file which is done both by transferring data over the network
> > > and by copying matching data from the existing version of the file.
> > > Just because nothing is being transferred over the link doesn't mean
> > > nothing is going on.  Or is there some other problem that I missed in
> > > this discussion?
> > 
> > When recovery on the receiving side started, there's almost nothing to
> > do for the sending side, it just has to wait until the partial copy
> > is ready to get data appended. In this time rsync on the sending side eats
> > 95% of CPU time. I would say, that this is not the right behaviour, the
> > rsync process on the sending side should idle in this time.

The sending side is flat out checking its copy of the file against the
signature of the partial copy sent by the receiving end. With huge partial
files, this process gets a large number of false matches from the rolling
checksum that require md4sum verification, increasing the CPU load.
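
To illustrate what a false match is, here is a minimal Python sketch of a
weak rolling checksum in the style rsync uses. The function names are made up
for illustration, the real rsync code differs in detail, and MD5 stands in for
MD4 because Python's hashlib does not always provide MD4. Two different blocks
can share a weak checksum, and only the strong checksum tells them apart:

```python
import hashlib

M = 1 << 16  # each half of the weak checksum is kept to 16 bits


def weak_checksum(block):
    """Return (a, b): a plain byte sum and a position-weighted byte sum."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte without re-reading the whole block."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


def strong_checksum(block):
    # Assumption: MD5 as a stand-in for rsync's MD4.
    return hashlib.md5(block).digest()


# Two different 3-byte blocks with the same weak checksum: a false match.
x = b"\x00\x02\x00"
y = b"\x01\x00\x01"
false_match = weak_checksum(x) == weak_checksum(y)
really_equal = strong_checksum(x) == strong_checksum(y)
```

Every weak hit has to be confirmed with the strong checksum, so the more
blocks there are in the signature, the more often the sender pays for a
strong checksum that turns out not to match.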

> Recovery is the wrong word.  If you specified --partial and
> rsync was interrupted the receiver will delete the original
> file replacing it with the partially transfered file.  There
> is no recognition by rsync that the receiver has a partial
> file, only that the file is out of sync with the sender.

--partial can be less than helpful if rsync is interrupted only 10% of the
way into a transfer... the original gets replaced with the 10% that got
transferred, throwing away the other 90% worth of potential matches.

What I would be most concerned about is why the rsync died in the first
place...

> While the receiver bears the brunt of the CPU work the
> sender is hardly idle.  Aside from generating the initial

The receiver _doesn't_ bear the brunt of the CPU work, the sender does.
Unless something very drastic has been changed in rsync in the last 2
months, the sender has to calculate the delta from a signature sent by the
receiver. Delta calculation is _heaps_ more CPU intensive than signature
calculation.

If I recall correctly, the original poster was experiencing problems with
huge files...

There are known deficiencies in the (current) rsync implementation when
handling huge files. The problem is the signature's 48bit block checksums
are not big enough, and have a very high probability (over 99%) of
corrupting the result for files with over 1G of different data. This is
caught by the final checksum, causing a complete re-transmit of the whole
file.
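
For a feel of where a figure like that can come from, here is a
back-of-envelope estimate in Python. It is a sketch under stated assumptions,
not a derivation taken from the rsync source: the checksum bits are treated as
uniformly random, every byte offset of non-matching data is treated as one
chance to hit any block in the signature, and the 700 byte block size in the
example is an assumed default. The function name is made up.

```python
import math


def p_false_match(nonmatching_bytes, file_bytes, block_size, checksum_bits=48):
    """Rough probability of at least one undetected false block match."""
    blocks = math.ceil(file_bytes / block_size)   # entries in the signature
    comparisons = nonmatching_bytes * blocks      # offsets x candidate blocks
    expected = comparisons / 2 ** checksum_bits   # expected false matches
    return -math.expm1(-expected)                 # 1 - exp(-expected)


GIB = 2 ** 30
# Assumed example: 1G file, all of it different, 700 byte blocks.
p_small_blocks = p_false_match(GIB, GIB, 700)
# Same file with 64K blocks: far fewer signature entries to collide with.
p_large_blocks = p_false_match(GIB, GIB, 64 * 1024)
```

Under this model the small-block case comes out above 99% and the large-block
case well under 10%, which is the effect the block size advice relies on.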

The only way around this (currently) is to ensure that you specify a large
enough block size based on the file size. The relationship between "safe"
blocksize and filesize is non-linear, with the blocksize increasing at a
faster rate than the filesize. I posted a table of recommended block size
vs file size to the list; it is in the archive at:

http://lists.samba.org/pipermail/rsync/2002-October/008557.html

Some changes need to be made to rsync to support larger signature block
checksums.

If the huge files are sufficiently similar, then you are usually fairly safe
because what matters most is the size of non-matching data. To be on the
safe side, use a large enough block size.
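
As a rough illustration of why the safe block size grows faster than the file
size, the sketch below inverts a simple collision model: expected false
matches = non-matching bytes x number of blocks / 2^48. This is an assumed
model with a made-up function name and an arbitrary 0.01 target, and it does
not reproduce the table in the post linked above.

```python
import math


def min_block_size(file_bytes, nonmatching_bytes,
                   max_expected=0.01, checksum_bits=48):
    """Smallest block size keeping expected false matches under the target."""
    # expected = nonmatching_bytes * (file_bytes / block) / 2**bits <= target
    needed = nonmatching_bytes * file_bytes / (max_expected * 2 ** checksum_bits)
    return max(1, math.ceil(needed))


GIB = 2 ** 30
MIB = 2 ** 20
all_new_1g = min_block_size(GIB, GIB)          # everything differs
all_new_2g = min_block_size(2 * GIB, 2 * GIB)  # twice the size
mostly_same = min_block_size(GIB, MIB)         # only 1M of 1G differs
```

In this model doubling a completely changed file roughly quadruples the block
size needed, while a file with little non-matching data gets by with a much
smaller one.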

> filelist it must generate and transmit block checksums for
> each file that the receiver identifies as having changed.
> Any blocks the receiver identifies as having changed the
> sender will have to re-read and send.
> 
> You may want to read the whitepaper.

hmmm... methinks you might have muddled the white-paper yourself :-)

I'm not exactly sure how the sender/receiver negotiate the include/exclude
file lists, but it is most certainly the _receiver_ that calculates and
sends the block checksums, and the _sender_ that performs the rolling
checksum delta calculation to identify any matches, sending a sequence of
match details and missed data to the receiver.
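
The division of work can be sketched in Python as below. This is a toy model
with made-up function names, not rsync's code or wire format: the weak
checksum is a plain 16-bit byte sum, MD5 stands in for MD4, and a trailing
short block is simply left out of the signature.

```python
import hashlib


def receiver_signature(old, block_size):
    """Receiver: one (weak, strong) checksum pair per block of its old copy."""
    sig = {}
    for off in range(0, len(old) - block_size + 1, block_size):
        block = old[off:off + block_size]
        strong = hashlib.md5(block).digest()
        sig.setdefault(sum(block) % 65536, []).append((strong, off))
    return sig


def sender_delta(new, sig, block_size):
    """Sender: roll a window over the new file, emitting copies and data."""
    ops, literal, i, weak = [], bytearray(), 0, None
    while i + block_size <= len(new):
        window = new[i:i + block_size]
        if weak is None:
            weak = sum(window) % 65536
        match = None
        for strong, off in sig.get(weak, ()):
            # A weak hit is only a candidate; confirm with the strong sum.
            if strong == hashlib.md5(window).digest():
                match = off
                break
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal.clear()
            ops.append(("copy", match))
            i += block_size
            weak = None  # recompute for the next window
        else:
            literal.append(new[i])
            if i + block_size < len(new):
                # Roll: drop the outgoing byte, add the incoming one.
                weak = (weak - new[i] + new[i + block_size]) % 65536
            i += 1
    literal.extend(new[i:])
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def receiver_patch(old, ops, block_size):
    """Receiver: rebuild the new file from its old copy plus the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg:arg + block_size] if kind == "copy" else arg
    return bytes(out)
```

The signature pass touches each block of the old file once, while the delta
pass does a lookup at potentially every byte offset of the new file, which is
why the sender ends up with most of the CPU load.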

-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------


