How to make big MySQL database more diffable/rsyncable? (aka rsyncing big files)

Ryan Malayter malayter at gmail.com
Tue Jul 14 20:44:27 MDT 2009


On Tue, Jul 14, 2009 at 5:17 PM, Carlos Carvalho<carlos at fisica.ufpr.br> wrote:
> Hash calculation is very fast; rsync has a negligible cpu consumption.

Hash calculation on the receiver is usually disk-bound, but rsync has
massive CPU consumption in certain cases. When using -z on a fast
network, I have seen rsync become CPU-bound on a 100 Mbps WAN using a
3 GHz Xeon 5400-series. Even without -z, simply looking for hash
matches (and calculating the strong checksums for weak matches) can be
very CPU-intensive on the sending side. That is the whole point,
really: rsync trades CPU for network bytes.
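
To make the sender-side cost concrete, here is a minimal Python sketch
of the matching loop: a rolling weak checksum is slid over the new data
one byte at a time, and a strong checksum is computed only when the
weak one hits. It is a simplification, not rsync's actual code. The
function names, the use of MD5 as the strong hash, and the choice to
always advance by one byte (rather than skipping ahead after a match)
are assumptions made for illustration.

```python
import hashlib

M = 1 << 16  # both halves of the weak checksum are kept modulo 2^16


def weak_checksum(block: bytes):
    """Compute an rsync-style weak checksum (a, b) from scratch."""
    n = len(block)
    a = sum(block) % M
    # First byte gets weight n, last byte gets weight 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte forward in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_matches(old: bytes, new: bytes, n: int):
    """Return ([(new_offset, old_offset), ...], strong_checksum_count)."""
    # Index the old data by weak checksum, one entry per aligned block.
    table = {}
    for off in range(0, len(old) - n + 1, n):
        blk = old[off:off + n]
        table.setdefault(weak_checksum(blk), []).append(
            (hashlib.md5(blk).digest(), off))

    matches, strong_calls = [], 0
    if len(new) < n:
        return matches, strong_calls

    a, b = weak_checksum(new[:n])
    pos = 0
    while True:
        candidates = table.get((a, b))
        if candidates:
            # Weak hit: only now pay for the expensive strong checksum.
            strong_calls += 1
            digest = hashlib.md5(new[pos:pos + n]).digest()
            for strong, off in candidates:
                if strong == digest:
                    matches.append((pos, off))
                    break
        if pos + n >= len(new):
            break
        a, b = roll(a, b, new[pos], new[pos + n], n)
        pos += 1
    return matches, strong_calls
```

Even in this toy form, the sender does a table lookup at every byte
offset of the new file, which is where the CPU goes when the disk and
network are fast enough to keep up.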

> What limits it is reading the disk. If you run a hash check you'll see
> the process stalled in io and not cpu. Maybe your machine has a
> particularly different IO/cpu ratio?
> This, and the fact that the maintainer(s?) want to keep rsync stateless,
> makes me think that a change to remember hashes is unlikely.

Yes, in this case the receiver *is* waiting on disk initially. The
fact that the I/O is completely pointless and takes 20-40 minutes of
wall-clock time is my issue. Why re-read 50 GB and re-calculate hashes
for it when the sender did it yesterday?

Storing a cache of hashes that are only used when a file is unchanged
would *not* change the user's perception of rsync as "stateless"; it
would simply be an optimization. Rsync already uses temporary files for
partial data.
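
A rough Python sketch of what such a cache could look like follows. It
is purely hypothetical: rsync has no such option, and the function
names, the JSON sidecar file, and the MD5 block hashes are all
assumptions chosen for illustration. The cached hashes are reused only
when the file's size and mtime match what was recorded, which is the
same kind of "unchanged" test rsync's default quick check relies on.

```python
import hashlib
import json
import os


def block_hashes(path, block_size):
    """Read the whole file and hash it block by block (the slow path)."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            blk = f.read(block_size)
            if not blk:
                break
            hashes.append(hashlib.md5(blk).hexdigest())
    return hashes


def cached_block_hashes(path, cache_path, block_size=4096):
    """Return (hashes, cache_hit) for path, using a hypothetical sidecar cache."""
    st = os.stat(path)
    key = {"size": st.st_size, "mtime_ns": st.st_mtime_ns,
           "block_size": block_size}

    # Fast path: the cache exists and was written for this exact file state.
    try:
        with open(cache_path) as f:
            entry = json.load(f)
        if isinstance(entry, dict) and entry.get("key") == key:
            return entry["hashes"], True
    except (OSError, ValueError):
        pass  # missing or corrupt cache: fall through and rebuild it

    # Slow path: re-read the file, then replace the cache atomically.
    hashes = block_hashes(path, block_size)
    tmp = cache_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"key": key, "hashes": hashes}, f)
    os.replace(tmp, cache_path)
    return hashes, False
```

If the cache is missing, stale, or unreadable, the function falls back
to reading the file, so a transfer would behave exactly as it does
today, only slower.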

That said, it's open source: I should just drop a patch bomb. My fear
is that it would take me forever and be rejected, as I haven't coded
in C since 1996. My employer can't sponsor development, so I figured I
would just rant in this forum.

-- 
RPM

