Any (known) scaling issues?
jw at pegasys.ws
Sat Jan 17 00:25:52 GMT 2004
On Fri, Jan 16, 2004 at 03:55:42PM -0800, Bret Foreman wrote:
> I'm considering using rsync in our data center but I'm worried about whether
> it will scale to the numbers and sizes we deal with. We would be moving up
> to a terabyte in a typical sync, consisting of about a million files. Our
> data mover machines are RedHat Linux Advanced Server 2.1 and all the sources
> and destinations are NFS mounts. The data is stored on big NFS file servers.
> The destination will typically be empty and rsync will have to copy
> everything. However, the copy operation takes many hours and often gets
> interrupted by an outage. In that case, the operator should be able to
> restart the process and it resumes where it left off.
> The current, less than desirable, method uses tar. In the event of an
> outage, everything needs to be copied again. I'm hoping rsync could avoid
> this and pick up where it left off.
> There are really two scaling problems here:
> 1) Number and size of files - What are the theoretical limits in rsync? What
> are the demonstrated maxima?
> 2) Performance - The current tar-based method breaks the mount points down
> into (a few dozen) subdirectories and runs multiple tar processes. This does
> a much better job of keeping the GigE pipes full than a single process and
> allows the load to be spread over the 4 CPUs in the Linux box. Is there a
> better way to do this with rsync or would we do the same thing, generate one
> rsync call for each subdirectory? A major drawback of the subdirectory
> approach is that tuning to find the optimum number of copy processes is
> almost impossible. Is anyone looking at multithreading rsync to copy many
> files at once and get more CPU utilization from a multi-CPU machine? We're
> moving about 10 terabytes a week (and rising) so whatever we use has to keep
> those GigE pipes full.
The numbers you cite should be no problem for rsync.
However, the scenario is one where rsync has no real
advantage and several disadvantages. You are copying, not
syncing, so rsync will be slower. Your network is faster
than the disks, and rsync is designed for disks several times
faster than the network. Rsync is even worse over NFS, and
you are doing NFS to NFS copies. All in all, I wouldn't use
rsync. My inclination would be to use cpio -p with no -u.
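
To illustrate why leaving out -u helps with restarts: as far as I
understand it, cpio in pass-through mode without -u will not replace a
destination file that is as new as or newer than the source, so a rerun
after an outage skips what has already been copied. The Python sketch
below mimics that idea only; it is not cpio and not how cpio is
implemented. The function name copy_missing, the ".partial" temporary
suffix, and the size-plus-mtime test are all assumptions made for the
example.

```python
import os
import shutil


def copy_missing(src_root, dst_root):
    """Copy files from src_root into dst_root, skipping finished copies.

    Hypothetical helper: a destination file counts as finished when it
    has the same size as the source and is not older than it.
    Returns the number of files copied during this pass.
    """
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            if os.path.exists(dst):
                same_size = os.path.getsize(dst) == os.path.getsize(src)
                # Compare whole seconds to tolerate coarse timestamps.
                not_older = int(os.path.getmtime(dst)) >= int(os.path.getmtime(src))
                if same_size and not_older:
                    continue
            # Copy to a temporary name, then rename, so an interrupted
            # copy never leaves a truncated file under the final name.
            tmp = dst + ".partial"  # assumed not to clash with real files
            shutil.copy2(src, tmp)  # copy2 also preserves the mtime
            os.replace(tmp, dst)
            copied += 1
    return copied
```

Running copy_missing a second time over the same trees should copy
nothing, which is the property the operator wants after an outage. The
temporary-name step is a design choice of this sketch: a plain
skip-if-newer rule would otherwise also skip a file that was only half
written when the outage hit.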
The one thing rsync gets you is checksumming, and NFS over
UDP has a measurable data corruption rate. But caches are
likely to defeat rsync's checksums, so a separate checksum
cycle would still be wanted.
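
A separate checksum cycle could look roughly like the Python sketch
below, which hashes every file under the source tree and compares it
with its counterpart under the destination. The names file_digest and
find_mismatches and the choice of SHA-256 are assumptions for
illustration. To keep client-side caches from hiding corruption, such
a pass would presumably need to run well after the copy, or from a
different machine than the one that wrote the data; the code itself
does nothing to bypass caches.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def find_mismatches(src_root, dst_root):
    """List relative paths that are missing or differ at the destination.

    Hypothetical helper: walks src_root only, so extra files that exist
    only under dst_root are not reported.
    """
    bad = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst) or file_digest(src) != file_digest(dst):
                bad.append(rel)
    return sorted(bad)
```

An empty list from find_mismatches means every source file has a
byte-identical copy at the destination, at least as seen through
whatever caches sit between the checking machine and the file servers.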
J.W. Schultz Pegasystems Technologies
email address: jw at pegasys.ws
Remember Cernan and Schmitt