Any (known) scaling issues?

Bret Foreman bret.foreman at oracle.com
Fri Jan 16 23:55:42 GMT 2004


I'm considering using rsync in our data center but I'm worried about whether
it will scale to the numbers and sizes we deal with. We would be moving up
to a terabyte in a typical sync, consisting of about a million files. Our
data mover machines run RedHat Linux Advanced Server 2.1, and all the sources
and destinations are NFS mounts. The data is stored on big NFS file servers.
The destination will typically be empty, so rsync will have to copy
everything. However, the copy operation takes many hours and often gets
interrupted by an outage. In that case, the operator should be able to
restart the process and have it resume where it left off.

The current, less-than-desirable method uses tar. In the event of an
outage, everything needs to be copied again. I'm hoping rsync could avoid
this and pick up where it left off.
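For the restart case, what I have in mind on the operator side is just a thin
wrapper that re-runs rsync until it exits cleanly; since "rsync -a" skips files
already present at the destination with matching size and modification time, a
re-run after an outage should effectively resume. The following is only a
sketch (in Python for readability) - the paths, retry count, and delay are
made-up placeholders, not anything we have tested:

#!/usr/bin/env python
# Sketch of a restartable copy: re-run rsync until it exits cleanly.
# SRC, DST, MAX_RETRIES and RETRY_DELAY are placeholder values.
# "rsync -a" skips files already at the destination with matching
# size/mtime, so each retry picks up roughly where the last one stopped.
import subprocess
import sys
import time

SRC = "/nfs/source/vol1/"   # hypothetical source mount (trailing slash: copy contents)
DST = "/nfs/dest/vol1/"     # hypothetical destination mount
MAX_RETRIES = 20
RETRY_DELAY = 60            # seconds to wait after a failure (e.g. an outage)

for attempt in range(1, MAX_RETRIES + 1):
    # -a = archive mode: recursive, preserves permissions, times, links, etc.
    rc = subprocess.call(["rsync", "-a", SRC, DST])
    if rc == 0:
        print("copy completed on attempt %d" % attempt)
        sys.exit(0)
    print("rsync exited with code %d, retrying in %d seconds" % (rc, RETRY_DELAY))
    time.sleep(RETRY_DELAY)

sys.exit("giving up after %d attempts" % MAX_RETRIES)

Does that look like a sane way to use rsync for this, or is there a better
mechanism for resuming a huge initial copy?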
There are really two scaling problems here:
1) Number and size of files - What are the theoretical limits in rsync? What
are the demonstrated maxima?
2) Performance - The current tar-based method breaks the mount points down
into (a few dozen) subdirectories and runs multiple tar processes. This does
a much better job of keeping the GigE pipes full than a single process would,
and it lets the load spread across the 4 CPUs in the Linux box. Is there a
better way to do this with rsync, or would we do the same thing and generate
one rsync call for each subdirectory? A major drawback of the subdirectory
approach is that tuning to find the optimum number of copy processes is
almost impossible. Is anyone looking at multithreading rsync to copy many
files at once and get more CPU utilization out of a multi-CPU machine? We're
moving about 10 terabytes a week (and rising), so whatever we use has to keep
those GigE pipes full.
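For reference, the per-subdirectory fan-out I would expect to try with rsync
looks roughly like the sketch below: one rsync process per top-level
subdirectory, run through a bounded worker pool. The source and destination
paths and the worker count are made-up knobs, and this is only meant to show
the shape of the approach, not a tested setup:

#!/usr/bin/env python
# Sketch: one rsync per top-level subdirectory, through a bounded pool.
# SRC_ROOT, DST_ROOT and NUM_WORKERS are hypothetical tuning knobs.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

SRC_ROOT = "/nfs/source/vol1"   # hypothetical NFS source mount
DST_ROOT = "/nfs/dest/vol1"     # hypothetical destination mount
NUM_WORKERS = 8                 # how many rsync processes to run at once

def sync_subdir(name):
    src = os.path.join(SRC_ROOT, name) + "/"   # trailing slash: copy contents
    dst = os.path.join(DST_ROOT, name) + "/"
    # The threads only wait on child processes; the real parallelism is
    # the rsync processes themselves, so a thread pool is sufficient.
    rc = subprocess.call(["rsync", "-a", src, dst])
    return name, rc

# Note: files sitting directly under SRC_ROOT would need a separate pass.
subdirs = [d for d in os.listdir(SRC_ROOT)
           if os.path.isdir(os.path.join(SRC_ROOT, d))]

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    for name, rc in pool.map(sync_subdir, subdirs):
        print("%-40s %s" % (name, "ok" if rc == 0 else "FAILED (rc=%d)" % rc))

The catch is still the one mentioned above: NUM_WORKERS is a guess, and we
have no good way to tune it short of trial and error against the GigE links
and the NFS servers.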

Thanks,
Bret


