proposal to speed rsync with lots of files

Fri Mar 6 00:48:46 GMT 2009

Kyle Lanclos wrote:
> Peter Salameh wrote:
> > One of the speed-limiting issues with rsync is having to send huge file 
> > lists when mirroring large file systems, even for incremental updates 
> > where only a small part of the file system might have changed.
> 
> Personally, I find that the sending of the file list, whether incremental
> or otherwise, takes orders of magnitude less time than the construction of
> the file list in the first place. The act of stat'ing millions of files
> takes an enormous amount of time in comparison to just about anything else,
> assuming that you are not on a low-bandwidth link.

I find both take time, and the dominant one depends on the link, and
whether that stat information is already in RAM.

Usually I only transfer very large file lists over a LAN, though, so
it's more like Kyle's situation where the file stat'ing takes longest.

The only realistic way to eliminate stat time is some kind of
filesystem monitoring and attribute index - similar to the methods
used by the dynamic indexes of local search engine style programs.
(On Linux that means using inotify, and a daemon which runs all the
time.  On Windows there are (perhaps) better methods which can survive
reboots.)

Without that, you can reduce the stat time by scanning the filesystem
in a different way.  I wrote a program many years ago called
"treescan" which did a recursive directory traversal while sorting
stat calls by inode number from the directory's d_ino.  On many
filesystems, the inode number is approximately related to position on
the disk.  On those where it was, the heuristic sped up whole
filesystem scans by a factor of about 2, and on some directory
structures by a factor of about 100.  It's possible some parallel stat
calls would improve this further on some OSes and kernel versions, by
allowing better head seek optimisation at the kernel level.  But other
OSes or kernel versions would be slowed by it.
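
For anyone who wants to try the idea, here is a rough sketch of the
inode-ordered scan in Python.  It is not the original treescan code;
the function name scan_by_inode is made up for illustration, and it
assumes a POSIX system where os.scandir() exposes d_ino through
DirEntry.inode() without an extra stat call.  Whether it helps at all
depends on the filesystem, as described above.

```python
import os
import stat


def scan_by_inode(root):
    """Yield (path, stat_result) for everything below root.

    Hypothetical helper: within each directory, the lstat calls are
    issued in ascending inode order instead of readdir order.
    """
    pending = [root]
    while pending:
        directory = pending.pop()
        try:
            with os.scandir(directory) as it:
                # On POSIX, inode() is read from the directory entry
                # (d_ino), so sorting costs no stat calls.
                entries = sorted(it, key=lambda e: e.inode())
        except OSError:
            continue  # unreadable or vanished directory: skip it
        for entry in entries:
            try:
                st = os.lstat(entry.path)
            except OSError:
                continue  # entry vanished between readdir and lstat
            yield entry.path, st
            if stat.S_ISDIR(st.st_mode):
                pending.append(entry.path)
```

Usage would be something like "for path, st in scan_by_inode('/srv'):".
Timing it against a plain os.walk() with a cold cache is the only way
to tell whether the ordering pays off on a given filesystem.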

> What would be ideal, I think, is for rsync to scan the filesystem while
> a transfer is in place;

I think rsync 3 does this; it's called incremental recursion.

> with a configurable quantity of file transfer threads, combined with
> a configurable quantity of filesystem "spider" threads, would result
> in the most optimal interleaving of disk latency and time required
> to transfer files.

Be careful when using multiple unsynchronised threads to access a
filesystem.  It sometimes thrashes the disk - seeking back and forth
between different files - resulting in much worse latency than just
doing one file at a time.  That said, it can work out better; you
just have to be careful how it's done.
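
One cautious way to experiment is to make the thread count a
parameter and measure, rather than assume more threads is better.
The sketch below is only an illustration in Python (stat_all and its
workers argument are hypothetical names, nothing to do with rsync's
internals): it lstat's a list of paths through a bounded thread pool,
so the same run can be timed with 1 worker and with several.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def stat_all(paths, workers=1):
    """lstat every path using at most `workers` threads.

    Returns a dict mapping each path to its stat result, or to None
    if the path could not be stat'ed.
    """
    paths = list(paths)  # iterated twice below

    def safe_lstat(path):
        try:
            return os.lstat(path)
        except OSError:
            return None  # missing or unreadable path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs up correctly.
        return dict(zip(paths, pool.map(safe_lstat, paths)))
```

With workers=1 this behaves like a plain sequential loop.  Comparing
that against, say, workers=4 on a cold cache shows whether a given
disk and kernel benefit from the extra requests or just thrash.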

-- Jamie