Rsync: Re: patch to enable faster mirroring of large filesystems
dwd at bell-labs.com
Wed Nov 28 03:49:11 EST 2001
Thank you very much for doing the test, Alberto. I didn't have any set of
files that large on which I could do a test, and as I said, when I tested
the worst case I could think of with my application I couldn't measure an
First, I want to make sure that you really did get the optimization turned
on. If you were able to transfer files without including all their parent
directories you probably did. If you really did get it turned on, then
there's going to be a big problem with my planned implementation of
--files-from because it was going to be essentially the same as the 2.3.2
optimization. In order to turn on the optimization, there needs to be no
'*', '[', or '?' characters anywhere in the include list. If you can use
-vv, rsync 2.3.2 should print out "(using include-only optimization)". It
may be intolerable to use -vv with such a large dataset though so perhaps
you'll want to modify your 2.3.2 exclude.c to always print that when the
optimization is enabled.
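
As a rough aid for checking that condition before running a test, here is a
minimal Python sketch that scans an include list for the three wildcard
characters named above. It is not rsync's own code; the function names are
hypothetical and it only mirrors the rule as stated in this message.

```python
# Sketch only: mirrors the rule described above (no '*', '[' or '?'
# anywhere in the include list). Not taken from rsync's exclude.c.
WILDCARD_CHARS = set("*[?")


def has_wildcard(pattern: str) -> bool:
    """Return True if the pattern contains any of the wildcard characters."""
    return any(ch in WILDCARD_CHARS for ch in pattern)


def include_only_eligible(patterns) -> bool:
    """Return True if no non-empty include pattern contains a wildcard."""
    cleaned = [p.strip() for p in patterns if p.strip()]
    return not any(has_wildcard(p) for p in cleaned)


def check_list_file(path: str) -> bool:
    """Apply the check to an include list file with one pattern per line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return include_only_eligible(handle)
```

Calling check_list_file("/tmp/bib.list") would report whether that list, as
described in the quoted message, contains any of those characters.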
The optimization is completely implemented on the sending side so you need
to have 2.3.2 there, and I see you're pulling from a server, so it's not
the client version that really makes a difference, it's the server. Did
you set up rsync-2.3.2 on the server side? It would be good to be able to
get some cpu measurements on the sending side too if practical, otherwise
we can only really judge elapsed time.
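
For comparing runs, the shell "time" lines in the quoted message below can be
turned into seconds with a small Python sketch like this one. It assumes the
csh-style layout seen there (user "u", system "s", elapsed, CPU percent); the
helper names are hypothetical, and the figures are the receiving-side numbers
from that message, not sending-side measurements.

```python
# Sketch only: parses csh-style "time" output such as
# "83.88u 364.48s 15:25.50 48.4%" (layout assumed from the quoted message).


def elapsed_seconds(stamp: str) -> float:
    """Convert [[h:]m:]s.ss into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_time_line(line: str) -> dict:
    """Split a time line into user, system and elapsed seconds."""
    user, system, elapsed, _percent = line.split()
    return {
        "user": float(user.rstrip("u")),
        "system": float(system.rstrip("s")),
        "elapsed": elapsed_seconds(elapsed),
    }


full_tree = parse_time_line("83.88u 364.48s 15:25.50 48.4%")
include_list = parse_time_line("3618.25u 33.40s 2:11:56.90 46.1%")
slowdown = include_list["elapsed"] / full_tree["elapsed"]
print(f"elapsed slowdown: {slowdown:.2f}x")
```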
Does your bib-dir.list file contain one wildcard in it? That's what's
needed to turn off the optimization (and as a side effect require the
parent directories). I see your server is publicly available on the
internet. Perhaps you can make your bib.list and bib-dir.list files
available so others can try to reproduce some of your test.
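
For anyone building a list in the with-directories style, a minimal Python
sketch of deriving it from a plain file list follows. The function name is
hypothetical, and the trailing slash on directory entries is an assumption
about how the directory patterns should be written; adjust it to match the
list format actually used.

```python
# Sketch only: expands a list of relative file paths so that every parent
# directory appears (once) before the files beneath it.


def with_parent_dirs(files):
    """Return the paths with each parent directory listed ahead of them."""
    seen = set()
    result = []
    for raw in files:
        path = raw.strip()
        if not path:
            continue
        parts = path.split("/")
        # Emit "a/", "a/b/", ... for every leading directory component.
        for depth in range(1, len(parts)):
            directory = "/".join(parts[:depth]) + "/"
            if directory not in seen:
                seen.add(directory)
                result.append(directory)
        if path not in seen:
            seen.add(path)
            result.append(path)
    return result


if __name__ == "__main__":
    for entry in with_parent_dirs(["a/b/c.txt", "a/d.txt"]):
        print(entry)
```

The output would then be passed with --include-from and followed by
--exclude '*', as in the commands in the quoted message.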
Please say a little more about your hardware configuration. I assume it is
Solaris since you mentioned "trussing". Are these locally attached disks?
Also, could you check the peak virtual address space taken by the rsync
processes on both sides? Perhaps it is taking a lot more memory and
swapping to death.
On Tue, Nov 27, 2001 at 11:07:33AM -0500, Alberto Accomazzi wrote:
> Dear all,
> here's my own (renewed) pitch to throw in a --files-from patch.
> As Dave has suggested in the past, transferring a list of files can be
> accomplished using --include and --exclude, and he has called for people
> to test the performance gains of his old optimization when using these
> options (see his original mail below).
> I've finally decided to bite the bullet and try this out on a real-life
> case, the synchronization of a directory tree containing just over 1 million
> files in 400 directories. Currently the whole directory tree is rsynced
> to our mirror sites, although only a subset of the files (about 720,000)
> are really used in production mode. Therefore having a good and simple
> way to specify the list of files to be synchronized may save time and disk
> In order to do the test, I built the list of files to be transferred
> and then fed it to rsync 2.3.2 using --include followed by an
> --exclude '*', which should trigger the include optimization Dave has
> talked about. Not 100% sure that this was the right way to do it, I also
> created a second list which also contained the list of directories in
> addition to the plain filenames, as explained in
> Either way, the results show that using the include/exclude mechanism
> is highly inefficient: the regular rsync over the whole directory tree
> of 1 million files takes about 15 minutes, while the include/exclude
> solution takes over 2 hours in one case (no directories) and it just
> hangs in the other case. It's true that when using include/exclude
> you have to account for the additional transfer of the file list from
> client to server, but bandwidth is clearly not the bottleneck in this
> case since both machines are on the same gigabit LAN. By trussing the
> processes I noticed that building the local include/exclude structure
> is very slow, but haven't looked into the details. My guess is that
> having to deal with regexps, file matching, and continuous reallocation
> of memory for the include/exclude file structure takes its toll on rsync.
> As far as I can tell the overwhelming amount of time is spent in dealing
> with manipulating the include/exclude lists rather than actually
> performing operations on files.
> Here are the numbers:
> adstree-17: wc /tmp/bib.list /tmp/bib-dir.list
> 722941 722941 13012938 /tmp/bib.list
> 723277 723277 13014618 /tmp/bib-dir.list
> 1446218 1446218 26027556 total
> adstree-18: time rsync-2.3.2 -avvn rsync://adsfore.harvard.edu/text-257/. .
> receiving file list ... done
> wrote 75 bytes read 15233741 bytes 16460.09 bytes/sec
> total size is 947471650 speedup is 62.20
> 83.88u 364.48s 15:25.50 48.4%
> adstree-19: time rsync-2.3.2 -avvn --include-from /tmp/bib.list --exclude '*' rsync://adsfore.harvard.edu/text-257/. .
> receiving file list ... done
> wrote 16627723 bytes read 72 bytes 2100.13 bytes/sec
> total size is 0 speedup is 0.00
> 3618.25u 33.40s 2:11:56.90 46.1%
> adstree-20: time rsync-2.3.2 -avvn --include-from /tmp/bib-dir.list --exclude '*' rsync://adsfore.harvard.edu/text-257/. .
> Mon Nov 26 09:27:58 EST 2001
> receiving file list ... ^C
> 3633.04u 61.41s 23:20:19.32 4.3%
> In message <20011120155555.A327 at lucent.com>, Dave Dykstra writes:
> > On Tue, Nov 20, 2001 at 11:45:44AM +0000, Lachlan Cranswick wrote:
> > >
> > > Is there any chance this can be added into the distribution as it sounds
> > > really nifty.
> > I exchanged some off-list email with the patch author and besides the fact
> > that it adds too many options I object to it because it only supports
> > copying from the local side to remote, not also from remote to local.
> > His option is essentially the same as the --files-from option that was
> > discussed last January. See the thread in the archives beginning at
> > http://lists.samba.org/pipermail/rsync/2001-January/003368.html
> > In summary, he can do pretty much what he wants by making an --include-from
> > list that lists all the parent directories of the files he wants plus all
> > the files he wants and end it with an --exclude '*', but before rsync 2.4.0
> > I had an optimization (which I put in when I officially maintained rsync)
> > that would directly read the included files in that situation rather than
> > recurse through all the directories. The author of rsync Andrew Tridgell
> > took that optimization out in 2.4.0 because he thought it was confusing
> > that the optimization didn't require explicitly listing the parent
> > directories like an --exclude '*' otherwise does, and I couldn't prove that
> > recursing through the directories made a significant performance impact.
> > Later people argued that a new option --files-from would be worth doing
> > just for convenience even if not for performance, but I said I still wanted
> > people to do some performance testing before I'd implement it. I wanted
> > people to run version 2.3.2 on their systems and compare the time
> > difference between running with and without my optimization, which you can
> > force by simply putting in a single wildcard in one included filename.
> > I still want to write a --files-from option sometime, and I'm still waiting
> > for somebody who has an application that could use it to do some
> > performance measurements with rsync 2.3.2. I agree that --files-from has
> > value on its own without performance implications, but somebody has to want
> > it badly enough to put it in a little effort if they'd like me to implement
> > it.
> -- Alberto
> Alberto Accomazzi mailto:aaccomazzi at cfa.harvard.edu
> NASA Astrophysics Data System http://adsabs.harvard.edu
> Harvard-Smithsonian Center for Astrophysics http://cfawww.harvard.edu
> 60 Garden Street, MS 83, Cambridge, MA 02138 USA