Rsync: Re: patch to enable faster mirroring of large filesystems

Alberto Accomazzi aaccomazzi at cfa.harvard.edu
Wed Nov 28 03:07:33 EST 2001


Dear all,

here's my own (renewed) pitch to throw in a --files-from patch.
Dave has suggested in the past that transferring a list of files can be
accomplished using --include and --exclude, and he has called for people
to test the performance gains of his old optimization when using these
options (see his original mail below).

I've finally decided to bite the bullet and try this out on a real-life
case, the synchronization of a directory tree containing just over 1 million
files in 400 directories.  Currently the whole directory tree is rsynced
to our mirror sites, although only a subset of the files (about 720,000)
is really used in production mode.  Therefore having a good and simple
way to specify the list of files to be synchronized may save time and disk
space.

In order to do the test, I built the list of files to be transferred
and then fed it to rsync 2.3.2 using --include followed by an 
--exclude '*', which should trigger the include optimization Dave has
talked about.  Not being 100% sure that this was the right way to do it,
I also created a second list containing the directories in addition to
the plain filenames, as explained in
http://lists.samba.org/pipermail/rsync/2001-January/003372.html
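
A list of the second kind (files plus all of their parent directories)
could be generated along the following lines.  This is only a minimal
Python sketch: the function name with_parent_dirs and the sample paths are
hypothetical, and it assumes relative paths with "/" separators, one entry
per file, as in /tmp/bib.list.  It does not attempt to reproduce the exact
pattern syntax rsync expects.

```python
import posixpath


def with_parent_dirs(files):
    """Return the entries of `files`, each preceded by any parent
    directories that have not been emitted yet."""
    seen = set()
    out = []
    for path in files:
        # Collect unseen ancestors, from the nearest parent upwards.
        parents = []
        parent = posixpath.dirname(path)
        while parent and parent not in seen:
            seen.add(parent)
            parents.append(parent)
            parent = posixpath.dirname(parent)
        # Emit the outermost directory first so a parent precedes its children.
        out.extend(reversed(parents))
        out.append(path)
    return out


if __name__ == "__main__":
    sample = ["A/a1.txt", "A/sub/a2.txt", "B/b1.txt"]  # hypothetical paths
    for entry in with_parent_dirs(sample):
        print(entry)
```

Each directory is listed once, ahead of the first file that needs it, so
the output is only slightly longer than the input when many files share
few directories.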

Either way, the results show that using the include/exclude mechanism
is highly inefficient: the regular rsync over the whole directory tree
of 1 million files takes about 15 minutes, while the include/exclude
solution takes over 2 hours in one case (no directories) and appears to
hang in the other (interrupted after 23 hours).  It's true that when
using include/exclude
you have to account for the additional transfer of the file list from 
client to server, but bandwidth is clearly not the bottleneck in this
case since both machines are on the same gigabit LAN.  By trussing the
processes I noticed that building the local include/exclude structure
is very slow, but I haven't looked into the details.  My guess is that
having to deal with regexps, file matching, and continuous reallocation 
of memory for the include/exclude file structure takes its toll on rsync.
As far as I can tell, the overwhelming majority of the time is spent
manipulating the include/exclude lists rather than actually
performing operations on files.
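
To give a feel for how a very long include list could become expensive,
here is a small Python sketch.  It rests on an assumption that has not been
checked against the rsync source: that each filename is tested against the
patterns in order until one matches.  The function names and the generated
filenames are hypothetical, and this is not rsync's code.

```python
import fnmatch


def comparisons_first_match(names, patterns):
    """Count pattern tests when each name is checked against `patterns`
    in order, stopping at the first match."""
    total = 0
    for name in names:
        tested = 0
        for pattern in patterns:
            tested += 1
            if fnmatch.fnmatchcase(name, pattern):
                break
        total += tested
    return total


def selected_by_set(names, wanted):
    """For purely literal lists, a hash set needs one lookup per name."""
    wanted = set(wanted)
    return [name for name in names if name in wanted]


if __name__ == "__main__":
    n = 1000  # small stand-in for the ~720,000 entries in the real list
    files = [f"dir{i % 10}/file{i:04d}" for i in range(n)]
    print(comparisons_first_match(files, files))  # n * (n + 1) / 2 tests
    print(len(selected_by_set(files, files)))     # n lookups
```

Under that assumption the number of pattern tests grows quadratically with
the list length, while a lookup keyed on literal filenames grows linearly.
This only illustrates how a per-file linear scan would scale; the
observation above concerns building the list, and neither explanation is
verified here.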

Here are the numbers:

adstree-17: wc /tmp/bib.list /tmp/bib-dir.list
 722941  722941 13012938 /tmp/bib.list
 723277  723277 13014618 /tmp/bib-dir.list
1446218 1446218 26027556 total

adstree-18: time rsync-2.3.2 -avvn rsync://adsfore.harvard.edu/text-257/. .
receiving file list ... done
wrote 75 bytes  read 15233741 bytes  16460.09 bytes/sec
total size is 947471650  speedup is 62.20
83.88u 364.48s 15:25.50 48.4%

adstree-19: time rsync-2.3.2 -avvn --include-from /tmp/bib.list --exclude '*' rsync://adsfore.harvard.edu/text-257/. .
receiving file list ... done
wrote 16627723 bytes  read 72 bytes  2100.13 bytes/sec
total size is 0  speedup is 0.00
3618.25u 33.40s 2:11:56.90 46.1%

adstree-20: time rsync-2.3.2 -avvn --include-from /tmp/bib-dir.list --exclude '*' rsync://adsfore.harvard.edu/text-257/. .
Mon Nov 26 09:27:58 EST 2001
receiving file list ... ^C
3633.04u 61.41s 23:20:19.32 4.3%
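
To put the three runs side by side, the `time` lines above can be converted
to seconds.  A small Python sketch follows; parse_time_line is a
hypothetical helper, and it assumes the "<user>u <system>s [hh:]mm:ss
<cpu>%" layout seen in the output above.

```python
def parse_time_line(line):
    """Parse a csh-style `time` line into (user_s, system_s, elapsed_s)."""
    user, system, elapsed, _cpu = line.split()
    # Elapsed is [hh:]mm:ss.ff; fold the colon-separated fields into seconds.
    seconds = 0.0
    for field in elapsed.split(":"):
        seconds = seconds * 60 + float(field)
    return float(user.rstrip("u")), float(system.rstrip("s")), seconds


runs = {
    "full tree": "83.88u 364.48s 15:25.50 48.4%",
    "include list, files only": "3618.25u 33.40s 2:11:56.90 46.1%",
    "include list, files + dirs (interrupted)": "3633.04u 61.41s 23:20:19.32 4.3%",
}

baseline = parse_time_line(runs["full tree"])[2]
for label, line in runs.items():
    user, system, elapsed = parse_time_line(line)
    print(f"{label}: user {user:.0f}s, elapsed {elapsed:.0f}s, "
          f"{elapsed / baseline:.1f}x the full-tree run")
```

By these figures the files-only include run takes roughly 8.6 times the
wall-clock time of the full-tree run and about 43 times the user CPU time
(3618.25 s versus 83.88 s).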


In message <20011120155555.A327 at lucent.com>, Dave Dykstra writes:

> On Tue, Nov 20, 2001 at 11:45:44AM +0000, Lachlan Cranswick wrote:
> > 
> > Is there any chance this can be added into the distribution as it sounds
> > really nifty.
> 
> I exchanged some off-list email with the patch author and besides the fact
> that it adds too many options I object to it because it only supports
> copying from the local side to remote, not also from remote to local.
> 
> His option is essentially the same as the --files-from option that was
> discussed last January.  See the thread in the archives beginning at
> 
>      http://lists.samba.org/pipermail/rsync/2001-January/003368.html
> 
> 
> In summary, he can do pretty much what he wants by making an --include-from
> list that lists all the parent directories of the files he wants plus all
> the files he wants and end it with an --exclude '*', but before rsync 2.4.0
> I had an optimization (which I put in when I officially maintained rsync)
> that would directly read the included files in that situation rather than
> recurse through all the directories.  The author of rsync Andrew Tridgell
> took that optimization out in 2.4.0 because he thought it was confusing
> that the optimization didn't require explicitly listing the parent
> directories like an --exclude '*' otherwise does, and I couldn't prove that
> recursing through the directories made a significant performance impact.
> Later people argued that a new option --files-from would be worth doing
> just for convenience even if not for performance, but I said I still wanted
> people to do some performance testing before I'd implement it.  I wanted
> people to run version 2.3.2 on their systems and compare the time
> difference between running with and without my optimization, which you can
> force by simply putting in a single wildcard in one included filename.
> 
> I still want to write a --files-from option sometime, and I'm still waiting
> for somebody who has an application that could use it to do some
> performance measurements with rsync 2.3.2.  I agree that --files-from has
> value on its own without performance implications, but somebody has to want
> it badly enough to put in a little effort if they'd like me to implement
> it.


-- Alberto


****************************************************************************
Alberto Accomazzi                          mailto:aaccomazzi at cfa.harvard.edu
NASA Astrophysics Data System                      http://adsabs.harvard.edu
Harvard-Smithsonian Center for Astrophysics        http://cfawww.harvard.edu
60 Garden Street, MS 83, Cambridge, MA 02138 USA   
****************************************************************************



