Rsync: Re: patch to enable faster mirroring of large filesystems

Fri Nov 30 03:02:07 EST 2001

Here are some more results from my tests of a --files-from implementation.
I have modified (actually "hacked" is more appropriate here -- read on)
the source of rsync-2.3.2 to implement the command line option --files-from
in order to test how well this feature would work in a real-world scenario
with lots of files to be transferred.  I used version 2.3.2 because it
includes Dave's optimization on the server side, which sends the files
right back without attempting the pattern matching that is normally done
for include/exclude patterns.

This is what my modifications do:

- add --files-from=FILE option, to read a list of files to be transferred
- modify send_exclude_list in exclude.c to send the list of files to be
  transferred in addition to the regular exclude list, and fake an 
  --exclude '*' just to turn on the optimization on the server side
- turn on buffering in send_exclude_list before the list is sent 
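
To illustrate why the buffering step matters, here is a minimal Python sketch
(not rsync code) that counts how many write calls reach the "socket" with and
without a buffer in front of it.  The CountingSink class, the send_entries
helper, the 32K buffer size and the length / "+ " / name framing are all
assumptions made for the illustration, loosely modelled on the truss output
quoted further down:

```python
import io
import struct

class CountingSink(io.RawIOBase):
    """Hypothetical stand-in for a socket: counts the write() calls it gets."""
    def __init__(self):
        super().__init__()
        self.calls = 0
        self.nbytes = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.nbytes += len(b)
        return len(b)

def send_entries(out, names):
    """Write each name as three small pieces (framing is an assumption)."""
    for name in names:
        payload = name.encode("ascii")
        out.write(struct.pack("<i", len(payload) + 2))  # 4-byte length
        out.write(b"+ ")                                # include marker
        out.write(payload)                              # the file name

names = ["J90/J90-%04d" % i for i in range(10000)]

# Unbuffered: every small piece becomes its own write on the sink.
raw = CountingSink()
send_entries(raw, names)

# Buffered: small pieces are coalesced into a few large writes.
sink = CountingSink()
buffered = io.BufferedWriter(sink, buffer_size=32 * 1024)
send_entries(buffered, names)
buffered.flush()

print("unbuffered writes:", raw.calls)
print("buffered writes:  ", sink.calls)
print("same bytes sent:  ", raw.nbytes == sink.nbytes)
```

The same bytes go out in both cases; only the number of calls differs, which
is the effect I was after on the client side.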

These are the new numbers from my latest test run with this patched version of
rsync-2.3.2; compare them to the ones I reported previously (quoted below).
The list I'm sending contains 722,941 files (13MB):

> date; rsync-2.3.2 -avvzn --files-from /tmp/bib.list \
	rsync://adsfore.harvard.edu/test/. .; date
Wed Nov 28 21:50:20 EST 2001
sending files-from list at Wed Nov 28 21:50:20 2001
done sending files-from list at Wed Nov 28 23:30:16 2001
receiving file list at Wed Nov 28 23:30:16 2001
(using include-only optimization) done receiving file list at Wed Nov 28 23:35:03 2001
[ ...list of over 3,000 files to be updated... ]
total: matches=0  tag_hits=0  false_alarms=0 data=0
wrote 16640660 bytes  read 10001913 bytes  4204.62 bytes/sec
total size is 755673302  speedup is 28.36
Wed Nov 28 23:35:55 EST 2001

These numbers show that reading the filenames this way, rather than using
the existing code for the include/exclude list, cuts the startup time down
to 0 (from about 1hr).  The actual sending of the filenames is down from
about 2h 15m to 1h 40m.  It isn't better than that because turning
buffering on only helps the client, while the server still has to do
unbuffered reads because of the way the list is sent across.
As far as I can tell there is no way to get around the unbuffered reads
without a protocol change or a different approach to sending this list.
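
For anyone who wants to check the arithmetic, here is a small Python sketch
that derives the phase durations from the timestamps in the two transcripts
(the --include-from run quoted below and the --files-from run above).  The
elapsed helper and the phase breakdown are my own; all stamps are from Nov 28,
so only the time of day is parsed:

```python
from datetime import datetime

def elapsed(start, end):
    """Time between two HH:MM:SS stamps taken on the same day."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

# (run started, "sending ... list", "done sending", "done receiving file list")
runs = {
    "--include-from": ("10:53:03", "11:56:22", "14:13:48", "14:18:59"),
    "--files-from":   ("21:50:20", "21:50:20", "23:30:16", "23:35:03"),
}

phases = {}
for label, (start, send, sent, received) in runs.items():
    # startup (building the list), sending the list, receiving the file list
    phases[label] = (elapsed(start, send),
                     elapsed(send, sent),
                     elapsed(sent, received))
    print(label, *phases[label])
# --include-from 1:03:19 2:17:26 0:05:11
# --files-from 0:00:00 1:39:56 0:04:47
```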

Given the data above, I think implementing --files-from this way would
be the wrong way to go, for a number of reasons:

- it's a hack to treat the list of files as an include list, and prevents
  the correct use and implementation of other includes/excludes; I still
  think the two options should be orthogonal, so that saying:
	rsync --files-from=foo.list rsync://server/module .
  would be equivalent to:
	cat foo.list | xargs -n 1 -i rsync rsync://server/module/{} .
  except that in the first case we can do the transfer with one rsync call

- my patch currently implements this in a very inefficient way, since the
  file list is sent across uncompressed and is read unbuffered by the
  server; as the numbers show, this is a killer for applications that, as
  in my case, need to send large lists across

- the option only works on the client side, while it may be desirable to
  have the same option on the server side (just like we have for
  include/excludes), so that people could say --files-from=foo in the
  invocation of the remote rsync command, or even put a directive
  "files from = foo" in rsyncd.conf
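
To make the orthogonality argument in the first bullet concrete, here is a
rough Python sketch of the semantics I have in mind: the files-from list
picks the candidates, and any exclude patterns are applied to them as a
separate, independent step.  The select function is hypothetical, and
fnmatchcase is only a stand-in for rsync's own pattern rules, which differ:

```python
from fnmatch import fnmatchcase

def select(files_from, excludes=()):
    """Return the listed names that survive the (separate) exclude patterns."""
    chosen = []
    for line in files_from:
        name = line.strip()
        if not name:
            continue  # skip blank lines in the list
        if any(fnmatchcase(name, pat) for pat in excludes):
            continue  # excludes filter the list; they do not define it
        chosen.append(name)
    return chosen

listing = ["a/1.txt\n", "a/2.bak\n", "\n", "b/3.txt\n"]
print(select(listing))
print(select(listing, excludes=["*.bak"]))
```

With no excludes the list is taken as is; with excludes the two mechanisms
compose instead of one being implemented in terms of the other.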

I don't understand the guts of rsync well enough to come up with the
right patch, but I hope that my ramblings will help move the discussion
forward.  From what I see there is enough interest in having the option in
rsync, and I hope that we can get there, but right now I feel that a
half-baked job is not going to cut it, at least for people like me who
have large file lists to move around.

If anybody is interested in taking a look at the patch, you can get it from
http://ads.harvard.edu/~alberto/rsync/rsync-2.3.2-files-from.patch

-- Alberto

In message <20011128153633.B2821 at lucent.com>, Dave Dykstra writes:

> Rsync list: Alberto and I have done a couple more exchanges by private
> email, and we found that he wasn't turning on my include/exclude
> optimization in his test because he had an "exclude" directive in
> rsyncd.conf.  He has now removed that and run the test again.  His very
> interesting results are below with my comments.
> 
> Note that his case is rather pathological because he's got over a million
> files in only 400 directories, so he must have an average of over 2500
> files per directory, which are very large directories.  He's got about 65%
> of the files explicitly listed in his --include-from file.
> 
> 
> 
> On Wed, Nov 28, 2001 at 03:18:24PM -0500, Alberto Accomazzi wrote:
> ...
> > Both machines are SUN ultra 80s (2x450 UII, 1GB RAM), on a rather busy LAN,
> > so take results with a grain of salt:
> >
> > # synchronization of 1.1M files in 400 directories (135,000 to be updated):
> > 
> > > date ; rsync-2.3.2 -avvzn rsync://adsfore.harvard.edu/test/ . ; date
> > Wed Nov 28 10:01:15 EST 2001
> > receiving file list at Wed Nov 28 10:01:17 2001
> > done receiving file list at Wed Nov 28 10:27:17 2001
> > [ ...list of approx 135,000 files to be updated or 2.4 MB... ]
> > wrote 539699 bytes  read 17803469 bytes  11046.77 bytes/sec
> > total size is 1137025227  speedup is 61.99
> > Wed Nov 28 10:28:55 EST 2001
> > 
> > # synchronization of 722,941 files (13MB) in bib.list from the same directory
> > 
> > > date ; rsync-2.3.2 -avvzn --include-from bib.list --exclude '*' \
> > 	rsync://adsfore.harvard.edu/test/. . ; date
> > Wed Nov 28 10:53:03 EST 2001
> > sending exclude list at Wed Nov 28 11:56:22 2001
> > done sending exclude list at Wed Nov 28 14:13:48 2001
> > receiving file list at Wed Nov 28 14:13:48 2001
> > (using include-only optimization) done receiving file list at Wed Nov 28 14:18:59 2001
> > [ ...list of approx 3,200 files to be updated or 58 KB... ]
> > wrote 16640660 bytes  read 10001913 bytes  2143.15 bytes/sec
> > total size is 755673302  speedup is 28.36
> > Wed Nov 28 14:20:15 EST 2001
> 
> 
> Note the difference in total bytes written; presumably it was the exclude
> list.
> 
> 
> > The astonishing thing here is the time spent by the client in fiddling and
> > sending the exclude list!  Just over 1hr to create the list in memory and 
> > more than 2hrs to send it.  When I trussed the process during the exclude
> > list sending time this is what I saw for every file:
> > 
> > [...]
> > write(3, " +  ", 2)                             = 2
> > poll(0xFFBED858, 1, 60000)                      = 1
> > write(3, " J 9 0 / J 9 0 - 0 5 8 6".., 17)      = 17
> > poll(0xFFBED850, 1, 60000)                      = 1
> > write(3, "13\0\0\0", 4)                         = 4
> > [...]
> > 
> > so it looks like sending the exclude list is quite inefficient and therefore
> > --files-from should definitely not use this code to do the same thing.
> 
> Indeed, it definitely should be doing buffering!  It looks like there's a
> function io_start_buffering() that should be called.  I don't know why it
> isn't called until later, though, and there may be a good reason.  It's
> getting called in send_file_list() and in do_recv(), both of which are
> called from client_run() in main.c after send_exclude_list().  Could you
> play with calling it before send_exclude_list()?  You're sending the list
> from the receiver to the sender so you're in the second half of client_run(),
> the do_recv() part.  You may possibly need to call io_flush() more often,
> although I don't think so.  Could that be the reason why io_start_buffering()
> wasn't turned on earlier?  Looks like buffering can be disabled with
> io_end_buffering() if you need to.
> 
> 
> > Also I'm sure that the 1hr spent building the exclude list can be 
> > greatly reduced by just slurping in the file list in memory.
> 
> Yes, I think that 1hr can be completely bypassed by reading the
> --files-from file directly inside the send_exclude_list() function
> and bypassing all the work done by make_exclude_list() to generate the
> in-memory representation of the exclude patterns.  I sure am glad you
> ran this test because otherwise I probably wouldn't have thought of
> doing that.
> 
> Hmm, wait, the remote side would still be building the in-memory exclude
> pattern representations.  I guess that needs a short-circuit too.
> 
> 
> 
> > I guess the good news is how quickly the results came back from the server,
> > which is where your optimization kicks in. 
> 
> Yes, that only took 5 minutes!
> 
> > I've started one last test that
> > won't trigger the optimization out of curiosity, although these numbers
> > clearly show that most of the gain can be had by bypassing the
> > include/exclude dance on the client side.
> 
> I expect the 5 minutes part will go up significantly and the rest will
> stay the same.  I'd like to know by how much.
> 
> 
> - Dave Dykstra

****************************************************************************
Alberto Accomazzi                          mailto:aaccomazzi at cfa.harvard.edu
NASA Astrophysics Data System                      http://adsabs.harvard.edu
Harvard-Smithsonian Center for Astrophysics        http://cfawww.harvard.edu
60 Garden Street, MS 83, Cambridge, MA 02138 USA   
****************************************************************************