Rsync'ing lists of files

Dave Dykstra dwd at bell-labs.com
Mon Jun 10 10:40:12 EST 2002


On Fri, Jun 07, 2002 at 06:23:32PM -0400, Stephane Paltani wrote:
> Hi Everybody,
> 
> I'm new to this list, but I have been using rsync for quite some time.
> First, congratulations to the rsync team for a very fine piece of software!
> 
> I'm wondering whether rsync could help me to perform the following task:
> 
> I have 5 million files on one side of the ocean, 100000 of which must be
> copied to the other side. Both numbers grow with time, and occasionally, some
> files must be removed from the "to be copied" list (i.e., they must be
> deleted on the receiving side, but kept on the sending side). I currently
> do this manually, but having rsync doing it would mean that the two archives
> could be sync'ed much more regularly.
> 
> I tried to use a combination of --include-from=<list of files> --exclude='*',
> and it seems to work. However, I have the impression that the algorithm
> is far from optimal in this case: There is no usable pattern in the
> file names, and I have to list all of them in the "--include-from" file.
> rsync therefore makes 5000000 x 100000 comparisons approximately. The building
> of the file list is therefore extremely slow (found 8000 files after 2 hours,
> i.e. ~24 hours just to build the file list).
> [correct me if my understanding of how rsync works is wrong].
> 
> I have the impression that the above situation might not be
> so uncommon. So, is there another way that I missed in the doc to do that?
> What I would be looking for is a parameter:
> --file-list=<list of files> (which would override any "--in/exclude").
> rsync would only consider these files, and ignore all the other ones,
> and also a "--delete-not-in-list" flag which would make all the
> files on the receiving side be deleted if they are not in the list.
> 
> Of course, if there is another way using current rsync, it would be great!
> And sorry if I missed an obvious solution...


Sigh, another request for the --files-from I promised to write over 6
months ago, but I've been so overloaded at work lately that I don't know if
I'm ever going to get to it.  Perhaps someone else will have to do it.

Someone implemented a version that was part of the way there at
    http://lists.samba.org/pipermail/rsync/2001-November/005272.html
but among other problems it only worked when sending files and not when
receiving files.

It turns out that back in rsync 2.3.2 and earlier there was an optimization
(which I wrote and actually was the primary reason that I volunteered to be
maintainer of rsync for a while) that kicked in when there was a list of
includes with no wildcards followed by an --exclude '*', and there was no
--delete.  Instead of recursing through the files and doing comparisons, it
would just directly open the files in the include list.  It only had to be
on the sending side, so you might want to try 2.3.2 on your sending side to
see if you get a significant performance boost.  Andrew Tridgell took it
out in 2.4.0 because he didn't like how it changed the usual semantics of
requiring all parent directories to be explicitly listed in the include
list.
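
To illustrate the difference between the two strategies described above,
here is a minimal Python sketch.  It is not rsync's code and makes no claim
about how rsync 2.3.2 was implemented internally; the function names
select_by_scan and select_by_direct_lookup are hypothetical, and the
include list is assumed to hold plain relative paths with no wildcards.
The first function walks the whole tree and tests every file against the
include list, the second just checks each listed path directly.

```python
import os


def select_by_scan(root, includes):
    """Walk the whole tree and compare every file against the include list.

    Returns the matching relative paths and the number of comparisons made,
    which grows roughly as (files in tree) x (entries in list).
    """
    selected = []
    comparisons = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            for entry in includes:  # linear scan over the include entries
                comparisons += 1
                if rel == entry:
                    selected.append(rel)
                    break
    return sorted(selected), comparisons


def select_by_direct_lookup(root, includes):
    """Skip the tree walk and look up each listed path directly.

    The work is proportional to the length of the include list only,
    no matter how many other files live under root.
    """
    selected = [rel for rel in includes
                if os.path.isfile(os.path.join(root, rel))]
    return sorted(selected), len(includes)
```

Both functions return the same set of files; only the amount of work
differs.  With 5 million files and a 100000-entry list, the second approach
touches 100000 paths instead of walking everything.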

Your --delete-not-in-list suggestion has not been considered before, but
something like that makes sense to me.
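
As a rough illustration of what such a "--delete-not-in-list" behaviour
could mean on the receiving side, here is a small Python sketch.  No such
rsync option exists as far as this thread goes; the function name
delete_not_in_list and its dry_run parameter are assumptions made for the
example, and the list is again assumed to contain plain relative paths.

```python
import os


def delete_not_in_list(dest_root, keep, dry_run=True):
    """Remove every regular file under dest_root that is not named in keep.

    keep is an iterable of paths relative to dest_root.  With dry_run=True
    nothing is deleted; the function only reports what would be removed.
    Directories are left in place even if they end up empty.
    """
    keep_set = {os.path.normpath(p) for p in keep}
    removed = []
    for dirpath, _dirnames, filenames in os.walk(dest_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.normpath(os.path.relpath(full, dest_root))
            if rel not in keep_set:
                removed.append(rel)
                if not dry_run:
                    os.remove(full)
    return sorted(removed)
```

Running it first with dry_run=True and checking the returned list is a
cheap safeguard before deleting anything for real.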

- Dave Dykstra
