patch to enable faster mirroring of large filesystems

Dave Dykstra dwd at bell-labs.com
Tue Nov 20 08:34:59 EST 2001


Before I look at this closely, I have a couple questions.

First, what options do you use to copy?  I once saw somebody who went through
a lot of work to cache things and it turned out to be just because he
was using the --checksum option when he shouldn't have.

Next, have you taken a look at the 2.4.7 pre-release?  It includes a couple
options, --write-batch and --read-batch, which may do what you want, I'm not
sure.  I think the intention of that, though, was more to be able to
efficiently do copies to multiple targets, not to speed up a single copy.

You can get the latest development version of rsync by doing
    rsync -a rsync://rsync.samba.org/ftp/unpacked/rsync .

- Dave Dykstra


On Mon, Nov 19, 2001 at 04:06:45PM -0500, Andrew J. Schorr wrote:
> I have attached a patch that adds 4 options to rsync that have helped
> me to speed up my mirroring.  I hope this is useful to someone else,
> but I fear that my relative inexperience with rsync has caused me to
> miss a way to do what I want without having to patch the code.  So please
> let me know if I'm all wet.
> 
> Here's my story: I have a large filesystem (around 20 gigabytes of data)
> that I'm mirroring over a T1 link to a backup site.  Each night, 
> about 600 megabytes of data needs to be transferred to the backup site.
> Much of this data has been appended to the end of various existing files,
> so a tool like rsync that sends partial updates instead of the whole
> file is appropriate.
> 
> Normally, one could just use rsync with the --recursive and --delete features
> to do this.  However, this takes a lot more time than necessary, basically
> because rsync spends a lot of time walking through the directory tree
> (which contains over 300,000 files).
> 
> One can speed this up by caching a listing of the directory tree.  I maintain
> an additional state file at the backup site that contains a listing
> of the state of the tree after the last backup operation.  This is essentially
> equivalent to saving the output of "find . -ls" in a file.
> 
> Then, the next night, one generates the updated directory tree for the source
> file system and does a diff with the directory listing on the backup file
> system to find out what has changed.  This seems to be much faster than
> using rsync's recursive and delete features.
> 
> I have my own script and programs to delete any files that have been removed,
> and then I just need to update the files that have been added or changed.
> One could use cpio for this, but it's too slow when only partial files
> have changed.
> 
> So I added the following options to rsync:
> 
>      --source-list           SRC arg will be a (local) file name containing a list of files, or - to read file names from stdin
>      --null                  used with --source-list to indicate that the file names will be separated by null (zero) bytes instead of linefeed characters; useful with gfind -print0
>      --send-dirs             send directory entries even though not in recursive mode
>      --no-implicit-dirs      do not send implicit directories (parents of the file being sent)
> 
> The --source-list option allows me to supply an explicit list of filenames
> to transport without using the --recursive feature and without playing
> around with include and exclude files.  I'm not really clear on whether
> the include and exclude files could have gotten me the same place, but it
> seems to me that they work hand-in-hand with the --recursive feature that
> I don't want to use.
> 
> The --null flag allows me to handle files with embedded linefeeds.  This
> is in the style of gnu find's -print0 operator.
> 
> The --send-dirs overcomes a problem where rsync refuses to send directories
> unless it's in recursive mode.  One needs this to make sure that even
> empty directories get mirrored.
> 
> And the --no-implicit-dirs option turns off the default behavior in which
> all the parent directories of a file are transmitted before sending the
> file.  That default behavior is very inefficient in my scenario where I
> am taking the responsibility for sending those directories myself.
> 
> So, the patch is attached.  If you think it's an abomination, please let
> me know what the better solution is.  If you would like some elaboration
> on how this stuff really works, please let me know.
> 
> Cheers,
> Andy
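
For illustration, the listing-diff approach described in the quoted message
(cache the state of the tree, compare it with a fresh listing, and hand only
the changed paths to the transfer tool) might look roughly like the sketch
below. This is not the attached patch or Andy's actual scripts; the function
names snapshot, diff_snapshots and nul_separated are hypothetical, and the
choice of size plus modification time as the change signal is an assumption.
The NUL-separated output mirrors what "find -print0" produces, which is the
input format the proposed --null option is described as accepting.

```python
import os
import stat


def snapshot(root):
    """Map every path under root (relative to root) to a small metadata tuple.

    Directories are recorded only by kind, so that adding or removing a child
    (which changes the directory's mtime) does not mark the directory itself
    as changed. Everything else is recorded by size and modification time.
    """
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: do not follow symlinks
            rel = os.path.relpath(full, root)
            if stat.S_ISDIR(st.st_mode):
                state[rel] = ("d",)
            else:
                state[rel] = ("f", st.st_size, int(st.st_mtime))
    return state


def diff_snapshots(old, new):
    """Return (changed_or_added, removed) path lists, each sorted."""
    changed = sorted(p for p, meta in new.items() if old.get(p) != meta)
    removed = sorted(p for p in old if p not in new)
    return changed, removed


def nul_separated(paths):
    """Encode paths as a NUL-terminated byte stream, like find -print0."""
    return b"".join(os.fsencode(p) + b"\0" for p in paths)
```

In this sketch the "removed" list would go to whatever deletes files at the
backup site, and the "changed" list (which includes new, possibly empty,
directories) would be written out as the explicit file list. Saving the new
snapshot after a successful run plays the role of the state file kept at the
backup site.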



