Rsync - expensive startup question

jw schultz jw at pegasys.ws
Mon Nov 10 12:54:19 EST 2003


On Sun, Nov 09, 2003 at 05:09:42PM -0500, Cedric Puddy wrote:
> 
> Hi there,
> 
> I'm using rsync with some large trees of files (on one
> disk, we have 30M files, for example, and we might
> be copying say, 500k files in one tree.  The file trees
> are reasonably balanced -- no single directory has thousands
> of files in it, for example.  Our file system, at the moment,
> is ext3.  We are very comfortable with it, and are hesitant
> to switch away from it, though JFS or Reiserfs could be
> persuasive if people's experience strongly suggests that
> they would help.  My guess is that because the tree is
> reasonably balanced, changing filesystems isn't going
> to have a major effect on how big a bottleneck the filesystem
> may be.)
> 
> ANYWAY, the point is, as you've guessed, that I hate having
> to wait 20 or 30 minutes in order to have a transfer start
> (even when I'm copying to a location that doesn't even
> have anything there yet, thus, no possibilities of deltas
> to figure out).
> 
> I've never really asked about this because my assumption has
> always been that it takes that long, because it simply takes
> that long to scan the disks, populate rsync's data structures,
> and get the show on the road, and that if I want it quicker,
> then I can darn well get faster disks, etc.
> 
> (a) is that assumption correct?  Or am I missing anything?

Mostly.  Cache is an issue.  Pre-populating the inode and
dentry caches may help but would itself take time.  30
million files is too much to expect any kind of atomicity.
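
A minimal sketch of what pre-populating the caches could look like,
assuming a Linux-style system where an lstat of each entry pulls its
inode and dentry into memory.  The helper name warm_metadata_cache is
hypothetical and not part of rsync.  It pays the same scan cost up
front, and with tens of millions of files the kernel may evict entries
again before rsync gets to them.

```python
import os


def warm_metadata_cache(root):
    """Walk `root` and lstat every entry so the kernel reads its metadata.

    Returns the number of entries visited.  Hypothetical helper, not rsync code.
    """
    visited = 0
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            with os.scandir(directory) as entries:
                for entry in entries:
                    visited += 1
                    try:
                        # lstat the entry itself; do not follow symlinks
                        entry.stat(follow_symlinks=False)
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                    except OSError:
                        continue  # entry vanished or is unreadable
        except OSError:
            continue  # directory vanished or is unreadable
    return visited
```

Running this shortly before a scheduled transfer only moves the disk
reads earlier; it does not remove them.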

> (b) for those of you who understand rsync internals better
> 	than I (eg: anyone at all who's done anything with the
> 	code :P)  Is there any possibility of rsync-in-daemon
> 	mode being able to leverage the File Alteration Monitor
> 	(FAM) efforts in order to cheaply maintain a more-or-less
> 	up to the moment map of the trees it is exporting?
> 	(I have reservations about this, because I seem to recall
> 	understanding that FAM was *not* designed to watch
> 	*vast* huge portions of huge filesystems -- more that
> 	it was designed for monitoring specific resources.)

No chance in mainline.

> 	For that matter, is this not the sort of thing that
> 	ReiserFS, with its evolution towards a pluggable
> 	architecture, might be perfect for?
> 
> (c) I assume that it would be folly (eg: something that complicates
> 	the problem space substantially) to try and write something
> 	that simply started copying, and built the map as it
> 	went along, or in the background (though I could see
> 	this as being very interesting for situations where one's
> 	network was *much* slower than one's disks).
> 
> One of the reasons I ask is that I've often come across rsync
> being used as a sort of lazy filesystem mirroring tool, the
> point being to make a sync with a remote filesystem every,
> say, 10 minutes.  Which is fine, until the file tree grows
> too large to parse in 10 minutes, in which case you have to
> (a) reduce the transfer frequency, and (b) resign yourself
> to having your I/O subsystem running flat out *all the time*.

Perhaps your Reiser4 plugin could log every file that is
changed, and that log could be fed to a --files-from argument
or to a finely tuned utility that would rsync the files and
propagate deletes.  Or maybe the better approach would be to
invest in a real cluster filesystem.
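
A rough sketch of the --files-from idea, assuming some change logger
(the Reiser4 plugin is speculative) hands over a list of absolute
paths.  The function names are hypothetical.  --files-from and --from0
are existing rsync options; everything else here is illustration.
Paths that no longer exist are returned separately, because
--files-from alone does not propagate deletes.

```python
import os


def split_change_log(changed_paths, root):
    """Turn logged absolute paths into (to_sync, deleted) lists relative to root."""
    root = os.path.abspath(root)
    to_sync, deleted, seen = [], [], set()
    for path in changed_paths:
        full = os.path.abspath(path)
        rel = os.path.relpath(full, root)
        outside = rel == os.pardir or rel.startswith(os.pardir + os.sep)
        if outside or rel in seen:
            continue  # skip paths outside the tree and duplicate log entries
        seen.add(rel)
        (to_sync if os.path.lexists(full) else deleted).append(rel)
    return to_sync, deleted


def write_files_from(rel_paths, list_path):
    """Write a NUL-separated list suitable for rsync --from0 --files-from."""
    with open(list_path, "wb") as handle:
        for rel in rel_paths:
            handle.write(os.fsencode(rel) + b"\0")


def rsync_command(list_path, src_root, dest):
    """Build the argv for rsync; the caller decides when and how to run it."""
    return ["rsync", "-a", "--from0", "--files-from=" + list_path,
            src_root.rstrip("/") + "/", dest]
```

The deleted list would still need a separate step on the receiving
side, which is the "finely tuned utility" part.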

> Also, with the "monolithic" scan, the filesystem can easily
> change between the scan being done, and the actual directory/file
> in question being copied.  Might it not be better all round
> to walk the tree progressively, making a sync plan for each
> "leaf node" of the tree as one reaches it?

Yes, it would be better.  We all agree, but it cannot be done
without wholesale changes to the protocol.
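
To illustrate the progressive walk on a purely local tree, here is a
minimal sketch that yields a plan per directory as it is reached,
instead of building one list for the whole tree.  It is not how rsync
works and says nothing about the wire protocol; walk_plans and
needs_copy are hypothetical names, and the size-plus-mtime comparison
is only an approximation of a quick check.

```python
import os


def needs_copy(src_path, dst_path):
    """Copy if the destination is missing or differs in size or mtime."""
    try:
        dst = os.stat(dst_path)
    except FileNotFoundError:
        return True
    src = os.stat(src_path)
    return src.st_size != dst.st_size or int(src.st_mtime) != int(dst.st_mtime)


def walk_plans(src_root, dst_root):
    """Yield (relative_dir, [file names to copy]) one directory at a time."""
    for dirpath, dirnames, filenames in os.walk(src_root):
        dirnames.sort()  # deterministic traversal order
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        todo = [name for name in sorted(filenames)
                if needs_copy(os.path.join(dirpath, name),
                              os.path.join(dst_dir, name))]
        yield rel, todo
```

Because walk_plans is a generator, a caller can start copying the
first directory's files before the rest of the tree has been scanned,
which also narrows the window between scanning a directory and copying
it.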

> Anyway, I'd be interested what people think -- this is an
> awesome tool, and if there's any chance that addressing
> some of these things is technically possible, I'd like to
> know.  (Never know, I might be able to help get the work
> done, or at least fund someone)
> 
> All the best,
> 
> -Cedric
> 
> 
> -- 
> -
> |  CCj/ClearLine - Unix/NT Administration and TCP/IP Network Services
> |  118 Louisa Street, Kitchener, Ontario, N2H 5M3, 519-741-2157
> \____________________________________________________________________
>    Cedric Puddy, IS Director		cedric at thinkers.org
>      PGP Key Available at: 		http://www.thinkers.org/cedric
> 

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt


