Rsync - expensive startup question

Cedric Puddy cedric at cadence.thinkers.org
Mon Nov 10 09:09:42 EST 2003


Hi there,

I'm using rsync with some large trees of files (on one
disk we have 30M files, for example, and we might
be copying, say, 500k files in one tree.  The file trees
are reasonably balanced -- no single directory has thousands
of files in it, for example.  Our file system, at the moment,
is ext3.  We are very comfortable with it, and are hesitant
to switch away from it, though JFS or ReiserFS could be
persuasive if people's experience strongly suggests that
they would help.  My guess is that because the tree is
reasonably balanced, changing filesystems isn't going
to have a major effect on how big a bottleneck the filesystem
may be.)

ANYWAY, the point is, as you've guessed, that I hate having
to wait 20 or 30 minutes in order to have a transfer start
(even when I'm copying to a location that doesn't even
have anything there yet, thus, no possibilities of deltas
to figure out).

I've never really asked about this because my assumption has
always been that it takes that long, because it simply takes
that long to scan the disks, populate rsync's data structures,
and get the show on the road, and that if I want it quicker,
then I can darn well get faster disks, etc.
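
One way to test that assumption is to time the raw scan on its own,
with no rsync involved.  The following is only a rough sketch in
Python: scan_tree and timed_scan are made-up names, and the walk is
merely an approximation of the work an up-front scan has to do (one
directory read plus one lstat per entry), not rsync's actual code.

```python
import os
import time


def scan_tree(root):
    """Walk root and lstat every entry; return the number of entries seen.

    This approximates the disk work of building a full file list:
    read each directory once and stat each entry once.
    """
    count = 0
    stack = [root]
    while stack:
        path = stack.pop()
        with os.scandir(path) as entries:
            for entry in entries:
                entry.stat(follow_symlinks=False)  # the per-file cost
                count += 1
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
    return count


def timed_scan(root):
    """Return (entry_count, elapsed_seconds) for one full scan of root."""
    start = time.perf_counter()
    count = scan_tree(root)
    return count, time.perf_counter() - start
```

If timed_scan on the source tree takes about as long as rsync's
startup, the time is going into the disks rather than into rsync
itself.  A second run straight afterwards mostly measures the cache,
so only the first (cold) run is comparable.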

(a) is that assumption correct?  Or am I missing anything?

(b) for those of you who understand rsync internals better
	than I (eg: anyone at all who's done anything with the
	code :P)  Is there any possibility of rsync-in-daemon
	mode being able to leverage the File Alteration Monitor
	(FAM) efforts in order to cheaply maintain a more-or-less
	up to the moment map of the trees it is exporting?
	(I have reservations about this, because I seem to recall
	understanding that FAM was *not* designed to watch
	*vast* huge portions of huge filesystems -- more that
	it was designed for monitoring specific resources.)

	For that matter, is this not the sort of thing that
	ReiserFS, with its evolution towards a pluggable
	architecture, might be perfect for?

(c) I assume that it would be folly (eg: something that complicates
	the problem space substantially) to try and write something
	that simply started copying, and built the map as it
	went along, or in the background (though I could see
	this as being very interesting for situations where one's
	network was *much* slower than one's disks).
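
To illustrate the idea in (b), here is a minimal sketch of a map that
is populated once by a full scan and then kept current from change
notifications.  Everything in it is an assumption made for
illustration: TreeMap, apply_event and the event names "created",
"changed" and "deleted" are hypothetical, and none of this is FAM's
real interface or anything rsync does today.

```python
import os


class TreeMap:
    """In-memory map of relative path -> (size, mtime) for one tree."""

    def __init__(self, root):
        self.root = root
        self.entries = {}

    def full_scan(self):
        """Expensive one-time population: visit every file under root."""
        self.entries.clear()
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                self._refresh(os.path.join(dirpath, name))

    def _refresh(self, full_path):
        """Re-stat one path and update (or drop) its entry."""
        rel = os.path.relpath(full_path, self.root)
        try:
            st = os.lstat(full_path)
        except FileNotFoundError:
            # The path vanished between the event and the stat.
            self.entries.pop(rel, None)
            return
        self.entries[rel] = (st.st_size, st.st_mtime)

    def apply_event(self, kind, full_path):
        """Cheap incremental update for one notification.

        kind is a hypothetical event name: "created", "changed"
        or "deleted".
        """
        if kind == "deleted":
            rel = os.path.relpath(full_path, self.root)
            self.entries.pop(rel, None)
        else:
            self._refresh(full_path)
```

Each notification costs one stat instead of a walk of the whole tree,
which is the attraction.  The open question from above remains: whether
the notification source can watch a tree of this size at all, and what
happens to the map when events are dropped.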

One of the reasons I ask is that I've often come across rsync
being used as a sort of lazy filesystem mirroring tool, the
point being to make a sync with a remote filesystem every,
say, 10 minutes.  Which is fine, until the file tree grows
too large to parse in 10 minutes, in which case you have to
(a) reduce the transfer frequency, and (b) resign yourself
to having your I/O subsystem running flat out *all the time*.

Also, with the "monolithic" scan, the filesystem can easily
change between the scan being done, and the actual directory/file
in question being copied.  Might it not be better all round
to walk the tree progressively, making a sync plan for each
"leaf node" of the tree as one reaches it?
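
As a rough sketch of what walking the tree progressively could look
like (in Python, purely for illustration): plan one directory, copy
what that plan calls for, then move on to the next directory.  The
names plan_directory and progressive_sync are made up, the comparison
is a plain size-and-mtime check, and whole files are copied.  It does
no delta transfer, handles no deletions, and skips symlinks, so it is
not a description of how rsync works.

```python
import os
import shutil


def plan_directory(src_dir, dst_dir):
    """Plan a single directory.

    Returns (files_to_copy, subdirs): the regular files that are missing
    from dst_dir or differ in size/mtime, and the subdirectory names
    still to be visited.  Symlinks and special files are skipped.
    """
    to_copy, subdirs = [], []
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.name)
            elif entry.is_file(follow_symlinks=False):
                src = entry.stat(follow_symlinks=False)
                try:
                    dst = os.stat(os.path.join(dst_dir, entry.name))
                except FileNotFoundError:
                    to_copy.append(entry.name)
                    continue
                if (dst.st_size != src.st_size
                        or int(dst.st_mtime) != int(src.st_mtime)):
                    to_copy.append(entry.name)
    return to_copy, subdirs


def progressive_sync(src_root, dst_root):
    """Copy directory by directory; return the number of files copied."""
    copied = 0
    stack = [""]  # paths relative to the roots; "" is the root itself
    while stack:
        rel = stack.pop()
        src_dir = os.path.join(src_root, rel)
        dst_dir = os.path.join(dst_root, rel)
        os.makedirs(dst_dir, exist_ok=True)
        to_copy, subdirs = plan_directory(src_dir, dst_dir)
        for name in to_copy:
            # copy2 preserves the mtime, so the next run sees a match.
            shutil.copy2(os.path.join(src_dir, name),
                         os.path.join(dst_dir, name))
            copied += 1
        stack.extend(os.path.join(rel, d) for d in subdirs)
    return copied
```

The first copy starts after reading one directory instead of the whole
tree, and each plan is at most one directory old when it is acted on.
The price is that nothing is known up front about the total amount of
work.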

Anyway, I'd be interested in what people think -- this is an
awesome tool, and if there's any chance that addressing
some of these things is technically possible, I'd like to
know.  (Never know, I might be able to help get the work
done, or at least fund someone)

All the best,

-Cedric


-- 
-
|  CCj/ClearLine - Unix/NT Administration and TCP/IP Network Services
|  118 Louisa Street, Kitchener, Ontario, N2H 5M3, 519-741-2157
\____________________________________________________________________
   Cedric Puddy, IS Director		cedric at thinkers.org
     PGP Key Available at: 		http://www.thinkers.org/cedric



