Rsync dies

Fri May 17 12:15:02 EST 2002

> In my humble opinion, this problem with rsync growing a huge memory
> footprint when large numbers of files are involved should be #1 on
> the list of things to fix.

I think many would agree.  If it were trivial, it'd probably be
done by now.

Fix #1 (what most people do):

	Split the files/paths to limit the size of each job.

	What someone could/should do here is at least edit the
	"BUGS" section of the manual to talk about the memory
	restrictions.
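
	A minimal sketch of Fix #1, assuming the tree can be split
	along its top-level entries.  The helper names and the
	max_files threshold are made up for illustration; only
	"rsync -a" with several sources and one destination is real
	rsync usage.  A single top-level directory that is itself
	too big would still need to be split further by hand.

```python
import os


def count_files(path):
    """Count regular directory entries under path (1 if path is not a directory)."""
    if not os.path.isdir(path):
        return 1
    return sum(len(files) for _, _, files in os.walk(path))


def split_into_jobs(src_root, max_files):
    """Group top-level entries of src_root into jobs of roughly max_files files.

    A job is closed as soon as adding the next entry would exceed max_files.
    An entry that is larger than max_files on its own still gets its own job.
    """
    jobs, current, current_count = [], [], 0
    for name in sorted(os.listdir(src_root)):
        path = os.path.join(src_root, name)
        n = count_files(path)
        if current and current_count + n > max_files:
            jobs.append(current)
            current, current_count = [], 0
        current.append(path)
        current_count += n
    if current:
        jobs.append(current)
    return jobs


def rsync_command(paths, dest):
    """Build the argument list for one rsync run covering one job."""
    return ["rsync", "-a", *paths, dest]
```

	Each list returned by rsync_command() could then be handed
	to subprocess.run(), one job at a time.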

Fix #2 (IMHO, what should be done to rsync):

	File caching of results (or using a file-based database of
	some sort) is the way to go.  Instead of maintaining a
	data structure entirely in memory, open a (g)dbm file or add
	hooks into the db(3) libraries to store file metadata and
	checksums.

	It'll be slower than an all-memory implementation, but large
	jobs will at least finish predictably.
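
	To make the idea concrete, here is a rough sketch in Python
	rather than a patch to rsync itself.  It uses Python's
	standard dbm module as the file-based database; the function
	names and the choice of storing size and mtime as JSON are
	assumptions for illustration, not anything rsync does.

```python
import dbm
import json
import os


def build_file_index(src_root, index_path):
    """Walk src_root once, writing per-file metadata to a dbm file.

    Only one record is held in memory at a time, so the footprint does not
    grow with the number of files.  index_path should live outside src_root.
    Returns the number of files indexed.
    """
    count = 0
    with dbm.open(index_path, "n") as db:  # "n": always create a new, empty db
        for dirpath, _dirs, files in os.walk(src_root):
            for name in files:
                full = os.path.join(dirpath, name)
                st = os.lstat(full)
                rel = os.path.relpath(full, src_root)
                record = {"size": st.st_size, "mtime": int(st.st_mtime)}
                db[os.fsencode(rel)] = json.dumps(record).encode("ascii")
                count += 1
    return count


def lookup(index_path, rel_path):
    """Return the stored metadata dict for rel_path, or None if not indexed."""
    with dbm.open(index_path, "r") as db:
        raw = db.get(os.fsencode(rel_path))
    return json.loads(raw) if raw is not None else None
```

	Every lookup goes to disk, which is the slowdown mentioned
	above, but the index can be far larger than available RAM.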

Fix #3 (what I did):

	If you really really need to efficiently transfer large
	numbers of files, come up with your own custom process.

	I used to run a large web site with thousands of files and
	directories that needed to be distributed to dozens of
	servers atomically.  Using rsync, I ran into memory
	problems and worked around them with Fix #1.  Another
	problem was running rsync in parallel.  The source directory
	was scanned order(N) times when it needed to be scanned only
	once.  The source content server was pummeled by the
	multiple simultaneous instances.  I resorted to writing my
	own single-threaded rsync-like program in Perl, which behaves
	more like Fix #2 and runs very efficiently.
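
	The scan-once idea can be sketched as follows.  This is not
	the Perl program, just an illustration in Python with
	made-up names.  It assumes each destination can supply a
	manifest of what it already has (how that is fetched is left
	out), and it keeps the source manifest in a plain dict for
	brevity where a very large tree would want an on-disk store.

```python
import os


def scan_once(src_root):
    """Walk the source tree a single time and return {relative path: (size, mtime)}."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            rel = os.path.relpath(full, src_root)
            manifest[rel] = (st.st_size, int(st.st_mtime))
    return manifest


def plan_transfers(source_manifest, dest_manifests):
    """Reuse one source scan to decide what each destination needs.

    dest_manifests maps a server name to that server's own manifest, in the
    same {path: (size, mtime)} shape.  A file is scheduled for a server when
    it is missing there or its size/mtime differ.  Deletions are not handled.
    """
    plans = {}
    for server, dest in dest_manifests.items():
        plans[server] = sorted(
            path for path, meta in source_manifest.items()
            if dest.get(path) != meta
        )
    return plans
```

	With N destinations the source is walked once by scan_once()
	and the comparison for each server runs against that one
	result, instead of N separate walks of the source tree.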

	I've spent some time cleaning up this program so that
	I can publish it, but priorities (*) are getting in the
	way.  When I get some time, you'll see it posted here.

--
Eric Ziegast

(*) Looking for a full-time job is a full-time job.  :^(
    Will consult for food.