Syncing large amounts of data

jw schultz jw at pegasys.ws
Wed Feb 12 19:45:08 EST 2003


On Wed, Feb 12, 2003 at 01:13:45AM -0600, Adam Herbert wrote:
> I need some suggestions. Here's my setup:
> 
> 	800GB of Data
> 	14,000,000+ Files
> 	No changes, just additions
> 	Files range in size from 30k - 190k
> 
> The files are laid out in a tree fashion like:
> 
> BASE
>    \-Directory ( Numerical Directory name from 0 - 1023 )
>      \-Directory ( Numerical Directory name from 0 - 1023 )
>        \- Files ( Up to 1024 files each directory )
> 
> 
> This allows for a maximum of about a billion files. I need to limit the
> amount of memory usage and processor / io time it takes to build the
> list of files to transmit. Is there a better solution than rsync? Are
> there patches that would help rsync in my particular situation?

Rsync's real advantage comes when existing files change.
In this case that is moot.

Using rsync on the whole tree at once will almost certainly
use more memory than you want.  You could instead loop
through the second-level directories with rsync.
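
A rough sketch of that loop, assuming the same $BASE path on
both machines and $dest naming the remote host (both are
placeholders here), might be:

	cd "$BASE"
	for sub in [0-9]*/[0-9]*; do
		rsync -aR "$sub" "$dest:$BASE/"
	done

Each rsync run then only has to build a list of at most 1024
files, which keeps memory use small.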

My inclination here would be to roll your own.
Something as simple as

	touch "$newstamp"	# mark the start of this run
	cd "$BASE"
	find . -newer "$laststamp" | cpio -o -H crc | ssh "$dest" "cd $BASE && cpio -idum"
	mv "$newstamp" "$laststamp"	# remember this run for next time

may be sufficient.  Building the file list with comm -23 on
the sorted output of "find . -type f -print" on source and
destination may be more reliable, since it compares what is
actually on each side rather than relying on timestamps.
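
A minimal sketch of that approach, using /tmp/src.list and
/tmp/dst.list as placeholder scratch files and the same
$BASE and $dest assumptions as above:

	cd "$BASE"
	find . -type f -print | sort > /tmp/src.list
	ssh "$dest" "cd $BASE && find . -type f -print | sort" > /tmp/dst.list
	comm -23 /tmp/src.list /tmp/dst.list | cpio -o -H crc |
	    ssh "$dest" "cd $BASE && cpio -idum"

comm -23 prints only the names present on the source and
missing on the destination, so only the new files get sent.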

For that matter it might be worthwhile to build the
infrastructure to replicate the files at creation time.  The
structure you describe suggests the files are created by an
automated process; build the replication into that process.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt

