Syncing Large File Systems

Scott Becker scottb at bxwa.com
Fri Jul 23 21:46:45 GMT 2004


I have the following needs:

Sync millions of files.
Set the timestamps to the second (or whatever resolution the web 
servers and clients see).

I need a solution for my employer and have daytime hours to devote to 
it. Currently I have a crude FTP-based solution which doesn't sync 
timestamps. Since I'm migrating to a clustered HTTP server, I need 
accurate timestamps so that client-side caching will still be effective.
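To illustrate the timestamp requirement: HTTP dates such as Last-Modified carry one-second resolution, so it should be enough for every copy of a file to agree on the modification time in whole seconds. A minimal sketch, assuming plain local paths on both sides; the function name copy_mtime_whole_seconds is made up for illustration and is not part of rsync:

```python
import os


def copy_mtime_whole_seconds(src, dst):
    """Set dst's modification time to src's, truncated to whole seconds.

    Hypothetical helper for illustration only. dst's access time is kept
    (also truncated to whole seconds).
    """
    mtime = int(os.stat(src).st_mtime)
    atime = int(os.stat(dst).st_atime)
    os.utime(dst, (atime, mtime))
    return mtime
```

With this, two cluster nodes that received the same file at slightly different moments would still report the same modification time to clients.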

The following features/optional changes would make rsync suit my needs:

- Run as a daemon
- check one directory at a time for changes/updates needed
- store an update list for each dir in a separate file (cache remote dir
  listings)
- spawn two processes to work on updating one dir each (one doing smallest
  change, the other doing oldest change)
- re-check all dirs periodically (every two hours)
- re-check dirs with recent changes (last 72 hours) more often (every 30
  minutes)
- monitor the workhorse processes and spawn a replacement or feed another
  dir to update whenever it finishes one
- whenever a new dir is started it is based on the current dir list
  (dynamic, you can't tell right now what it will be doing in 5 minutes)
- use stored list to re-check local files without bothering remote/using
  bandwidth (assume remote will not be changed by others)
- make all this happen on the side that is pushing the data to conserve
  bandwidth and speed up the process
- this way 2 million files can be checked in a short time (10-20 minutes)
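
Two of the ideas in the list above, the per-directory stored list and the two re-check intervals, could be sketched roughly as follows. This is only an illustration of the approach, not rsync code: the JSON cache format, the function names, and the use of size plus whole-second mtime as the change test are all assumptions, and the cache file is assumed to live outside the directory it describes.

```python
import json
import os
import time

RECENT_WINDOW = 72 * 3600     # dirs changed in the last 72 hours count as "recent"
RECENT_INTERVAL = 30 * 60     # re-check recent dirs every 30 minutes
DEFAULT_INTERVAL = 2 * 3600   # re-check all other dirs every two hours


def snapshot(directory):
    """Return {name: [size, mtime in whole seconds]} for regular files in one dir."""
    listing = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                listing[entry.name] = [st.st_size, int(st.st_mtime)]
    return listing


def load_cache(cache_path):
    """Load the stored list for a dir; a missing file means nothing was pushed yet."""
    try:
        with open(cache_path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_cache(cache_path, listing):
    """Record what the remote is assumed to hold after a successful push."""
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(listing, fh)


def pending_updates(directory, cache_path):
    """Compare local files with the stored list, without contacting the remote."""
    current = snapshot(directory)
    cached = load_cache(cache_path)
    to_send = sorted(n for n, meta in current.items() if cached.get(n) != meta)
    to_delete = sorted(n for n in cached if n not in current)
    return to_send, to_delete, current


def recheck_interval(last_change, now=None):
    """Pick the re-check interval (seconds) from the time of a dir's last change."""
    now = time.time() if now is None else now
    if now - last_change <= RECENT_WINDOW:
        return RECENT_INTERVAL
    return DEFAULT_INTERVAL
```

A pushing daemon would call pending_updates for one directory, transfer the files it names, and then call save_cache with the returned listing. Because the stored list stands in for the remote side, this only holds under the stated assumption that nobody else changes the remote.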

I'm going to dig into the source now to see if I can make all this 
happen. If anyone has any questions or suggestions, shoot. If this can 
be pulled off, it will solve the huge limitation of rsync and make its 
great technology applicable to almost any filesystem.

    scottb
    bxwa.com


