Syncing Large File Systems
scottb at bxwa.com
Fri Jul 23 21:46:45 GMT 2004
I have the following needs:
- Sync millions of files.
- Set the time stamps to the second (or whatever resolution the web
  servers and clients see).
I need a solution for my employer and have daytime hours to devote to
it. Currently I have a crude FTP-based solution which doesn't provide
timestamp syncing. Since I'm migrating to a clustered HTTP server, I need
the timestamps to be accurate so that client-side caching will still be
effective.
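To make the timestamp requirement concrete, here is a minimal sketch in Python (not rsync code) of copying a file's modification time at one-second resolution. The function name copy_mtime_to_second is made up for illustration, and one-second resolution is only an assumption about what the web servers and clients compare.

```python
import os

def copy_mtime_to_second(src_path, dst_path):
    """Set dst's modification time to src's, truncated to a whole second.

    Hypothetical helper for illustration; returns the mtime that was applied.
    """
    mtime_s = int(os.stat(src_path).st_mtime)   # drop sub-second precision
    atime_s = int(os.stat(dst_path).st_atime)   # leave dst's access time alone
    os.utime(dst_path, (atime_s, mtime_s))
    return mtime_s
```

If every node in the cluster gets its copy stamped this way, each one should report the same modification time for the same file, which is what client-side cache validation depends on.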
The following features/optional changes would make rsync suit my needs:
- Run as a daemon
- check one directory at a time for changes/updates needed
- store an update list for each dir in a separate file (cache remote dir
- spawn two processes to work on updating one dir each (one doing smallest
  change, the other doing oldest change)
- re-check all dirs periodically (every two hours)
- re-check dirs with recent changes (last 72 hours) more often (every 30
- monitor the workhorse processes and spawn a replacement or feed another
  dir to update whenever it finishes one
- whenever a new dir is started it is based on the current dir list
  (dynamic, you can't tell right now what it will be doing in 5 minutes)
- use stored list to re-check local files without bothering remote/using
  bandwidth (assume remote will not be changed by others)
- make all this happen on the side that is pushing the data to conserve
  bandwidth and speed up the process
- this way 2 million files can be checked in a short time (10-20 minutes)
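
The stored per-directory list could be sketched roughly as follows. This is Python rather than rsync's C, and it is only an illustration of the idea, not how rsync works today. The function names, the JSON cache format, and the choice of size plus whole-second mtime as the change test are all assumptions. The cache file is assumed to live outside the directory being scanned, and the remote side is assumed to be unchanged by anyone else, as stated in the list above.

```python
import json
import os

def scan_dir(dir_path):
    """Return {name: [size, mtime in whole seconds]} for regular files in one dir."""
    listing = {}
    with os.scandir(dir_path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                listing[entry.name] = [st.st_size, int(st.st_mtime)]
    return listing

def load_cache(cache_path):
    """Load the stored list for one dir; an absent file means nothing was pushed yet."""
    try:
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_cache(cache_path, listing):
    """Write the list to a temp file, then rename, so a crash leaves the old list intact."""
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(listing, f)
    os.replace(tmp_path, cache_path)

def pending_updates(dir_path, cache_path):
    """Compare local files with the stored list, without contacting the remote.

    Returns (names to send, names to delete remotely, current local listing).
    """
    local = scan_dir(dir_path)
    cached = load_cache(cache_path)
    to_send = sorted(name for name, meta in local.items() if cached.get(name) != meta)
    to_delete = sorted(name for name in cached if name not in local)
    return to_send, to_delete, local
```

A worker process would call pending_updates for the dir it was handed, transfer whatever is in to_send, and call save_cache with the returned listing only after the transfer succeeds. A dir with no local changes costs one local scan and no bandwidth.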
I'm going to dig into the source now to see if I can make all this
happen. If anyone has any questions or suggestions, shoot. If this can
be pulled off, it will solve a huge limitation of rsync and make its
great technology applicable to almost any filesystem.