"intelligent" rsync scripts?

Thu Nov 10 07:52:40 GMT 2005

On Mon, Nov 07, 2005 at 10:59:21PM -0500, Chris Shoemaker wrote:
> Ok, so the purpose of the directory list is to make sure all the local
> directories are scanned for potential basis files, even directories not
> mentioned in the transmitted file-list, right?  I didn't realize that
> would require a table and delaying the scan of unknown directories
> until *after* the file-list scan was done.

The idea behind that was to avoid delaying the reception of the file
list.  It would also be possible to scan the extra directories
immediately instead, though this is largely moot (see below).

> Are you saying only unchanged files are available as alternate basis
> files?  If we can, I think it's worth avoiding this restriction.

If we were to use the files directly, then it would be complicated to
try to order the updates to avoid changing a file before another file
could use it as a basis file.  However, I've come up with an algorithm
I like better that avoids this restriction completely:

Rsync already supports the idea of a "partial dir" that can be scanned
for partially-transferred files and delayed updates.  I'm thinking that
hard-linking files into this directory makes this new feature much
easier and more memory efficient (the dir is named ".~tmp~" by default,
relative to the containing directory of the to-be-updated files).
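
To illustrate the hard-link idea, here is a minimal sketch in Python
(not the patch itself, which is C inside rsync).  The function name, its
arguments, and the choice to name the link after the destination file
are all assumptions made for illustration; only the ".~tmp~" default
comes from the description above.

```python
import os

def stash_basis_candidate(candidate_path, dest_path, partial_dir_name=".~tmp~"):
    """Hard-link candidate_path into the partial dir next to dest_path.

    Hypothetical helper: returns the path of the link inside the
    partial dir.  No file data is copied; both names share one inode.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    partial_dir = os.path.join(dest_dir, partial_dir_name)
    os.makedirs(partial_dir, exist_ok=True)

    # Assumption: the link carries the destination's name, so a later
    # partial-dir lookup for dest_path would find it.
    link_path = os.path.join(partial_dir, os.path.basename(dest_path))
    if os.path.lexists(link_path):
        # Leave an existing entry (e.g. a real partial file) alone.
        return link_path

    os.link(candidate_path, link_path)
    return link_path
```

Because the candidate is recorded on disk as a link, the sketch needs no
in-memory table of rename candidates, which is one way to read the
"more memory efficient" remark above.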

I also thought through where I'd like the rename scan to go.  I finally
decided that I liked the idea of piggy-backing the scan on the existing
delete-before or delete-during scans that already occur, since this
makes the logic much simpler (the code already exists to handle all the
proper include/exclude logic, including local .cvsignore/.rsync-filter
files) and it should also make the scan quick because it will take
advantage of disk I/O that is either already occurring, or is at least
in close proximity to identical stat() calls that the generator's update
code is going to make.  (If --delete-after is selected, or if no
deletions are occurring, rsync does the rename scan during the transfer
using a non-deleting version of the delete-during code.)  The only
potential problem with this scan position is that the receiving side may
not have fully finished its scan when we encounter a missing file that
doesn't have a size+mtime match yet, so I allow missing files to be
delayed until the receiving-side scan is complete (at which point we
check to see if a match has shown up yet or not).
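
The deferral logic described above can be sketched roughly as follows.
This is a toy model in Python, not the code from the patch: the class
and method names are hypothetical, and the real scan is interleaved
with rsync's delete pass rather than driven by explicit calls.

```python
from collections import defaultdict

class RenameMatcher:
    """Toy model: match missing files to scanned files by (size, mtime)."""

    def __init__(self):
        self.by_key = defaultdict(list)  # (size, mtime) -> paths seen by the scan
        self.deferred = []               # missing files with no match so far
        self.scan_done = False

    def note_scanned(self, path, size, mtime):
        """Record a local file found by the receiving-side scan."""
        self.by_key[(size, mtime)].append(path)

    def lookup_missing(self, path, size, mtime):
        """Return a candidate basis path, or None if there is none yet.

        While the scan is still running, an unmatched file is queued
        so it can be re-checked later.
        """
        candidates = self.by_key.get((size, mtime))
        if candidates:
            return candidates[0]
        if not self.scan_done:
            self.deferred.append((path, size, mtime))
        return None

    def finish_scan(self):
        """Mark the scan complete and re-check every deferred file."""
        self.scan_done = True
        pending, self.deferred = self.deferred, []
        return {
            path: (self.by_key.get((size, mtime)) or [None])[0]
            for path, size, mtime in pending
        }
```

A missing file looked up before its old copy has been scanned is
queued, and finish_scan() reports whether a match turned up in the
meantime.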

My code also attempts to match up files even when they're not missing.
This works to the fullest extent when a delete-before scan is in effect,
but even without one it handles the case of rotating log files quite
nicely (associating all the moved files together as you would expect).

A patch for the CVS version is here:

    http://opencoder.net/detect-renames.diff

The code is still a little ugly, but it does appear to work well in my
limited testing.  If I like the idea, I'll look into how to share the
code for the delete scan in a way that is not as ugly as the current
logic.

> $ cp foo foo.orig; edit foo
> 
> Not using the old foo as the basis for foo.orig just because foo
> changed really hurts.

If the user uses "cp -p foo foo.orig" (which preserves the mtime), we
will find it.  The patch could be extended to switch from size+mtime to
size+checksum, but I haven't done that yet (and checksumming is so slow
that most folks tend to avoid it).
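
To make the trade-off concrete, here is a small Python sketch of the two
match keys.  Both function names are hypothetical, and SHA-256 is used
only because it is readily available in Python; it is an assumption of
this sketch, not the checksum rsync uses.

```python
import hashlib
import os

def size_mtime_key(path):
    """Cheap key: one stat() call, no file data read."""
    st = os.stat(path)
    return (st.st_size, int(st.st_mtime))

def size_checksum_key(path, chunk_size=1 << 16):
    """Expensive key: reads the whole file to hash its contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return (os.path.getsize(path), digest.hexdigest())
```

A plain "cp foo foo.orig" gives the copy a new mtime, so the two files
differ under size_mtime_key() but still agree under
size_checksum_key(); the cost is reading every candidate file in full.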

..wayne..