"intelligent" rsync scripts?

Mon Nov 7 20:01:35 GMT 2005

On Wed, Oct 26, 2005 at 02:04:34PM -0400, Chris Shoemaker wrote:
> That option should imply at least, --checksum and --delete-after if
> --delete at all.

I don't think it needs --checksum because rsync can simply use a
non-exact match as the basis file for the transfer.

> For each file on the sender which is *missing* from the receiver, it
> needs to search the checksums of all of receiver's existing files for
> a checksum match.

I'd make it: (1) look up a file-size + mod-time + file-name match;
if found, copy that file locally and consider the update done. (2)
look up a file-size + mod-time match OR just a file-name match, and use
that file as a basis file in the transfer, which can greatly speed up
the transfer if the file is largely the same as the new file.
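
To make the two-step lookup concrete, here is a rough Python sketch
(not rsync code).  The Entry type and the build_indexes() and
find_match() names are made up for illustration, and taking the first
candidate as the basis file is just an assumption; a real version
might want to rank the candidates.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical record for one regular file (not an rsync struct)."""
    name: str   # file name without directory
    size: int
    mtime: int
    path: str   # where the file lives on the receiver


def build_indexes(entries):
    """Index the receiver's files by (size, mtime) and by name."""
    by_size_mtime = defaultdict(list)
    by_name = defaultdict(list)
    for e in entries:
        by_size_mtime[(e.size, e.mtime)].append(e)
        by_name[e.name].append(e)
    return by_size_mtime, by_name


def find_match(missing, by_size_mtime, by_name):
    """Return ("copy", entry), ("basis", entry) or (None, None)."""
    same_stat = by_size_mtime.get((missing.size, missing.mtime), [])
    # Step 1: size + mod-time + name all agree, so copy locally.
    for e in same_stat:
        if e.name == missing.name:
            return "copy", e
    # Step 2: size + mod-time agree OR only the name agrees, so use
    # the file as a basis for the normal delta transfer.
    if same_stat:
        return "basis", same_stat[0]
    same_name = by_name.get(missing.name, [])
    if same_name:
        return "basis", same_name[0]
    return None, None
```

A file that was only moved to another directory would hit step 1,
while a renamed or slightly edited file would fall through to step 2.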

The way I see this being implemented is to add a hash-table algorithm to
the code so that rsync can hash several things as the names arrive
during the opening file-list reception stage:  the receiving side would
take every arriving directory name (starting with the dest dir) and
look up the names in the local version of that dir, creating a hash table
based on file-size + mod-time, a hash table based on file-name (for
regular files), and a hash table based on any directory names it finds
(this attempts to do the receiving side scanning incrementally as the
names arrive instead of during a separate pass after the file-list is
finished).  As each directory gets scanned, that name gets removed from
the directory-name hash.  At the end of the file-list reception, any
remaining directory names in the dir-hash table also get scanned
(recursively).  This would give us the needed info in the generator to
allow it to look up missing files to check for exact or close matches.
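
A rough Python sketch of that incremental scan follows (again, not
rsync code).  The ReceiverIndex class and its method names are
hypothetical, a plain set stands in for the directory-name hash, and
paths are kept relative to the dest dir purely as a convenience.

```python
import os
from collections import defaultdict


class ReceiverIndex:
    """Hypothetical receiver-side index built while dir names arrive."""

    def __init__(self, dest_root):
        self.dest_root = dest_root
        self.by_size_mtime = defaultdict(list)  # (size, mtime) -> paths
        self.by_name = defaultdict(list)        # file name -> paths
        self.pending_dirs = set()               # dirs seen, not yet scanned
        self.scanned_dirs = set()

    def _scan(self, rel_dir):
        # Scanning a dir removes its name from the directory "hash".
        self.pending_dirs.discard(rel_dir)
        if rel_dir in self.scanned_dirs:
            return
        self.scanned_dirs.add(rel_dir)
        try:
            it = os.scandir(os.path.join(self.dest_root, rel_dir))
        except OSError:
            return  # the dir does not exist (or is unreadable) locally
        with it:
            for de in it:
                rel = os.path.join(rel_dir, de.name)
                if de.is_dir(follow_symlinks=False):
                    if rel not in self.scanned_dirs:
                        self.pending_dirs.add(rel)
                elif de.is_file(follow_symlinks=False):
                    st = de.stat(follow_symlinks=False)
                    key = (st.st_size, int(st.st_mtime))
                    self.by_size_mtime[key].append(rel)
                    self.by_name[de.name].append(rel)

    def on_dir_name(self, rel_dir):
        """Call for each directory name in the incoming file list."""
        self._scan(rel_dir)

    def finish(self):
        """Scan any local dirs the sender never mentioned, recursively."""
        while self.pending_dirs:
            self._scan(self.pending_dirs.pop())
```

The caller would pass "" for the dest dir itself, then each
directory name as it arrives, and call finish() once the file list is
complete.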

One vital decision is picking a good hash-table algorithm that allows
the table to grow larger efficiently (since we don't know how many files
we need to hash beforehand).  I'm thinking that trying the libiberty
hashtab.c version might be a good starting point.  Suggestions?  Perhaps
a better idea than a general-purpose hash-table algorithm might be to
just collect all the data in an array (expanding the array as needed)
and then sort it when we're all done.  This would use a binary-search
algorithm to find a match.  The reason this might be better is that it
is likely that the number of missing files will not be a huge percentage
of the transfer, so making the creation of the "hash table" efficient
might be more important than making the lookup of missing files
maximally efficient.
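
For comparison, the array-plus-sort alternative could look something
like this in Python (a sketch only; the SortedIndex name and its
methods are hypothetical, and it keys on file-size + mod-time alone).
Appends are cheap while names arrive, and the sort cost is paid once.

```python
import bisect


class SortedIndex:
    """Append-only array that is sorted once, then binary-searched."""

    def __init__(self):
        self._items = []      # (size, mtime, path) tuples
        self._sorted = False

    def add(self, size, mtime, path):
        # Cheap append; the list grows as needed.
        self._items.append((size, mtime, path))
        self._sorted = False

    def finalize(self):
        # One sort after all the data has been collected.
        self._items.sort()
        self._sorted = True

    def lookup(self, size, mtime):
        """Return the paths of all entries with this size and mtime."""
        if not self._sorted:
            self.finalize()
        # A 2-tuple sorts before every 3-tuple that shares its prefix,
        # so bisect_left lands on the first matching entry (if any).
        i = bisect.bisect_left(self._items, (size, mtime))
        paths = []
        while i < len(self._items) and self._items[i][:2] == (size, mtime):
            paths.append(self._items[i][2])
            i += 1
        return paths
```

Each lookup costs O(log n) plus the number of matches, which should
be fine if only a small fraction of the files turn out to be missing.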

Have you done any work on this, Chris?  If not, I'm thinking of looking
into this soon.

..wayne..