"intelligent" rsync scripts?

Mon Nov 7 22:03:30 GMT 2005

On Mon, Nov 07, 2005 at 12:01:35PM -0800, Wayne Davison wrote:
> On Wed, Oct 26, 2005 at 02:04:34PM -0400, Chris Shoemaker wrote:
> > That option should imply at least --checksum, and --delete-after if
> > --delete is used at all.
> 
> I don't think it needs --checksum because rsync can simply use a
> non-exact match as the basis file for the transfer.

Hmm... I think you're right.  I need to remember that it's not
necessary to _always_ avoid the use of an incorrect basis.  It's just
more efficient to make it unlikely.

> 
> > For each file on the sender which is *missing* from the receiver, it
> > needs to search the checksums of all of receiver's existing files for
> > a checksum match.
> 
> I'd make it: (1) look up a file-size + mod-time + file-name match;
> if found, copy that file locally and consider the update done.

I don't know about "consider the update done".  This is less strict
than current behavior, since the paths *are* different.  Someone may
depend on that behavior (unlikely, I know).  I think it's a little
safer to still checksum the moved file.  IOW, treat it the same way
you would a "--fuzzy" match.  (assuming I understand that feature.)

> (2)
> look up a file-size + mod-time match OR just a file-name match, and use
> that file as a basis file in the transfer, which can greatly speed up
> the transfer if the file is largely the same as the new file.

Yeah, I think I'm saying just treat (1) and (2) the same way.  OTOH,
if the behavior is optional and documented, I could definitely see
treating (1) as an exact match.  I guess I was thinking that treating
(1) as a fuzzy match would be required if the rename detection was
default behavior.  (And, depending on the cost, I wouldn't necessarily
mind it being the default.)
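
To make the two tiers concrete, here is a rough sketch in Python (not
rsync's C).  Entry, build_indexes and find_basis are hypothetical names
invented for illustration, not anything in the rsync source.  It returns
"exact" for a (1)-style match and "fuzzy" for a (2)-style match; whether
an "exact" result is then trusted or still checksummed is the policy
question above, and the sketch leaves that to the caller.

```python
import os
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    path: str    # path relative to the transfer root
    size: int    # file size in bytes
    mtime: int   # modification time, seconds


def build_indexes(entries):
    """Index receiver-side files by (size, mtime) and by base name."""
    by_size_mtime = defaultdict(list)
    by_name = defaultdict(list)
    for e in entries:
        by_size_mtime[(e.size, e.mtime)].append(e)
        by_name[os.path.basename(e.path)].append(e)
    return by_size_mtime, by_name


def find_basis(missing, by_size_mtime, by_name):
    """Return (kind, entry); kind is 'exact', 'fuzzy', or None."""
    name = os.path.basename(missing.path)
    same_meta = by_size_mtime.get((missing.size, missing.mtime), [])
    # Tier (1): size + mod-time + file-name all agree.
    for e in same_meta:
        if os.path.basename(e.path) == name:
            return "exact", e
    # Tier (2): size + mod-time agree, OR only the file-name agrees.
    if same_meta:
        return "fuzzy", same_meta[0]
    same_name = by_name.get(name, [])
    if same_name:
        return "fuzzy", same_name[0]
    return None, None
```

Treating (1) and (2) the same way would simply mean the caller ignores
the difference between "exact" and "fuzzy" and always runs the normal
delta transfer against the returned basis.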

> 
> The way I see this being implemented is to add a hash-table algorithm to
> the code so that rsync can hash several things as the names arrive
> during the opening file-list reception stage:  the receiving side would
> take every arriving directory name (starting with the dest dir) and
> look up the names in the local version of that dir, creating a hash table
> based on file-size + mod-time, a hash table based on file-name (for
> regular files), and a hash table based on any directory names it finds

Meaning just the receiver directories NOT in the arriving list, or
*every* receiver directory?

> (this attempts to do the receiving side scanning incrementally as the
> names arrive instead of during a separate pass after the file-list is
> finished).  

But you can't do the lookups until you've received the entire
file-list, right?  Otherwise you may not have yet seen the "originals"
of the moved files.

> As each directory gets scanned, that name gets removed from
> the directory-name hash.  

You mean it gets removed when it's received?  Why even add it then?
I'm probably missing something here.

> At the end of the file-list reception, any
> remaining directory names in the dir-hash table also get scanned
> (recursively).  This would give us the needed info in the generator to
> allow it to look up missing files to check for exact or close matches.

Yes.
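
One possible reading of the scanning scheme quoted above, sketched in
Python.  This is an interpretation, not the actual design: scan_receiver
and local_tree are hypothetical, and an in-memory dict stands in for the
real filesystem.  Under this reading, a local directory enters the
pending set when its parent is scanned, and leaves it when the sender
names it; whatever is still pending at the end exists only on the
receiver and gets scanned recursively.

```python
def scan_receiver(arriving_dirs, local_tree):
    """Collect receiver-side file paths.

    local_tree maps a directory path to (subdir_names, file_names)
    as found on the receiver; it stands in for readdir() calls.
    """
    pending = set()   # local dirs seen but not yet scanned
    scanned = set()
    files = []

    def scan(d):
        if d in scanned or d not in local_tree:
            return
        scanned.add(d)
        pending.discard(d)
        subdirs, names = local_tree[d]
        files.extend(f"{d}/{n}" for n in names)
        for s in subdirs:
            child = f"{d}/{s}"
            if child not in scanned:
                pending.add(child)

    # Incremental phase: scan each directory as its name arrives.
    for d in arriving_dirs:
        scan(d)
    # Final phase: directories the sender never mentioned.
    while pending:
        scan(pending.pop())
    return files
```

The lookups themselves would still have to wait until this function
returns, since a moved file's original may live in a directory that is
only reached in the final phase.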

> 
> One vital decision is picking a good hash-table algorithm that allows
> the table to grow larger efficiently (since we don't know how many files
> we need to hash beforehand).  I'm thinking that trying the libiberty
> hashtab.c version might be a good starting point.  Suggestions?  Perhaps
> a better idea than a general-purpose hash-table algorithm might be to
> just collect all the data in an array (expanding the array as needed)
> and then sort it when we're all done.  This would use a binary-search
> algorithm to find a match.  The reason this might be better is that it
> is likely that the number of missing files will not be a huge percentage
> of the transfer, so making the creation of the "hash table" efficient
> might be more important than making the lookup of missing files
> maximally efficient.

# of insertions = # of receiver files not in transfer
# of lookups = # of sender files missing from receiver

I can't think of a reason why either term would dominate.  But
pipeline concerns may make it better to push the cost into the later
operation, i.e. lookup.  That would suggest using an array, where each
insertion is a constant-cost append and the sort is paid once at the end.
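
A small sketch of the array-then-sort alternative, in Python for
brevity.  SortedIndex is a hypothetical name, and keying on
(size, mtime) is just one of the keys discussed; a C version would
presumably use a growable array with qsort() and bsearch().

```python
import bisect


class SortedIndex:
    """Append-only array that is sorted once, then binary-searched."""

    def __init__(self):
        self._items = []      # (size, mtime, path) tuples
        self._sorted = True

    def add(self, size, mtime, path):
        # Cheap append while the file list is still arriving.
        self._items.append((size, mtime, path))
        self._sorted = False

    def finish(self):
        # One O(n log n) sort after all insertions are done.
        self._items.sort()
        self._sorted = True

    def lookup(self, size, mtime):
        """Return all paths whose size and mtime match."""
        if not self._sorted:
            self.finish()
        # (size, mtime) sorts before any (size, mtime, path), so
        # bisect_left lands on the first matching tuple, if any.
        lo = bisect.bisect_left(self._items, (size, mtime))
        matches = []
        for s, m, p in self._items[lo:]:
            if (s, m) != (size, mtime):
                break
            matches.append(p)
        return matches
```

Each lookup costs O(log n) plus the number of matches, so the total work
is dominated by the single sort unless the number of missing files is
large.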

> 
> Have you done any work on this, Chris?  If not, I'm thinking of looking
> into this soon.

Nothing more than thinking.  It's been #3 on my list since the
original post, but #1 and #2 aren't wrapping up quickly.  I was hoping
you'd like the idea enough to beat me to it.  :)

-chris

> 
> ..wayne..