"intelligent" rsync scripts?
c.shoemaker at cox.net
Mon Nov 7 22:03:30 GMT 2005
On Mon, Nov 07, 2005 at 12:01:35PM -0800, Wayne Davison wrote:
> On Wed, Oct 26, 2005 at 02:04:34PM -0400, Chris Shoemaker wrote:
> > That option should imply, at least, --checksum, and --delete-after if
> > --delete is used at all.
> I don't think it needs --checksum because rsync can simply use a
> non-exact match as the basis file for the transfer.
Hmm... I think you're right. I need to remember that it's not
necessary to _always_ avoid the use of an incorrect basis. It's just
more efficient to make it unlikely.
> > For each file on the sender which is *missing* from the receiver, it
> > needs to search the checksums of all of receiver's existing files for
> > a checksum match.
> I'd make it: (1) lookup a file-size + mod-time + file-name match;
> if found, copy that file locally and consider the update done.
I don't know about "consider the update done". This is less strict
than current behavior, since the paths *are* different. Someone may
depend on that behavior (unlikely, I know). I think it's a little
safer to still checksum the moved file. IOW, treat it the same way
you would a "--fuzzy" match. (assuming I understand that feature.)
> lookup a file-size + mod-time match OR just a file-name match, and use
> that file as a basis file in the transfer, which can greatly speed up
> the transfer if the file is largely the same as the new file.
Yeah, I think I'm saying just treat (1) and (2) the same way. OTOH,
if the behavior is optional and documented, I could definitely see
treating (1) as an exact match. I guess I was thinking that treating
(1) as a fuzzy match would be required if the rename detection was
default behavior. (And, depending on the cost, I wouldn't necessarily
mind it being the default.)
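
To make the two tiers concrete, here is a rough sketch of how a missing file might be classified against the receiver's existing files. It is only an illustration under stated assumptions: FileInfo and find_candidate are hypothetical names (nothing here is rsync's actual code), and the linear scan stands in for whatever lookup structure ends up being used.

```python
from collections import namedtuple

# Hypothetical record for one file; rsync's real file-list entries differ.
FileInfo = namedtuple("FileInfo", "path size mtime")


def basename(path):
    """Return the last path component."""
    return path.rsplit("/", 1)[-1]


def find_candidate(missing, receiver_files):
    """Return (kind, FileInfo) for the best local candidate, or (None, None).

    kind is "exact" for a size + mod-time + name match (tier 1) and
    "basis" for a size + mod-time match OR a name match (tier 2).
    """
    fuzzy = None
    for f in receiver_files:
        same_stat = (f.size, f.mtime) == (missing.size, missing.mtime)
        same_name = basename(f.path) == basename(missing.path)
        if same_stat and same_name:
            return ("exact", f)
        if fuzzy is None and (same_stat or same_name):
            fuzzy = f  # remember the first weaker match, keep looking for an exact one
    if fuzzy is not None:
        return ("basis", fuzzy)
    return (None, None)
```

Whether an "exact" result is copied locally and trusted, or still run through the normal checksummed transfer like a "basis" result, is exactly the policy question above; the sketch only reports which tier matched.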
> The way I see this being implemented is to add a hash-table algorithm to
> the code so that rsync can hash several things as the names arrive
> during the opening file-list reception stage: the receiving side would
> take every arriving directory name (starting with the dest dir) and
> lookup the names in the local version of that dir, creating a hash table
> based on file-size + mod-time, a hash table based on file-name (for
> regular files), and a hash table based on any directory names it finds
Meaning just the receiver directories NOT in the arriving list, or
*every* receiver directory?
> (this attempts to do the receiving side scanning incrementally as the
> names arrive instead of during a separate pass after the file-list is
But you can't do the lookups until you've received the entire
file-list, right? Otherwise you may not have yet seen the "originals"
of the moved files.
> As each directory gets scanned, that name gets removed from
> the directory-name hash.
You mean it gets removed when it's received? Why even add it then?
I'm probably missing something here.
> At the end of the file-list reception, any
> remaining directory names in the dir-hash table also get scanned
> (recursively). This would give us the needed info in the generator to
> allow it to lookup missing files to check for exact or close matches.
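
One possible reading of the scanning scheme quoted above, sketched under assumptions: local_tree is a hypothetical in-memory stand-in for reading the receiver's directories, scan_incrementally is an invented name, and plain dicts and a set stand in for the hash tables. Directories named by the sender are scanned as they arrive; subdirectories found along the way wait in a pending set and are scanned at the end if the sender never mentioned them.

```python
def scan_incrementally(arriving_dirs, local_tree, dest="."):
    """Index receiver files by (size, mtime) and by name.

    local_tree maps a directory path to (subdir_names, files), where files
    is a list of (name, size, mtime) tuples. This is a hypothetical model
    of the receiver's filesystem, not an rsync data structure.
    """
    by_stat, by_name = {}, {}
    pending_dirs = set()  # directories seen locally but not yet scanned
    scanned = set()

    def scan(d):
        if d in scanned or d not in local_tree:
            return
        scanned.add(d)
        pending_dirs.discard(d)  # scanned now, so no longer pending
        subdirs, files = local_tree[d]
        for name, size, mtime in files:
            path = f"{d}/{name}"
            by_stat.setdefault((size, mtime), []).append(path)
            by_name.setdefault(name, []).append(path)
        for sub in subdirs:
            subpath = f"{d}/{sub}"
            if subpath not in scanned:
                pending_dirs.add(subpath)

    # During file-list reception: scan each directory as its name arrives.
    for d in [dest, *arriving_dirs]:
        scan(d)
    # After reception: scan whatever the sender never named, recursively.
    while pending_dirs:
        scan(pending_dirs.pop())
    return by_stat, by_name
```

This version indexes every file it scans; whether files that also appear in the arriving list should be left out of the tables is not settled by the sketch.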
> One vital decision is picking a good hash-table algorithm that allows
> the table to grow larger efficiently (since we don't know how many files
> we need to hash before-hand). I'm thinking that trying the libiberty
> hashtab.c version might be a good starting point. Suggestions? Perhaps
> a better idea than a general-purpose hash-table algorithm might be to
> just collect all the data in an array (expanding the array as needed)
> and then sort it when we're all done. This would use a binary-search
> algorithm to find a match. The reason this might be better is that it
> is likely that the number of missing files will not be a huge percentage
> of the transfer, so making the creation of the "hash table" efficient
> might be more important than making the lookup of missing files
> maximally efficient.
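
For comparison, a minimal sketch of the array-and-sort alternative, assuming the lookup key is file-size + mod-time. SortedIndex is a hypothetical name, and Python's bisect module stands in for a hand-written binary search. Insertion is a plain append, and the sort is deferred until the first lookup.

```python
import bisect


class SortedIndex:
    """Append-only array of ((size, mtime), path) pairs, sorted lazily."""

    def __init__(self):
        self._items = []
        self._sorted = False

    def add(self, size, mtime, path):
        # Cheap insertion: no hashing and no rebalancing, just an append.
        self._items.append(((size, mtime), path))
        self._sorted = False

    def lookup(self, size, mtime):
        """Return every path stored under (size, mtime)."""
        if not self._sorted:
            self._items.sort()
            self._sorted = True
        key = (size, mtime)
        # (key,) sorts before any (key, path), so this finds the first entry for key.
        i = bisect.bisect_left(self._items, (key,))
        hits = []
        while i < len(self._items) and self._items[i][0] == key:
            hits.append(self._items[i][1])
            i += 1
        return hits
```

With n entries, building costs n appends plus one O(n log n) sort, and each lookup is O(log n) plus the number of hits. That trades slower lookups for cheaper insertion compared with a growable hash table.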
# of insertions = # of receiver files not in transfer
# of lookups = # of sender files missing from receiver
I can't think of a reason why either term would dominate. But,
pipeline concerns may make it better to push the cost into the later
operation, i.e. lookup. That would suggest using an array for
> Have you done any work on this, Chris? If not, I'm thinking of looking
> into this soon.
Nothing more than thinking. It's been #3 on my list since the
original post, but #1 and #2 aren't wrapping up quickly. I was hoping
you'd like the idea enough to beat me to it. :)