"intelligent" rsync scripts?
c.shoemaker at cox.net
Tue Nov 8 03:59:21 GMT 2005
On Mon, Nov 07, 2005 at 02:37:48PM -0800, Wayne Davison wrote:
> On Mon, Nov 07, 2005 at 05:03:30PM -0500, Chris Shoemaker wrote:
> > Yeah, I think I'm saying just treat (1) and (2) the same way. OTOH,
> > if the behavior is optional and documented, I could definitely see
> > treating (1) as an exact match.
> Yes, perhaps it would be better to let the user decide how strict to be.
> > But you can't do the lookups until you've received the entire
> > file-list, right?
> We can do the hashing of what files are present on the receiving side.
> The purpose is to create a database of files that will be used later
> when the generator is trying to find a match for a file that is missing
> (which we will discover later during the normal generator pass).
> > You mean [the dir] gets removed when it's received? Why even add it then?
> Because we're creating a list of extra directories that aren't on the
> sending side and we're scanning the local directory as soon as we see
> its name in the received file list, which will cause us to hash names
> that may later turn out to be in the list that the sender sends to us.
Ok, so the purpose of the directory list is to make sure all the local
directories are scanned for potential basis files, even directories not
mentioned in the transmitted file-list, right?  I didn't realize that
would require a table and delaying the scan of unknown directories
until *after* the file-list scan was done. I assumed *all* the local
files (even those in unknown directories) could be hashed on the first
pass through the file-list.
> > # of insertions = # of receiver files not in transfer
> In my described algorithm it was "# of insertions = all files on the
> receiving side" because we don't know what will be in a particular
> directory until after the sender recurses clear down to the bottom of
> all child directories and comes back up and sends the last filename at
> that directory's level. If we change the sender to send all the files
> (including all directory names) at a single level before going down into
> a subdir, we could code up the local scan to occur at the point where
> either the level changes or the dir changes at the current level. Such
> a change would not be compatible with older rsync receivers, though (due
> to how the current receiver expects to be able to mark nested files in
> its received file-list).
> Your comment does remind me that we don't want to pick an alternate
> basis file that is currently in the transfer since that file may
> possibly be updated (which can cause problems if it happens at the wrong
> time).
Are you saying only unchanged files are available as alternate basis
files? If we can, I think it's worth avoiding this restriction. I
imagine a case inspired by logrotate(8):
FILE   ---> renamed to --->   FILE
log                           log.1
log.1                         log.2
log.2                         log.3
log.3                         log.4
log (a new file)
where log.4 appears to be a missing file but is really just a renamed
log.3.  And log.3, log.2 and log.1 will probably be retransmitted in
full (that's a problem for another day, but it is why I was
thinking of a hashtable of all files' checksums).  But the point here
is that it'd be nice to be able to use (the old) log.3 as the basis
for log.4, even while updating to the new log.3.
In general, I think that when a file is renamed, it's *very often*
precisely because the original is changing. I.e. It's a backup.
$ cp foo foo.orig; edit foo
Not using the old foo as the basis for foo.orig just because foo
changed really hurts. This is worth getting right.
> Thus, there would need to be a lot of hash-table deletions going
> on in my imagined algorithm in the file-name hash as well as the
> dir-name hash.