duplicated file removal: call for comment

jw schultz jw at pegasys.ws
Wed Feb 12 07:02:02 EST 2003


On Tue, Feb 11, 2003 at 05:27:51PM +0100, Thomas Osterried wrote:
> This is a call for comments, regarding what you do expect when copying
> multiple source tree roots leading to the same directory root, using
> rsync.
> 
> This problem may be discussed now, because in versions before
> rsync-2.5.6, the algorithm for removing the so called "duplicated files"
> was broken.
> That's why we expect nobody used it anyway in earlier versions - but who
> knows..

Part of this is something i've idly wondered about.  It is,
however, important.  Let me see if i can succinctly describe
the issues and give my opinion.

To start with i want to clarify terminology.  I will not use
the word delete except when referring to actual deletion of
files from disk.

For the sake of illustration i'll assume three trees
tree[123].  Each tree has a file i'll call 'overlap' which
contains the name of the tree.  I'll refer to the file
contents by the tree name.

tree1/overlap
tree2/overlap
tree3/overlap

Whether we do "rsync --files-from"
or "rsync -r tree1/ tree2/ tree3/ dest"
isn't too critical to the discussion.
I will assume that the --files-from list always lists the
overlap file in all three trees.

The first, and outermost, issue is duplicates, whether
within the list or from combining multiple trees (-r) as in
my illustration.  What contents wind up in dest/overlap?
I'm inclined to think this should be the last one specified.
In other words, tree3.  This would be consistent with
progressive copying, where each later copy overwrites the
earlier one.

The second issue is what happens if one is not on disk.  Say
that tree3/overlap doesn't actually exist.  Here i'm inclined
to stick with the progressive copying model and say it should
be the last one that exists, so that tree2 would take over.
Kind of like runners-up in a beauty contest.  Should tree3
be unable to fulfill her obligations, tree2 will assume the
crown.

This is then somewhat complicated by the --delete option.
Should the destination be deleted if it is missing from the
source tree?  Which source tree?  My strongest inclination
here is to stick with my response to the second issue.  The
file will only be deleted if none of the source locations
that correspond to the destination is present.

In sum, what i am most inclined towards here is that
duplicate resolution incorporate an existence check: if
the file exists (size=0 is OK), each later reference will
replace the earlier one.
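
The policy above can be sketched in a few lines of C.  This
is only an illustration, not rsync code: struct candidate,
pick_winner() and should_delete() are hypothetical names, and
the existence check is reduced to a flag that the caller is
assumed to have filled in (for example from a stat call).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one source copy of a path, in the order specified. */
struct candidate {
    const char *source;  /* e.g. "tree1" */
    int exists;          /* nonzero if the file is on disk (size 0 counts) */
};

/* Return the index of the last candidate that exists, or -1 if none does. */
int pick_winner(const struct candidate *c, size_t n)
{
    int winner = -1;

    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (c[i].exists)
            winner = (int)i;  /* a later existing copy replaces the earlier one */
    }
    return winner;
}

/* Under --delete, remove the destination only if no source copy exists. */
int should_delete(const struct candidate *c, size_t n)
{
    return pick_winner(c, n) < 0;
}
```

With all three trees present this picks tree3; with
tree3/overlap missing it falls back to tree2; only when all
three are missing would the destination be a candidate for
deletion.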

Regardless of the choice of first specified or last
specified, or whether to fall back on non-existent files,
the order of specification, not lexical order, needs to be
used.  Otherwise the user has no control over the merge
process.

This unfortunately does mean that a means of preserving the
initial sequence must be incorporated, or the qsort approach
to finding duplicates would have to be abandoned.
This could be as simple as running qsort on an array of
indices to flist->files instead of flist->files itself.
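
A rough sketch of the index-array idea follows.  It is not
taken from the rsync source: names stands in for
flist->files, and mark_duplicates(), cmp_index() and the
keep array are hypothetical.  Since qsort is not guaranteed
to be stable, the comparison breaks ties on the original
index so that specification order survives within each run
of equal names.  The existence check is left out here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort takes no context argument, so the names are reached via a static. */
static const char **sort_names;

static int cmp_index(const void *a, const void *b)
{
    size_t ia = *(const size_t *)a, ib = *(const size_t *)b;
    int c = strcmp(sort_names[ia], sort_names[ib]);

    if (c != 0)
        return c;
    return (ia > ib) - (ia < ib);  /* equal names: keep specification order */
}

/* Set keep[i] to 1 for the last-specified entry of each name, 0 otherwise.
 * The names array itself is never reordered.
 * Returns 0 on success, -1 if memory cannot be allocated. */
int mark_duplicates(const char **names, size_t n, unsigned char *keep)
{
    size_t *idx;

    if (n == 0)
        return 0;
    assert(names != NULL && keep != NULL);
    idx = malloc(n * sizeof *idx);
    if (idx == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        idx[i] = i;

    sort_names = names;
    qsort(idx, n, sizeof *idx, cmp_index);

    /* Within a run of equal names, the final element has the highest index. */
    for (size_t i = 0; i < n; i++) {
        int last_of_run = (i + 1 == n) ||
            strcmp(names[idx[i]], names[idx[i + 1]]) != 0;
        keep[idx[i]] = (unsigned char)last_of_run;
    }
    free(idx);
    return 0;
}
```

Because only the index array is sorted, the original
sequence is still available afterwards, so switching from
last specified to first specified would only mean testing
for the start of a run instead of its end.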



