duplicated file removal: call for comment

Thomas Osterried thomas at osterried.de
Wed Feb 12 03:27:51 EST 2003


This is a call for comments, regarding what you do expect when copying
multible source tree roots leading to the same directory root, using
rsync.

This problem may be discussed now, because in versions before
rsync-2.5.6, the algorithm for removing the so called "duplicated files"
was broken.
That's why we expect nobody used it anyway in earlier versions - but who
knows..


Example:
--------

Whe have treeA/ treeB/ treeC/ and treeD/, which may or may not have
common files or directories in the depth. the contense of a file in
treeA/etc/etctest for e.g. may differ from treeD/etc/etctest. In my
example, treeA/etc/etctest contains the line "treeA", and alike.

Let's look what's the way of standard unix programs:

$ mkdir /tmp/test
$ cp -r tree*/* /tmp/test/
$ cat /tmp/test/etc/etctest 
treeD
$ rm -rf /tmp/test/*
$ scp -r tree*/* thomas at localhost:/tmp/test
[..]
$ cat /tmp/test/etc/etctest 
treeD
$ 

Using these programs, the last reference of a file wins.
Nb: this is, what a user should suspect.
Note that each duplicated file has been copied / transmitted.


rsync has routine which removes duplicates from given source-trees
before actually transmitting the contense to the destination.

Due to a bug in earlier versions, it was hardly predictable which file
from which source-tree would win. Especially, if for e.g. some trees
in the middle or the last referenced trees did not have such a file.


Wayne Davison has done a fix for version 2.5.6.
A while ago, i advised a patch to the list
(http://www.mail-archive.com/rsync@lists.samba.org/msg03117.html).

Our both patches do fix the problem, but with a contrary result: With
my patch the last reference of a duplicated file wins (in my example,
treeD's etc/etctest); with Wayne's fix the first reference succeeds
(treeA's etc/etctest).


When looking at the code, i assume my proposal was initially intended;
then, some years ago, the code was changed and the duplicate file
deletion mechanism went broken.


The project where i use rsync relys on a correctly working and
predictable file-deletion mechanism.
Therfore, this method should not change or flip/flop in future versions.

That's why i like to discuss the both approaches, leading to one final
proposal.


If you're interested in detail, read my discussion with Wayne, who
allowed me to quote.


Regards,

	- Thomas Osterried

=====

My proposal for a change for rsync-2.5.6:
-----------------------------------------

diff -r rsync-2.5.6/flist.c ../rsync-2.5.6-patched/flist.c
1274c1274
<                                       name, i);
---
>                                       prev_name, i-1);
1281c1281
<                               flist->files[i][0] = null_file;
---
>                               flist->files[i-1][0] = null_file;
1283c1283
<                               free_file(flist->files[i]);
---
>                               free_file(flist->files[i-1]);


=====

Our discussion
--------------

My mail:
Subject: rsync: duplicate name deletion - problems with rsync-2.5.6

Dear Wayne,

i'd like to diskuss some problems with your fix of rsync-2.5.6:
|   * Fixed a crash that would occur when sending a list of files that
|     contains a duplicate name (if it sorts to the end of the file
|     list) and using --delete.  (Wayne Davison)

my initial proposal for a fix i posted, as suggested at the project's web site,
to the rsync mailing list was:
  http://www.mail-archive.com/rsync@lists.samba.org/msg03117.html
unfortunately, the only response to this issue was "Huh? I don't care
about this at all."

our both fixes for clean_flist() work. but with your fix, the clearing
process will delete the second and following references of duplicated
files.
while in my approach the last reference of a duplicated file wins.

i think, that deleting all but the last reference
- is the right way because it's exactly what you expect and are experienced
  with, for e.g. when copying with "cp tree*/etc/etctest dest/etc/")
- was the idea of the author of clean_flist(), because
    - he annonced
      rprintf(FINFO, "removing duplicate name %s from file list %d\n", f_name(flist->files[i - 1]), i - 1);
    - because he startet his for() loop with i = 1, comparing the backward
      list-entries at i-1.
  the only mistake was, to free the current list entry instead of the previous.
  see 1).


because the rsync protocol version number (26) is the same, i'm unable
to detect which behaviour of rsync considering duplicates i should expect
at the target site.

personaly, i would much appreciate that the last reference of a
duplicated file wins.
we use rsync for synchronizing a pool of 174 computers at university. rsync
is the heart of the conzept where we have one linux installation on a master
server (the "class tree"), several hardware- and configuration- memberships
("group trees") and machine-specific configurations.
here we make intensively use of rsync's file-deletion mechanism, and
rely on the perfect functionality. as an example, an rsync process with its
arguments:
21934 ?        R      0:00 /simpel2/public/install-linux-i386/rsync -a -W
--timeout=1800 --numeric-ids --port 288 -v --stats --delete
--exclude-from=/simpel2/adm/exclude_from_class_tree.txt /simpel2/trees/classes/linux-i386/ws/ /simpel2/trees/kernel/linux-2.4.19/modules-i386/ /simpel2/trees/groups/auto/linux-lilo-hda/ /simpel2/trees/groups/all-defaults/ /simpel2/trees/groups/linux-defaults/ /simpel2/trees/groups/all-lilo/ /simpel2/trees/groups/all-inf/ /simpel2/trees/groups/linux-X11sis/ /simpel2/trees/machines/ws/inf.fu-berlin.de/pc145/ /simpel2/public/update-all/ simpel-update at 160.45.112.145::simpel-update/


my suggestion is, to change your fix like in 2).


Appendix 1:
-----------
this bug lead to a flipflop
sh-2.05a# /export/local-1/src/rsync-origs/rsync-2.5.5/rsync -a -W --numeric-ids
 --delete treeA/ treeB/ treeC/ treeD/ dest/
sh-2.05a# cat dest/etc/etctest
treeA
sh-2.05a# /export/local-1/src/rsync-origs/rsync-2.5.5/rsync -a -W --numeric-ids
 --delete treeA/ treeB/ treeC/ treeD/ dest/
sh-2.05a# cat dest/etc/etctest
treeC
sh-2.05a# /export/local-1/src/rsync-origs/rsync-2.5.5/rsync -a -W --numeric-ids
 --delete treeA/ treeB/ treeC/ treeD/ dest/
sh-2.05a# cat dest/etc/etctest
treeA
sh-2.05a#
..and is also wrong, because i expected treeD. if i delete treeC, then
it flipflops between treeA and treeD.
that's clear du to the fact that the current list reference was free'd instead
of the previous list entry, and thus the next while-loop could not detect
that the current reference is also a duplicate of the previous or pre-previous.

[..]

-----

Waye:
Subject: Re: rsync: duplicate name deletion - problems with rsync-2.5.6

> my initial proposal for a fix i posted, as suggested at the project's
> web site, to the rsync mailing list was:
> http://www.mail-archive.com/rsync@lists.samba.org/msg03117.html
> unfortunately, the only response to this issue was "Huh? I don't care
> about this at all."

I wish I had remembered that post before I worked up a patch, but I
didn't.  I recall at the time being surprised that rsync even allowed
--delete to work in a directory-merging situation, but since I wasn't
working on the rsync project at the time, I obviously forgot about it.

One thing I did do when I worked up the patch was to look through CVS.
I discovered that the change occurred in revision 1.26 -- originally
released as rsync 1.7.0.  Because of this, I thought that the safest fix
would be to continue on with the current practice of deleting the second
entry from the list rather than the first.  In tridge's CVS comments he
doesn't mention making this change consciously, but I couldn't discount
that it was intentional, so I left it as it had been for all the recent
versions.  I also didn't bring the issue up on the list, which was a
mistake.

> personaly, i would much appreciate that the last reference of a
> duplicated file wins.

Unfortunately for us, we are doing duplicate-removal on a sorted list of
filenames, so we can't guarantee that the order of duplicate-removal
follows the order specified on the command-line.  For instance, if, in
your following example, you had specified the directories in the order
"treeB/ treeD/ treeA/ treeC/", the current duplicate-removal would leave
treeA's version, and your suggested fix would leave treeD's version. 
The only fix for this would be to override the ASCII path sort with a 
numerical-ID when the names come from the command-line (or something
like that).  I don't know if this would be worth the trouble.

> my suggestion is, to change your fix like in 2).

I think we should discuss this on the list and see what people think.  I
would only hesitate to change it because of how long the change has been
released this way.  On the positive side, this change does not affect
most people's use of rsync, so it would probably be safe.

> [example merge combined with] this bug lead to a flipflop

The flipflop was because rsync wasn't deleting all the duplicates, just
every other extra name out of the sorted list.  This is a bug that got
fixed in the latest release, so the flipflop is now gone, and the result
will always be from treeA (at the moment).

Thanks for bringing this issue up.  If you would care to bring the issue
to the wider audience of the rsync mailing list, you may feel free to
quote any of this reply.  Or just let me know what you think of my
comments.

..wayne..

-----

My answer:

On Wed, Feb 05, 2003 at 11:18:24AM -0800, Wayne Davison wrote:
> On Wed, Feb 05, 2003 at 02:40:25PM +0100, Thomas Osterried wrote:   
> I wish I had remembered that post before I worked up a patch, but I   
> didn't.  I recall at the time being surprised that rsync even allowed
> --delete to work in a directory-merging situation, but since I wasn't
> working on the rsync project at the time, I obviously forgot about it.

;)

when i looked how to build our sync-list, i was glad that rsync, which
we already used in an early "flat hierarchy" version of our project,
could do this "out of the box".
what i mentioned with flipflop, yes this was in pre 2.5.6 versions
(before your official and my private fix).
because it was broken and not determistic, i think nobody used it in a
critical environment - and if (and fixed it), we have a 50/50 chance
if he changed it in your way or in mine ;)

> > personaly, i would much appreciate that the last reference of a
> > duplicated file wins.
> 
> Unfortunately for us, we are doing duplicate-removal on a sorted list of
> filenames, so we can't guarantee that the order of duplicate-removal
> follows the order specified on the command-line.  For instance, if, in

we use this system for abt 1.5 years now. experience by practise is,   
that the sort-algorithm leaves the original priority list intact.
looking at the code, i think you mean the result of qsort() could change
the position of the duplicate files in respect to the source tree
and my "priority list" _may_ change?

> your following example, you had specified the directories in the order
> "treeB/ treeD/ treeA/ treeC/", the current duplicate-removal would leave
> treeA's version, and your suggested fix would leave treeD's version.
> The only fix for this would be to override the ASCII path sort with a
> numerical-ID when the names come from the command-line (or something
> like that).  I don't know if this would be worth the trouble.

no, the alphanumeric name of the source directory has and should have
no influence, but the position in the list.
in the list "treeB/ treeD/ treeA/ treeC/" the result will be
  - treeB with your patch (first reference wins)
  - treeC with my proposal (last reference wins)

let's see (should work):

your rsync:
$ /export/local-1/src/rsync-origs/rsync-2.5.6/rsync -a -W --numeric-ids --delete treeB/ treeD/ treeA/ treeC/ dest/
$ cat dest/etc/etctest
 treeB
$

my patche'd rsync 2.5.5:
$ /export/local-1/src/rsync-2.5.5/rsync -a -W --numeric-ids  --delete treeB/ treeD/ treeA/ treeC/ dest/
$ cat dest/etc/etctest
treeC

$

my proposed change from Appendix 2 of my last mail in your rsync 2.5.6 version:
$ /export/local-1/src/rsync-2.5.6-patched/rsync -a -W --numeric-ids  --delete treeB/ treeD/ treeA/ treeC/ dest/
$ cat dest/etc/etctest
treeC
$

fine.

> > my suggestion is, to change your fix like in 2).
> 
> I think we should discuss this on the list and see what people think.  I
> would only hesitate to change it because of how long the change has been
> released this way.  On the positive side, this change does not affect
> most people's use of rsync, so it would probably be safe.

and because due to the bug the result was -for the last years- inpredictable.

but i agree, i think it's a good idea to post and wait for responses on
the list. i'll do this job.

[..]



More information about the rsync mailing list