Avoiding transferring duplicate files
tdb01r at ecs.soton.ac.uk
Tue Feb 12 11:53:45 GMT 2008
I have a 75GB collection of data, including a lot of duplicated files,
on an NTFS network drive. I want to back up that data across a DSL link
to a Linux host. Currently I use cwrsync on a Windows machine acting as
the server for the Linux rsync client.
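For reference, the pull looks roughly like this (the hostname "winbox"
and module name "data" are placeholders, not my real setup):

rsync -avz winbox::data/ /srv/backup/data/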
I want to avoid transferring duplicated data, as the DSL link is a far
more significant bottleneck than computation or disk I/O. I can't work
out whether rsync (or any patch to it) can be made smart enough to spot
duplicate files regardless of where they sit in the tree (the way
fdupes or similar tools do). Because the data is coming off a network
drive there's no way I can hard-link (or the NTFS equivalent) the
duplicates in the source tree, so the de-duplication needs to happen as
part of the transfer itself.
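To illustrate, the kind of blanket scan I mean is what fdupes does
against a local mount of the share ("/mnt/share" is a made-up path):

fdupes -r /mnt/share

or the same thing by hand with checksums:

find /mnt/share -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

Either of those finds the duplicate groups, but it doesn't stop rsync
from sending every copy over the link.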
I've tried using the --detect-renamed patch on 3.0.0 in the following
(made-up) setup, where the source tree holds three identical copies of
one file:
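src/tardis.mp3
src/tardis2.mp3
src/dup/tardis.mp3

Running: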
../rsync-3.0.0pre9/rsync -avi --detect-renamed --fuzzy --checksum src/ dest/
building file list ... done
sent 167076 bytes
That's 3x the size of "tardis.mp3", so all three copies were sent in
full.
If I remove tardis2.mp3:
sent 536 bytes received 526 bytes 193.09 bytes/sec
If I remove dup/tardis.mp3:
sent 55801 bytes received 34 bytes 111670.00 bytes/sec
I've found some threads about duplicate files and about the
--detect-renamed bug mentioned above, but nothing specifically about
doing a blanket search for duplicates along the lines of fdupes.
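The closest thing to a workaround I can sketch is pre-seeding the
destination from checksum manifests, so that a subsequent rsync run
with --checksum finds identical content already in place and skips it.
All file and directory names below are made up for illustration, and it
assumes find/md5sum are available on the Windows side (e.g. via
cygwin):

# Build manifests of the source tree (on the Windows box) and of the
# existing dest tree (on the Linux box).
( cd src  && find . -type f -exec md5sum {} + ) > src.md5
( cd dest && find . -type f -exec md5sum {} + ) > dest.md5

# For every source file whose content already exists somewhere under
# dest (same checksum), copy the local twin into the expected path so
# nothing needs to cross the DSL link.
while read -r sum path; do
    twin=$(grep -m1 "^$sum " dest.md5 | cut -c35-)
    if [ -n "$twin" ] && [ ! -e "dest/$path" ]; then
        mkdir -p "dest/$(dirname "$path")"
        cp "dest/$twin" "dest/$path"
    fi
done < src.md5

But that's an ugly out-of-band step; I was hoping rsync itself (or a
patch) could do the matching.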
Any suggestions would be helpful.