Avoiding transferring duplicate files
tdb01r at ecs.soton.ac.uk
Tue Feb 12 11:53:45 GMT 2008
I have a 75GB collection of data, including a lot of duplicated files,
on an NTFS network drive. I want to back up that data across a DSL link
to a Linux host. Currently I use cwrsync on a Windows machine acting as
the server for the Linux rsync client.
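For reference, the pull looks roughly like this (the hostname "winbox"
and module name "data" are placeholders, not my real setup):

rsync -avz winbox::data/ /srv/backup/data/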
I want to avoid transferring duplicated data, as the DSL link is a far
more significant bottleneck than computation or disk I/O. I can't work
out whether rsync (or any patch to it) can be made smart enough to spot
duplicate files regardless of where they sit in the tree (the way
fdupes or similar tools do). Because the data is coming off a network
drive there's no way I can hard-link (or the NTFS equivalent) the
duplicates in the source tree, so the de-duplication needs to happen as
part of the transfer itself.
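To illustrate, the kind of blanket scan I mean is what fdupes does
against a local mount of the share ("/mnt/share" is a made-up path):

fdupes -r /mnt/share

or the same thing by hand with checksums:

find /mnt/share -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

Either of those finds the duplicate groups, but it doesn't stop rsync
from sending every copy over the link.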
I've tried using the --detect-renamed patch on 3.0.0 in the following
(made-up) setup, where the source tree holds three identical copies of
one file:
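src/tardis.mp3
src/tardis2.mp3
src/dup/tardis.mp3

Running: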
../rsync-3.0.0pre9/rsync -avi --detect-renamed --fuzzy --checksum src/ dest/
building file list ... done
sent 167076 bytes
That's 3x the size of "tardis.mp3", so all three copies were sent in
full.
If I remove tardis2.mp3:
sent 536 bytes received 526 bytes 193.09 bytes/sec
If I remove dup/tardis.mp3:
sent 55801 bytes received 34 bytes 111670.00 bytes/sec
I've found some threads about duplicate files and about the
--detect-renamed bug mentioned above, but nothing specifically about
doing a blanket search for duplicates along the lines of fdupes.
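The closest thing to a workaround I can sketch is pre-seeding the
destination from checksum manifests, so that a subsequent rsync run
with --checksum finds identical content already in place and skips it.
All file and directory names below are made up for illustration, and it
assumes find/md5sum are available on the Windows side (e.g. via
cygwin):

# Build manifests of the source tree (on the Windows box) and of the
# existing dest tree (on the Linux box).
( cd src  && find . -type f -exec md5sum {} + ) > src.md5
( cd dest && find . -type f -exec md5sum {} + ) > dest.md5

# For every source file whose content already exists somewhere under
# dest (same checksum), copy the local twin into the expected path so
# nothing needs to cross the DSL link.
while read -r sum path; do
    twin=$(grep -m1 "^$sum " dest.md5 | cut -c35-)
    if [ -n "$twin" ] && [ ! -e "dest/$path" ]; then
        mkdir -p "dest/$(dirname "$path")"
        cp "dest/$twin" "dest/$path"
    fi
done < src.md5

But that's an ugly out-of-band step; I was hoping rsync itself (or a
patch) could do the matching.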
Any suggestions would be helpful.