data deduplication

Tue May 25 04:54:28 MDT 2010

At 06:41 25.05.2010 -0400, Mag Gam wrote:
>I know rsync can do many things but I was wondering if anyone is using
>it for data deduplication on a large filesystem. I have a filesystem
>which is about 2TB and I want to make sure I don't have the same data
>in a different place of a filesystem. Is there an algorithm for that?

For just finding identical files you are probably better off with SHA, MD5
or another hash. Even when two hashes match I would check before removing
one of the files, or at least do a full byte-by-byte compare once the
hashes match. I don't think rsync is up to the task, unless you want to,
for example, merge two slightly different trees and then delete the
remainder. rsync could help in that case, but only if the file names
match. If you want to find identical files under different names, I don't
think you can do it with rsync.
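
A rough sketch of the hash-then-compare idea in Python, in case it helps.
Everything in it is an assumption for illustration: the function names
(file_digest, find_duplicates), the choice of SHA-256, and grouping by
file size first so that files with a unique size are never hashed. It
only reports groups of identical files and deletes nothing. Hard links to
the same inode will also show up as duplicates.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of groups (lists of paths) with identical content."""
    # Step 1: group regular files by size; a unique size means no duplicate.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    # Step 2: within each size group, group by content hash.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[(size, file_digest(path))].append(path)

    # Step 3: matching hashes are only candidates; confirm each group
    # with a full byte-by-byte compare before reporting it.
    groups = []
    for paths in by_hash.values():
        remaining = sorted(paths)
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            same = [first] + [
                p for p in rest if filecmp.cmp(first, p, shallow=False)
            ]
            if len(same) > 1:
                groups.append(same)
            remaining = [p for p in rest if p not in same]
    return groups
```

Calling find_duplicates("/some/tree") returns something like
[["/some/tree/a", "/some/tree/sub/b"], ...]. On a 2TB filesystem the
hashing pass will dominate the run time, which is why the size check
comes first.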

bye  Fabi