Reducing network usage even more

Fri Dec 23 12:23:04 UTC 2016

Hello,

There's a certain use case where rsync cannot be as bandwidth-efficient
as possible. I was therefore wondering whether rsync could be extended
(albeit not trivially) towards a more content-addressable mode of
operation, or whether there's an application around that would serve
me better.

Situation: Synchronise huge file trees over a rather thin link.
Additionally, the receiving side very likely already has a file with
the same content, but under a different name and/or path.

The idea that comes to mind is something like a pool of files on the
receiving side, indexed by content (probably: hashsum), holding
everything that has already been seen. The receiving rsync could then
compare an incoming file not only against the file at the given path
but also against every file in that pool. Care has to be taken that
this happens in an efficient way. And this will probably require a
protocol extension as well.
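
To make the idea a bit more concrete, here is a minimal Python sketch of
the exchange I have in mind. It is not how rsync works today, and all
names (file_digest, build_manifest, plan_transfer) as well as the choice
of SHA-256 are assumptions made for illustration only: the sender
describes its tree as path -> digest, and the receiver checks those
digests against what its pool already holds.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Sender side (hypothetical): map each relative path to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def plan_transfer(manifest, known_digests):
    """Receiver side (hypothetical): split the manifest into paths the
    pool can satisfy locally and digests that must cross the link."""
    local = {p: d for p, d in manifest.items() if d in known_digests}
    missing = {d for d in manifest.values() if d not in known_digests}
    return local, missing
```

Only the digests in "missing" would need to be transferred, and each of
them only once, however many paths refer to it. A lookup in a set of
digests is cheap, so the expensive part is hashing on both sides, which
a real implementation would presumably cache.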

I'd also need the ability to add files to that pool manually:
Sometimes the receiving side could get the files from a third party
at a higher speed.
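
Seeding the pool by hand could look roughly like the following Python
sketch. Again, this is only an assumption about how such a pool might be
laid out (one file per SHA-256 digest, fanned out by the first two hex
characters); pool_path and seed_pool are made-up names, not an existing
tool.

```python
import hashlib
import os
import shutil


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def pool_path(pool_dir, digest):
    """Hypothetical layout: <pool>/<first two hex chars>/<full digest>."""
    return os.path.join(pool_dir, digest[:2], digest)


def seed_pool(source_dir, pool_dir):
    """Add every file under source_dir to the pool, keyed by content.
    Returns the number of new pool entries. pool_dir is assumed to lie
    outside source_dir."""
    added = 0
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = pool_path(pool_dir, file_digest(src))
            if os.path.exists(dst):
                continue  # content already pooled under this digest
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            tmp = dst + ".tmp"
            shutil.copyfile(src, tmp)
            os.replace(tmp, dst)  # rename so a partial copy never appears as an entry
            added += 1
    return added
```

Running seed_pool() over a directory obtained from the third party would
make its contents available to later transfers, no matter under which
names the sender has the files.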

One might argue this is a job for git; that's however not an option
here. To start with, git doesn't scale very well to a million files.
Another idea was to switch to a FUSE filesystem that uses
content-addressable storage, then use rsync to replicate the lower
layer (basically files by hashsum and a database), but that one would
have to be rock-solid then.

    Christoph