Possibility of merging rsync and tar

Matt McCutchen hashproduct at verizon.net
Sun Sep 11 22:21:13 GMT 2005


Hey rsync people,

Here's a really radical idea and a possible future direction for the
rsync project to explore.

It occurs to me that tar and rsync are closely related in their
purposes.  "tar -c (blah) | tar -x" can be used to copy files; rsync's
setup with a sender process and a receiver process is strikingly
similar.

The only major conceptual difference is that the rsync protocol uses
two-way communication to transmit only what has changed, while tar
always transmits a complete snapshot of a collection of files.

Since both tar and rsync read and write filesystems in great detail,
they have many analogous sections of source code.  For example, both
programs set permissions on received/extracted files in two passes:
first by supplying a mode to "open" and then with an explicit "chmod".
Many options correspond, and not just the obvious "preserve-this,
preserve-that" ones: the sending-end "--chmod" option that can be added
to rsync with a distributed patch is analogous to tar's "--mode" option.
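To make the parallel concrete, here is a minimal sketch of that two-pass
pattern in C; the function is illustrative and not taken from either
codebase.  open() applies the process umask to the mode it is given, so
the explicit fchmod() afterward is what guarantees the exact bits:

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Illustrative only -- not code from rsync or tar. */
    int create_with_exact_mode(const char *path, mode_t mode)
    {
        /* Pass 1: hand the desired mode to open(); the umask may strip bits. */
        int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
        if (fd < 0)
            return -1;

        /* ... write the received/extracted data here ... */

        /* Pass 2: set the mode explicitly so the result matches exactly. */
        if (fchmod(fd, mode) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }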

So I am led to ask: is there a practical way to merge tar and rsync into
one program whose focus is capturing and recreating collections of
files?  This program could be invoked in many different ways to copy
between archives and files, both local and remote.

But something tells me that it would be a pain to get this program to
communicate in two "modes": a two-way protocol and a complete snapshot.  I
am inclined to use a concept that has already appeared plenty in
pluggable multimedia systems: multiple "sources" and "sinks".  Let's
standardize on a single two-way protocol based on the rsync one.  Then
there can be a filesystem source, an archive source, a filesystem sink,
and an archive sink.  The user can then run an arbitrary source and an
arbitrary sink, possibly on different machines, and they communicate
through a pipe using the two-way protocol.
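To give a flavor of what such a split might look like, here is a
hypothetical C interface for sources and sinks.  None of these types or
names exist in rsync today; they are just a sketch of the idea:

    #include <stddef.h>
    #include <sys/types.h>

    struct file_entry {
        const char *path;
        mode_t      mode;
        off_t       size;
        /* ... ownership, timestamps, checksums, ... */
    };

    /* A source walks its backing store (filesystem, archive, ...) and
     * streams entries and file data over the two-way protocol. */
    struct source_ops {
        int     (*open)(void *ctx, const char *spec);
        int     (*next_entry)(void *ctx, struct file_entry *out);
        ssize_t (*read_data)(void *ctx, const struct file_entry *e,
                             void *buf, size_t len);
        void    (*close)(void *ctx);
    };

    /* A sink decides, per entry, whether it already has matching data
     * (a filesystem or an old archive can say yes; a write-only archive
     * streaming to standard output cannot). */
    struct sink_ops {
        int  (*open)(void *ctx, const char *spec);
        int  (*have_matching)(void *ctx, const struct file_entry *e);
        int  (*write_data)(void *ctx, const struct file_entry *e,
                           const void *buf, size_t len); /* may be called repeatedly */
        void (*close)(void *ctx);
    };

The core would then simply pair an arbitrary source with an arbitrary sink
and run the protocol between them.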

The difference between the sinks is that, when the source asks "do you
already have a file at path X with the same checksum as mine?", the
archive sink will say "umm...give me the whole file, please" while the
filesystem sink might say "yes, I do".  Actually, an archive sink whose
options dictate that
it send the archive to _standard_output_ will always behave in this
fashion, but an archive sink that has an old archive to consult might be
able to optimize away much of the transmission.  One could even
synchronize changes between two archive files on different machines,
possibly translating permissions or other attributes in the process.
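Sketching that exchange in terms of the hypothetical interface above
(this reuses struct file_entry and struct sink_ops from the earlier
sketch, and the protocol helpers recv_next_entry, send_have_it,
send_want_whole, and recv_file_data are likewise invented):

    int     recv_next_entry(int fd, struct file_entry *e);
    void    send_have_it(int fd, const struct file_entry *e);
    void    send_want_whole(int fd, const struct file_entry *e);
    ssize_t recv_file_data(int fd, void *buf, size_t len);

    void receive_all(const struct sink_ops *sink, void *sink_ctx, int proto_fd)
    {
        struct file_entry e;
        char buf[1 << 16];
        ssize_t n;

        while (recv_next_entry(proto_fd, &e) == 0) {
            if (sink->have_matching(sink_ctx, &e)) {
                /* Filesystem sink, or archive sink with an old archive
                 * to consult: "yes, I do". */
                send_have_it(proto_fd, &e);
            } else {
                /* Archive sink writing to standard output:
                 * "give me the whole file, please". */
                send_want_whole(proto_fd, &e);
                while ((n = recv_file_data(proto_fd, buf, sizeof buf)) > 0)
                    sink->write_data(sink_ctx, &e, buf, (size_t)n);
            }
        }
    }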

Of course, there could be different sources and sinks for different
archive formats and maybe even for exotic filesystems that don't use the
POSIX interface.  These could even be packaged separately in shared
libraries and loaded by an rsync core that carries out the standard
protocol.
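Loading such a sink from a shared library could look roughly like this;
the "get_sink_ops" entry point is a made-up convention, not anything
rsync defines:

    #include <dlfcn.h>
    #include <stdio.h>

    struct sink_ops;   /* as sketched above */

    const struct sink_ops *load_sink_plugin(const char *sofile)
    {
        void *handle = dlopen(sofile, RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return NULL;
        }

        /* Look up a well-known symbol that returns the plug-in's vtable. */
        const struct sink_ops *(*get_ops)(void) =
            (const struct sink_ops *(*)(void))dlsym(handle, "get_sink_ops");
        if (!get_ops) {
            fprintf(stderr, "dlsym: %s\n", dlerror());
            dlclose(handle);
            return NULL;
        }
        return get_ops();
    }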

There's one obvious drawback to this approach.  It will probably be
noticeably slower than traditional tar at simple archiving operations
since the data has to pass through one more pipe.  Shared memory, or even
running both source and sink in the same process, would likely help for
local transfers.
-- 
Matt McCutchen, ``hashproduct''
hashproduct at verizon.net -- http://mysite.verizon.net/hashproduct/


