about rsyncing of block devices

Eliot Moss moss at cs.umass.edu
Tue Jun 15 09:36:00 MDT 2010


So this is an interesting problem to think about.
It seems to take this form:

- Data comes in fixed-size blocks
- Blocks may be copied elsewhere
- Blocks may be modified in place
- Blocks may be copied with modification

BUT

- There are no insertions/deletions
- The overall file does not change in
   size (or does so in terms of whole blocks)

This suggests restarting the "match with
preceding data in the same file" algorithm
at each block boundary. Also, it could match
either data at lower address or higher addresses,
but assuming we progress from lower to higher,
the "lower address" data obviously has to be
from the new version of the file and "higher
address" data from the old version.

A simple form of the algorithm would attempt
to produce, in order, a directive for each block
that would (a) leave the block as is, (b) copy
the block from another block on the destination,
or (c) provide new data for the block.

Case (c) could be further refined to analyze the
block using the traditional rsync algorithm to
reduce the number of bytes transferred.

While rsync obviously works for arbitrary data,
it is organized to detect similarities at any
offset, and thus to handle insertion/deletion
of bytes well.

I further observe that *in place* update, if
performed in a predetermined order, such as low
to high addresses, deals less efficiently with
some shufflings of data than with others.  In
particular, if an application copies from data
from low addresses to high ones, and the algorithm
also processes in that same order, then the copied
data will have to be resynthesized/transferred,
since the old copy will have been written over
at the destination before we get to place for the
new copy.

It would be possible in principle to do in-place
update *and* to use a scratch temp file at the
destination to hold data that you were about to
overwrite and will want later in the process.
This avoid network bandwidth at the cost of I/O
bandwidth / space at the destination.  I think
the maximum space required is half the size of
the old version of the file at the destination,
though the simplest use of the temp file might
lead to an upper bound of the size of the target
file (imagine a cyclic shuffling of blocks,
where all but the last move up one "slot" and
the last moves to the first slot).  Maybe
rsync already does this?

I don't know enough about the internals of
rsync to know what all this might suggest ...

Regards -- Eliot Moss


More information about the rsync mailing list