Question about --partial-dir and outdated data

Fri Aug 10 08:28:08 MDT 2012

Hi all,

rsync is a fantastic tool. :-) I'm blown away with what I've seen so far.

I have a question about --partial-dir transfers. I've read through this
thread:
http://lists.samba.org/archive/rsync/2011-July/026575.html
...but while similar, I don't think it's quite the same, and I didn't find
my answer there.

The short(ish) version:

1. Am I correct in inferring that when rsync sees data for a file in the
--partial-dir directory, it applies its delta transfer algorithm to the
partial file?

2. And that this is _instead of_ applying it to the real target file? (Not
a nifty three-way combination.)

If so, it would appear that this means a large amount of unnecessary data
may end up being transferred in the second sync of a large file if you
interrupt the first sync. Is there an option or some such to address this?
If not, would it be feasible to add? (Details on how I see that working
below, and I may be able to pitch in.)

The long version:

Sometimes I need to sync very large files (VM disk images) using ssh,
during an eight-hour time window. With my connection to the target server,
eight hours is unlikely to be enough, so I'll have to interrupt the sync
and continue it in the next day's window. Sometimes, the VM disk image will
be changed again in the meantime, but this isn't necessary to trigger the
behavior I mentioned above. (It is a case I'll have to handle.)

I've run a few experiments with rsync in this area, and it looks like
interrupting a sync causes a fair bit of unnecessary data transfer.

Here's how I caused that:

1. I created a file with 100,000 lines of text with exactly the same
length, and put it in both the source and destination.

2. In the source copy, I modified the first 20K lines. So roughly 20% of
the file has been changed. I didn't change the *length* of the lines (in
any of these experiments), because I'm trying to emulate a VM disk file
which is conveniently organized into fixed-size blocks.

3. I started a sync:

rsync -avr --partial-dir=.rstmp src username@server:/dest/

...and cancelled it part-way through. This leaves a partial file in my
.rstmp directory as expected. (In my case, just the first few hundred
lines.)

4. I restarted the sync, allowing it to complete.

The second sync ended up transferring nearly the entire file, basically the
whole 100K lines minus the few hundred from the first sync. The 80K
unchanged lines were transferred, whereas if I hadn't interrupted the first
sync, they wouldn't have been.
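
For anyone who wants to reproduce steps 1 and 2, here is a rough Python
sketch of how such a test file could be built. The line width, the file
names and the helper names are assumptions made up for illustration; the
only properties that matter are that every line has the same length and
that edits never change that length.

```python
import os
import tempfile

LINE_COUNT = 100_000   # as in the experiment
LINE_WIDTH = 64        # assumed width; any fixed width works


def make_line(index, tag="orig"):
    """Return one fixed-width line, including the trailing newline."""
    body = f"{index:08d} {tag}"
    return body.ljust(LINE_WIDTH - 1, ".") + "\n"


def write_file(path, changed=range(0)):
    """Write LINE_COUNT lines; indexes in `changed` get different content."""
    changed = set(changed)
    with open(path, "w", newline="\n") as f:
        for i in range(LINE_COUNT):
            f.write(make_line(i, "edit" if i in changed else "orig"))


def count_differing_lines(path_a, path_b):
    """Count line positions at which the two files differ."""
    with open(path_a) as a, open(path_b) as b:
        return sum(1 for x, y in zip(a, b) if x != y)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    dest = os.path.join(workdir, "dest.txt")   # hypothetical file names
    src = os.path.join(workdir, "src.txt")
    write_file(dest)                           # step 1: identical start
    write_file(src, changed=range(20_000))     # step 2: edit first 20K lines
    print("same size:", os.path.getsize(src) == os.path.getsize(dest))
    print("differing lines:", count_differing_lines(src, dest))
```

Passing changed=range(40_000, 60_000) instead would give the "changes in
the middle" variant used in the follow-up experiment.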

I followed up with this experiment:

1. Starting with a synced file, I changed 20K lines in the *middle* of the
file rather than at the beginning.

2. I started a sync and cancelled it part-way through, after about the same
amount of time as the previous experiment. This leaves a partial file in my
.rstmp directory as expected -- but it's a LOT bigger. rsync has quite
intelligently copied the unchanged beginning of the file locally on the
target machine, up until the first change, and then started transferring
the changed data after that -- which is when I interrupted it.

3. I started the sync again and let it continue, and it sent all of the
rest of the file, the vast majority of which was already present in the
original target file.

In subsequent experiments, I was able to determine that if I changed part
of the file that had already been transferred into the partial file (say,
changing line 1 between steps 2 and 3 above), rsync was very smart about
that, just transferring the changed bit without re-transferring everything
in-between. That's why it seems to me it uses the full delta-transfer
algorithm on the partial -- or at least some version of it.

All of this seems to suggest that the partial file is created by copying
the target file up to the first change and then applying changes -- but
that if you interrupt it, because the partial file is shorter than the
source file, all of the remaining source file is transferred.
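
To make that hypothesis concrete, here is a toy Python model of it. This
is only a sketch of the behavior I think I'm observing, not rsync's actual
algorithm: it assumes fixed-size blocks compared at the same offset only
(real rsync can match blocks at other offsets), and the block size and
sample data are made up.

```python
BLOCK = 4  # toy block size in bytes; purely illustrative


def blocks(data):
    """Split data into consecutive BLOCK-sized chunks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def literal_bytes(source, basis):
    """Bytes sent if `basis` is the only file the source is matched against.

    Toy rule: a source block costs nothing when the basis holds an
    identical block at the same offset; otherwise it is sent in full.
    """
    basis_blocks = blocks(basis)
    sent = 0
    for i, block in enumerate(blocks(source)):
        if i < len(basis_blocks) and basis_blocks[i] == block:
            continue
        sent += len(block)
    return sent


target = b"AAAABBBBCCCCDDDDEEEE"    # old copy on the destination
source = b"aaaaBBBBCCCCDDDDEEEE"    # first block changed on the sender
partial = source[:BLOCK]            # interrupted run got one block across

print(literal_bytes(source, target))    # uninterrupted run
print(literal_bytes(source, partial))   # resumed run, partial as basis
```

In this model the uninterrupted run sends only the one changed block,
while the resumed run sends every block past the end of the short partial
file, even though the old target file already holds identical data.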

Armed with that information, I tried to box clever: I thought "If I know
I'm going to be doing one of these big files, maybe I could just copy the
target to the .rstmp on the target machine in advance, so the
delta-transfer applies to it." Unfortunately, though, cancelling the
transfer early truncates the partial file. Drat. It wouldn't have been
particularly elegant, but still would have been a workaround for now.

If I'm right about all of the above (which I wouldn't put money on), it
seems like it would be possible to address this in a logically simple way.
Logically simple doesn't equate to being simple in code, of course. :-) The
idea, basically, is that when referring to blocks in the target partial
file (whether for determining the checksum of the block or transferring the
data), if the target partial file is missing the block entirely, rsync
would use the equivalent block from the actual target file. For checksum
purposes, that tells us whether the block changed; for data transfer
purposes, if it didn't change, we know we can copy it locally on the target
server.
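
A minimal Python sketch of that fallback rule, under the same
simplifications as my experiments (fixed-size blocks, compared at the same
offset only). The function names, block size and sample data are
hypothetical, and none of this is taken from rsync's code.

```python
BLOCK = 4  # toy block size in bytes; purely illustrative


def block_at(data, index):
    """Return block `index` of data, or None if data has no bytes there."""
    chunk = data[index * BLOCK:(index + 1) * BLOCK]
    return chunk if chunk else None


def literal_bytes_with_fallback(source, partial, target):
    """Bytes sent when blocks missing from the partial come from the target.

    A block that is present in the partial file but cut off mid-block is
    simply treated as a mismatch in this sketch.
    """
    sent = 0
    for offset in range(0, len(source), BLOCK):
        index = offset // BLOCK
        block = source[offset:offset + BLOCK]
        basis = block_at(partial, index)
        if basis is None:                  # partial file ends before this block
            basis = block_at(target, index)
        if basis != block:                 # no usable local copy: send it
            sent += len(block)
    return sent


target = b"AAAABBBBCCCCDDDDEEEE"    # old copy on the destination
source = b"aaaaBBBBCCCCDDDDEEEE"    # first block changed on the sender
partial = source[:BLOCK]            # what the interrupted run left behind

print(literal_bytes_with_fallback(source, partial, target))
```

With this rule the resumed run in the toy example sends nothing further:
the changed block is already in the partial file, and the remaining blocks
are found unchanged in the old target file.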

If there isn't already an option to address this, would it be feasible to
do? I may be able to pitch in if so.

Thanks in advance,
--
T.J. Crowder