Question about --partial-dir and aborted transfers of large files

T.J. Crowder tj at crowdersoftware.com
Fri Aug 10 10:03:53 MDT 2012


Apologies to the list, the title of this thread is completely wrong. It
should be something like "Question about --partial-dir and aborted
transfers of large files". Let's see if this mailing list program will
allow me to change it...

-- T.J.


On 10 August 2012 15:28, T.J. Crowder <tj at crowdersoftware.com> wrote:

> Hi all,
>
> rsync is a fantastic tool. :-) I'm blown away by what I've seen so far.
>
> I have a question about --partial-dir transfers. I've read through this
> thread:
> http://lists.samba.org/archive/rsync/2011-July/026575.html
> ...but while similar, I don't think it's quite the same, and I didn't find
> my answer there.
>
> The short(ish) version:
>
> 1. Am I correct in inferring that when rsync sees data for a file in the
> --partial-dir directory, it applies its delta transfer algorithm to the
> partial file?
>
> 2. And that this is _instead of_ applying it to the real target file? (Not
> a nifty three-way combination.)
>
> If so, a large amount of unnecessary data may end up being transferred in
> the second sync of a large file if the first sync is interrupted. Is there
> an option or some such to address this?
> If not, would it be feasible to add? (Details on how I see that working
> below, and I may be able to pitch in.)
>
> The long version:
>
> Sometimes I need to sync very large files (VM disk images) using ssh,
> during an eight-hour time window. With my connection to the target server,
> eight hours is unlikely to be enough, so I'll have to interrupt the sync
> and continue it in the next day's window. Sometimes, the VM disk image will
> be changed again in the meantime, but this isn't necessary to trigger the
> behavior I mentioned above. (It is a case I'll have to handle.)
>
> I've run a few experiments with rsync in this area, and it looks like it
> causes a fair bit of unnecessary data transfer.
>
> Here's how I caused that:
>
> 1. I created a file of 100,000 lines of text, all exactly the same
> length, and put it in both the source and destination.
>
> 2. In the source copy, I modified the first 20K lines. So roughly 20% of
> the file has been changed. I didn't change the *length* of the lines (in
> any of these experiments), because I'm trying to emulate a VM disk file
> which is conveniently organized into fixed-size blocks.
>
> 3. I started a sync:
>
> rsync -avr --partial-dir=.rstmp src username@server:/dest/
>
> ...and cancelled it part-way through. This leaves a partial file in my
> .rstmp directory as expected. (In my case, just the first few hundred
> lines.)
>
> 4. I restarted the sync, allowing it to complete.
>
> The second sync ended up transferring nearly the entire file: basically
> the whole 100K lines minus the few hundred from the first sync. The 80K
> unchanged lines were transferred, whereas if I hadn't interrupted the first
> sync, they wouldn't have been.
>
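
For anyone wanting to reproduce steps 1 and 2 above, a minimal Python sketch
follows. The file names, the 40-character line width, and the line contents
are assumptions for illustration; the message does not say how the original
test file was built.

```python
import os
import tempfile

LINE_COUNT = 100_000  # total lines, as in step 1
CHANGED = 20_000      # lines modified in step 2
WIDTH = 40            # assumed fixed line length, excluding the newline


def make_line(index, tag):
    """Return a line of exactly WIDTH characters plus a newline."""
    return f"{tag} line {index:06d}".ljust(WIDTH, ".") + "\n"


def write_pair(directory):
    """Write dest.txt (original) and src.txt (first CHANGED lines modified)."""
    dest = os.path.join(directory, "dest.txt")
    src = os.path.join(directory, "src.txt")
    with open(dest, "w", newline="\n") as d, open(src, "w", newline="\n") as s:
        for i in range(LINE_COUNT):
            d.write(make_line(i, "old"))
            # Same length, different content for the first CHANGED lines.
            s.write(make_line(i, "new" if i < CHANGED else "old"))
    return src, dest


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp(prefix="rsync-experiment-")
    src_path, dest_path = write_pair(out_dir)
    print(src_path, os.path.getsize(src_path))
    print(dest_path, os.path.getsize(dest_path))
```

Copy dest.txt to the target machine under the same name as the source file
before running the sync from step 3, so that only the first 20K lines differ.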
> I followed up with this experiment:
>
> 1. Starting with a synced file, I changed 20K lines in the *middle* of the
> file rather than at the beginning.
>
> 2. I started a sync and cancelled it part-way through, after about the
> same amount of time as the previous experiment. This leaves a partial file
> in my .rstmp directory as expected -- but it's a LOT bigger: rsync has
> quite intelligently copied the unchanged beginning of the file locally on
> the target machine, up until the first change, and then transferred the
> changed data after that -- which is when I interrupted it.
>
> 3. I started the sync again and let it continue, and it sent all of the
> rest of the file, the vast majority of which was already present in the
> original target file.
>
> In subsequent experiments, I was able to determine that if I changed part
> of the file that had already been transferred into the partial file (say,
> changing line 1 between steps 2 and 3 above), rsync was very smart about
> that, just transferring the changed bit without re-transferring everything
> in-between. That's why it seems to me it uses the full delta-transfer
> algorithm on the partial -- or at least some version of it.
>
> All of this seems to suggest that the partial file is created by copying
> the target file up to the first change and then applying changes -- but
> that if you interrupt it, because the partial file is shorter than the
> source file, all of the remaining source file is transferred.
>
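
The inference above can be illustrated with a toy model in Python. This is
not rsync's code: it assumes fixed, aligned blocks and ignores the rolling
checksum that lets rsync match data at arbitrary offsets. The 4-byte block
size and the function names are hypothetical. It only shows why a short basis
file leaves most of the source unmatched.

```python
import hashlib

BLOCK = 4  # hypothetical tiny block size, for illustration only


def blocks(data):
    """Split data into fixed-size blocks (the last one may be short)."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def blocks_to_send(source, basis):
    """Count source blocks that have no identical block anywhere in basis."""
    known = {hashlib.sha256(b).digest() for b in blocks(basis)}
    return sum(1 for b in blocks(source)
               if hashlib.sha256(b).digest() not in known)


if __name__ == "__main__":
    # Ten 4-byte blocks on the target; the source changes the first two.
    target = b"".join(b"T%03d" % i for i in range(10))
    source = b"S000S001" + target[8:]

    # Uninterrupted sync: the basis is the full target file.
    print(blocks_to_send(source, target))       # prints 2

    # Interrupted after three blocks: the basis is only the short partial.
    partial = source[:12]
    print(blocks_to_send(source, partial))      # prints 7
```

With the full target as the basis, only the two changed blocks are unmatched.
With the truncated partial as the basis, everything past the interruption
point is unmatched, which mirrors the behavior seen in the experiments.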
> Armed with that information, I tried to box clever: I thought "If I know
> I'm going to be doing one of these big files, maybe I could just copy the
> target to the .rstmp on the target machine in advance, so the
> delta-transfer applies to it." Unfortunately, though, cancelling the
> transfer early truncates the partial file. Drat. It wouldn't have been
> particularly elegant, but still would have been a workaround for now.
>
> If I'm right about all of the above (which I wouldn't put money on), it
> seems like it would be possible to address this in a logically simple way.
> Logically simple doesn't equate to being simple in code, of course. :-) The
> idea being, basically, that when referring to blocks in the target partial
> file (whether for determining the checksum of the block or transferring the
> data), if the target partial file is missing the block entirely, use the
> equivalent block from the actual target file -- so for checksum purposes,
> that tells us whether it changed, and for data transfer purposes if it
> didn't change, we know we can copy it locally on the target server.
>
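
The fallback idea described above can be sketched in Python as follows. This
is only an illustration of the proposal under simplifying assumptions (fixed,
aligned blocks, with a block counted as present in the partial file only if it
is there in full); it does not reflect how rsync is structured internally, and
the names and the 4-byte block size are hypothetical.

```python
BLOCK = 4  # hypothetical block size, for illustration only


def basis_block(partial, target, index):
    """Return block `index` from the partial file if it is fully present,
    otherwise the block at the same offset in the real target file."""
    start = index * BLOCK
    end = start + BLOCK
    if end <= len(partial):
        return partial[start:end]
    return target[start:end]


def combined_basis(partial, target):
    """Assemble the basis a receiver would use under the fallback rule."""
    total = max(len(partial), len(target))
    count = -(-total // BLOCK)  # ceiling division
    return b"".join(basis_block(partial, target, i) for i in range(count))


if __name__ == "__main__":
    target = b"".join(b"T%03d" % i for i in range(10))
    source = b"S000S001" + target[8:]   # first two blocks changed
    partial = source[:12]               # sync interrupted after three blocks

    basis = combined_basis(partial, target)
    # Every remaining source block is already available on the target side.
    print(basis == source)              # prints True
```

In this toy case the combined basis already equals the source, so a resumed
sync would have nothing left to send beyond checksums.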
> If there isn't already an option to address this, would it be feasible to
> do? I may be able to pitch in if so.
>
> Thanks in advance,
> --
> T.J. Crowder
>