Question about --partial-dir and aborted transfers of large files

T.J. Crowder tj at
Sun Aug 12 12:33:51 MDT 2012


Thanks for that!

On 12 August 2012 18:41, Wayne Davison <wayned at> wrote:

> I have imagined making the code pretend that the partial file and any
> destination file are concatenated together for the purpose of generating
> checksums.  That would allow content references to both files, but rsync
> would need to be enhanced to open both files in both the generator and the
> receiver and be able to figure out what read goes where (which shouldn't be
> too hard).  I'd suggest that the code read the partial file first, padding
> out the end of its data to an even checksum-sized unit so that the
> destination file starts on a even checksum boundary (so that the code never
> needs to combine data from two files in a single checksum or copy
> reference).

So if I'm inspired and somehow magically find the time, it's at least
clear where to start.

I'm not seeing why the generator would need to be different, though; the
receiver would be doing the see-through magic (treating the partial as
though it were overlaid on the beginning of the target).
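(For what it's worth, the even-checksum-boundary padding you suggest is
just rounding the partial's length up to the next multiple of the block
size. A quick Python sketch of the idea -- the names are mine, not
rsync's internals:)

```python
# Sketch: round a partial file's length up to the next checksum-block
# boundary, so the destination's data starts on an even block boundary.
# (Function name and block size are illustrative, not rsync's code.)

def padded_length(partial_len, block_size):
    """Smallest multiple of block_size that is >= partial_len."""
    return ((partial_len + block_size - 1) // block_size) * block_size
```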

> If so, it would appear that this means a large amount of unnecessary data
>> may end up being transferred in the second sync of a large file if you
>> interrupt the first sync.
> It all depends on where you interrupt it and how much data matches in the
> remaining portion of the destination file.  It does give you the option of
> discarding the partial data if it is too short to be useful, or possibly
> doing your own concatenation of the whole (or trailing portion) of the
> destination file onto the partial file, should you want to tweak things
> before resuming the transfer.

Ah, yes, I _nearly_ got there, didn't I, with my "boxing clever"
workaround. If one knows one's in this situation, just append data from the
target file to the partial file to fill in the missing bits (e.g., if the
target is 100K and the partial is 20K, append the _last_ 80K of target to
partial), and when rsync runs it'll only send what it has to. A C program
to recursively walk a tree and do that on the selected partials where it
makes sense (e.g., my VM HDD files) and not on others (which might have
deletions or insertions) is probably 20-30 lines of code.
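(The core append step is even shorter than that. A Python sketch, under
the assumption that the file is only ever modified in place -- the
function name and chunk size are mine:)

```python
import os

def fill_partial_from_target(partial_path, target_path):
    """Append the tail of the target file to a shorter partial file,
    so a resumed rsync only sends blocks that actually differ.
    Only sensible for files modified in place, never shifted."""
    partial_len = os.path.getsize(partial_path)
    target_len = os.path.getsize(target_path)
    if partial_len >= target_len:
        return  # nothing to fill in
    with open(target_path, "rb") as target, \
         open(partial_path, "ab") as partial:
        target.seek(partial_len)
        # copy the remaining (target_len - partial_len) bytes in chunks
        while True:
            chunk = target.read(1 << 16)
            if not chunk:
                break
            partial.write(chunk)
```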

On 12 August 2012 19:08, Wayne Davison <wayned at> wrote:

> On Sun, Aug 12, 2012 at 10:41 AM, Wayne Davison <wayned at> wrote:
>> I have imagined making the code pretend that the partial file and any
>> destination file are concatenated together for the purpose of generating
>> checksums.
> Actually, that could be bad if the destination and partial file are both
> huge.  What would be better would be to send just the size of the
> destination file in checksums, but overlay the start of the destination's
> data with the partial-file's data (and just ignore any partial-block from
> the end of the partial file).

Yes, I wasn't thinking concatenation, but more like what LVM and similar do
with snapshots: The partial file is a bunch of snapshot blocks with the
curious property of only being at the beginning of the file. So given a
file with 50K blocks, and a partial with 20K blocks, the code would view
the combined result as the first 20K blocks of the partial followed by the
subsequent 30K blocks from the target. (Hence my "see-through"
terminology earlier.)
E.g., resorting to ASCII art, the receiver code sees a virtual file:

                       | partial file |
+--------------+       +--------------+ +--------------+
| virtual file |  +--->| Blks 0-9K    | |  target file |
+--------------+  | +->| Blks 10K-19K | +--------------+
| Blks 0-9K    |--+ |  +--------------+ | Blks 0-9K    |
| Blks 10K-19K |----+                   | Blks 10K-19K |
| Blks 20K-29K |----------------------->| Blks 20K-29K |
| Blks 30K-39K |----------------------->| Blks 30K-39K |
| Blks 40K-49K |----------------------->| Blks 40K-49K |
+--------------+                        +--------------+

The receiver would perform checksums against that virtual file, and when
it's time to copy a block: if the block needs to be transferred, do that;
if not, grab it from the target file.
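(The see-through read itself is trivial. A Python sketch of the idea --
not rsync's code, and it also drops any trailing partial-block from the
partial file, per your second message:)

```python
def read_virtual_block(partial, target, block_size, block_index):
    """Read one block of the "virtual file": blocks covered by the
    partial file come from it; everything past its end comes from
    the target.  `partial` and `target` are open binary files.
    Purely illustrative, not rsync's actual internals."""
    offset = block_index * block_size
    partial.seek(0, 2)               # seek to end to learn its length
    partial_len = partial.tell()
    # ignore any trailing partial-block at the end of the partial file
    usable = (partial_len // block_size) * block_size
    src = partial if offset < usable else target
    src.seek(offset)
    return src.read(block_size)
```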

Again, this all really only applies in the simple case of files that are
nice, discrete blocks of data. Not knowing the delta algorithm, I have no
idea what would happen if the above were applied to a file that got (say)
5K of blocks deleted at the beginning followed by 1K blocks of inserted
data. The virtual file would appear to have duplicated data in that case,
which the delta algorithm would then have to cope with or discard. I
wouldn't be too surprised to find that it led to inefficiency for other
types of files.

Thanks again,

-- T.J.