Large files and symlinks

Thu Jul 31 11:21:55 EST 2003

On Thu, 2003-07-31 at 10:01, jw schultz wrote:
> On Thu, Jul 31, 2003 at 09:22:51AM +1000, Donovan Baarda wrote:
> > On Thu, 2003-07-31 at 06:53, jw schultz wrote:
> > [...]
> > > In many cases invoking --partial is worse than not.  If you
> > > are rsyncing a 4GB file and transfer is interrupted after
> > > 500MB has been synced you get a 500MB file which now has
> > > less in common with the source than the 4GB file did.  
> > 
> > A more useful behaviour for --partial would be to concatenate the
> > partial download to the end of the old "basis", rather than replace
> > it... this leaves you with a much more useful "partial" result to resume
> > from.
> > 
> > Of course this behaviour could be _very_ confusing to people... :-)
> 
> Interesting idea.  I don't know that it would be all that confusing.
> 
> You'd have to truncate the basis to the length of the source
> to prevent it growing with each failure.  Even appending
> 3.5GB to a 4GB file once is problematic.

The simplest solution is to write the partial download over the
beginning of the old file, leaving the end part as it was.

This way you are making the sensible assumption that the partial download
was built mostly from matches near the start of the old file, and that the
untouched remainder will supply the matches for the rest when you resume. A
match locality heuristic?

I suspect this will be less confusing for end users too... the partial
download will have its size unchanged, and when you look at the data
you will be able to see that it "synchronised up to point xxx". It will
look like a partial in-place update.
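
As a rough illustration of the overwrite-the-beginning idea, here is a
minimal Python sketch. It is not how rsync does anything; the function name
overlay_partial and the chunk_size parameter are made up for the example,
and it assumes the basis file already exists.

```python
def overlay_partial(basis_path, partial_path, chunk_size=64 * 1024):
    """Write the partial download over the start of the basis file.

    Hypothetical helper, not part of rsync. The tail of the basis beyond
    the partial's length is left as it was. Returns the number of bytes
    overwritten, i.e. the "synchronised up to" point.
    """
    written = 0
    # "r+b" opens an existing file for update without truncating it,
    # with the file position at offset 0.
    with open(partial_path, "rb") as src, open(basis_path, "r+b") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
    return written
```

With a 10-byte basis of "a" and a 4-byte partial of "B", the basis ends up
as "BBBBaaaaaa" and the function returns 4. The file only grows if the
partial is longer than the basis.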

> If we were to append to the existing file it might make
> sense to append only those portions that were updates.
> That would require keeping track of the offset+length of
> each change block.  Yuck, that is much more work than it is
> worth.

Yeah... very overkill.

> One idea that i think has real merit would be to combine
> some kind of change-rate score with an evaluation of
> comparative sizes of the tempfile and the original file to
> decide if replacing the original or leaving it would be more
> efficient.  If there was no data-reuse then replacement
> would be in order.  If there was a high rate of reuse it
> wouldn't.  If the reuse was middling you would consider the
> comparative sizes.  The formula would probably be pretty
> simple.  If someone comes up with a patch that does that i'd
> be willing to entertain it.

I'm not convinced this would be worth the effort... I'm sure in many
cases the beginning of the file is where most of the changes are, so
throwing away the end on the basis of poor matches at the start is a bad
idea.
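
For reference, the replace-or-keep scoring quoted above could look
something like the Python sketch below. Everything in it is an assumption
made for illustration: the function name, the three thresholds, and the way
sizes are compared in the middling case. Nothing like this exists in rsync
as far as this thread is concerned.

```python
def should_replace_basis(matched_bytes, partial_size, basis_size,
                         low=0.1, high=0.9, size_ratio=0.5):
    """Decide whether the interrupted tempfile should replace the original.

    matched_bytes: bytes of the tempfile that were copied from the basis.
    partial_size:  size of the tempfile when the transfer was interrupted.
    basis_size:    size of the original file.
    low, high, size_ratio: hypothetical thresholds, chosen arbitrarily.
    """
    if partial_size == 0:
        return False          # nothing was transferred, keep the original
    if basis_size == 0:
        return True           # nothing worth keeping
    reuse = matched_bytes / partial_size
    if reuse <= low:
        return True           # no real data reuse: replace
    if reuse >= high:
        return False          # high reuse: the original is the better basis
    # Middling reuse: fall back to comparing the two sizes.
    return partial_size / basis_size >= size_ratio
```

Note that the score only measures reuse within the part that was
transferred, so it says nothing about how well the untransferred tail of
the original would match, which is the objection above.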

-- 
Donovan Baarda <abo at minkirri.apana.org.au>