Compressed backup

Ben Escoto bescoto at stanford.edu
Tue Jun 4 10:07:01 EST 2002


>>>>> "KE" == Kevin Easton <s3159795 at student.anu.edu.au>
>>>>> wrote the following on Tue, 4 Jun 2002 17:43:17 +1000

  KE> When I finally took the time to properly read Rusty's
  KE> "gzip-rsyncable" patch[1] while writing this mail, I discovered
  KE> that it appears to use this same general technique, although the
  KE> heuristic he has chosen is "the sum of the previous 4096 bytes
  KE> of source text is divisible by 4096".  I think my heuristic
  KE> could allow the gzips to sync up faster (after 12 identical
  KE> bytes of source text following a change, compared to 4096),
  KE> whilst still having the same compression ratio hit (resets every
  KE> 4096 bytes, on average), but I've only come up with this today
  KE> so I haven't done as much thinking on the subject as Rusty -
  KE> likely there is some "gotcha" that I haven't seen.

That is interesting.  Thank you very much for clearly presenting that
idea.  As for your suggestion, if you don't mind speculation from
someone who has never read the patch, it could be that he chose the
mod 4096 scheme so it would handle different kinds of data better.

    For instance, your "the least significant bits of the previous 12
bytes are 101111010101" heuristic might be better for random data,
but it might also never generate a reset for files with long
stretches of identical bytes, perhaps with a few variations thrown
in.  His would trigger a reset after any block of 4096 identical
characters.

    Anyway, that is just one example.  The mod 4096 scheme seems to be
"more random" and less data-dependent than yours, but I don't have a
good way of saying this clearly, or of proving that his scheme
maximizes this property.


--
Ben Escoto



