Compressed backup

Ben Escoto bescoto at stanford.edu
Tue Jun 4 10:19:02 EST 2002


>>>>> "BE" == Ben Escoto <bescoto at stanford.edu>
>>>>> wrote the following on Tue, 04 Jun 2002 10:02:58 -0700

  KE> When I finally took the time to properly read Rusty's
  KE> "gzip-rsyncable" patch[1] while writing this mail, I discovered
  KE> that it appears to use this same general technique, although the
  KE> heuristic he has chosen is "the sum of the previous 4096 bytes
  KE> of source text is divisible by 4096".  I think my heuristic
  KE> could allow the gzips to sync up faster (after 12 identical
  KE> bytes of source text following a change, compared to 4096),
  KE> whilst still having the same compression ratio hit (resets every
  KE> 4096 bytes, on average), but I've only come up with this today
  KE> so I haven't done as much thinking on the subject as Rusty -
  KE> likely there is some "gotcha" that I haven't seen.
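
To make that concrete, the rolling-sum trigger described above would
look roughly like the following.  This is only a sketch of the idea,
not code from Rusty's patch; the names, types, and window handling
here are my own.

    #include <stdint.h>

    #define WINDOW 4096

    /* Running sum of the last WINDOW bytes of uncompressed input.
       A compressor reset is signalled whenever that sum is divisible
       by WINDOW, which on random data fires about once every WINDOW
       bytes. */
    static uint8_t  win[WINDOW];
    static uint32_t sum = 0;
    static uint32_t pos = 0;

    static int reset_here(uint8_t byte)
    {
        sum -= win[pos];             /* drop the byte leaving the window */
        sum += byte;                 /* add the newly read byte          */
        win[pos] = byte;
        pos = (pos + 1) % WINDOW;
        return sum % WINDOW == 0;    /* the "divisible by 4096" test     */
    }

Once two compressors have seen the same WINDOW bytes after a change,
their sums agree and they reset at the same points from then on,
which is what lets the compressed streams line up again.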

  BE>     Anyway, that is just an example, the mod 4096 scheme seems
  BE> to be "more random" and less data dependent than yours, but I
  BE> don't have a good way of saying this clearly, or proving that
  BE> his scheme maximizes this property.

Sorry, how about this:  Let's suppose the data is generated by a
Bernoulli process, so each byte is independent of every other and
determined by the same probability distribution.  Then it appears the
probability of his scheme generating a reset at any given byte is
still 1/4096.  However, yours may have anywhere between a 0 chance
and a (8/12)^8 * (4/12)^4 (approximately 1/2075) chance (because you
included more
1's than 0's - different people may have more equally distributed
initials...).  So at least in this sense, his is constant across
different probability distributions while yours can vary more
dramatically.
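
Spelling out that arithmetic (just a throwaway check, in C):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Rusty's scheme: a reset at any given byte with probability
           1/4096, regardless of the byte distribution. */
        printf("mod 4096 scheme:  %g\n", 1.0 / 4096);

        /* The 12-byte scheme, assuming (as above) a trigger pattern
           with eight 1 bits and four 0 bits, one per byte: the most
           favorable Bernoulli source sets those bit probabilities to
           8/12 and 4/12, which maximizes the match probability. */
        double p = pow(8.0 / 12, 8) * pow(4.0 / 12, 4);
        printf("12-byte scheme, upper bound:  %g (about 1/%.1f)\n",
               p, 1.0 / p);
        return 0;
    }

That prints about 0.000244 for the first and about 0.00048 (the
1/2075 figure above) for the second.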

    Of course, real data generally isn't generated by a Bernoulli
process, but I think this is a pretty good model, and the results
would hold for similar types of source (e.g. a low-order Markov
process).


--
Ben Escoto



