Compressed backup
Ben Escoto
bescoto at stanford.edu
Tue Jun 4 10:19:02 EST 2002
>>>>> "BE" == Ben Escoto <bescoto at stanford.edu>
>>>>> wrote the following on Tue, 04 Jun 2002 10:02:58 -0700
KE> When I finally took the time to properly read Rusty's
KE> "gzip-rsyncable" patch[1] while writing this mail, I discovered
KE> that it appears to use this same general technique, although the
KE> heuristic he has chosen is "the sum of the previous 4096 bytes
KE> of source text is divisible by 4096". I think my heuristic
KE> could allow the gzips to sync up faster (after 12 identical
KE> bytes of source text following a change, compared to 4096),
KE> whilst still having the same compression ratio hit (resets every
KE> 4096 bytes, on average), but I've only come up with this today
KE> so I haven't done as much thinking on the subject as Rusty -
KE> likely there is some "gotcha" that I haven't seen.
BE> Anyway, that is just an example, the mod 4096 scheme seems
BE> to be "more random" and less data dependent than yours, but I
BE> don't have a good way of saying this clearly, or proving that
BE> his scheme maximizes this property.
Sorry, how about this: suppose the data is generated by a
Bernoulli process, so each byte is independent of every other and
drawn from the same probability distribution. Then it appears the
probability of his scheme generating a reset at any given byte is
still 1/4096. Yours, however, may have anywhere between a 0 chance
and a (8/12)^8 * (4/12)^4 =approx 1/2075 chance (because you included
more 1's than 0's - different people may have more equally distributed
initials...). So at least in this sense, his is constant across
different probability distributions while yours can vary more
dramatically.
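As a sanity check on that arithmetic (assuming, as described above, a fixed 12-bit trigger pattern containing eight 1's and four 0's, with each source bit independently 1 with probability p), a few lines of Python confirm that the match probability p^8 * (1-p)^4 peaks at p = 8/12:

```python
# Probability that 12 independent bits, each 1 with probability p,
# match a fixed pattern containing eight 1's and four 0's.
def match_prob(p):
    return p**8 * (1 - p)**4

# Worst case over Bernoulli bit distributions: scan p over [0, 1].
best_p = max((i / 1000 for i in range(1001)), key=match_prob)
peak = match_prob(8 / 12)

print(best_p)    # close to 8/12 =approx 0.667
print(1 / peak)  # about 2076, i.e. peak =approx 1/2075
```

This is just the usual maximum-likelihood result: p^8 * (1-p)^4 is maximized when p equals the fraction of 1's in the pattern, 8/12. A pattern missing one of the bit values entirely gives the 0 chance at the other extreme.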
Of course, real data generally isn't generated by a Bernoulli
process, but I think this is a pretty good model and the results
would hold for similar models (e.g. a low-order Markov process).
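For what it's worth, a quick simulation (a sketch of the mod-4096 rule as described above, not Rusty's actual patch; the window size, modulus, and byte distributions are my choices) suggests the reset rate stays near 1/4096 even when the byte distribution is heavily skewed:

```python
import random

def reset_rate(data, window=4096, modulus=4096):
    """Fraction of positions where the sum of the previous `window`
    bytes is divisible by `modulus` (a rolling-sum reset rule)."""
    total = 0
    hits = 0
    for i, b in enumerate(data):
        total += b
        if i >= window:
            total -= data[i - window]
        if i >= window - 1 and total % modulus == 0:
            hits += 1
    return hits / max(len(data) - window + 1, 1)

random.seed(0)
uniform = bytes(random.randrange(256) for _ in range(400_000))
skewed = bytes(random.choices(range(256),
                              weights=[256 - v for v in range(256)],
                              k=400_000))

# Both rates come out near 1/4096 =approx 0.000244.
print(reset_rate(uniform))
print(reset_rate(skewed))
```

The rolling sum mod 4096 mixes across all byte values, which is one way of making the "more random, less data dependent" claim concrete.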
--
Ben Escoto