Compressed backup
Ben Escoto
bescoto at stanford.edu
Tue Jun 4 10:07:01 EST 2002
>>>>> "KE" == Kevin Easton <s3159795 at student.anu.edu.au>
>>>>> wrote the following on Tue, 4 Jun 2002 17:43:17 +1000
KE> When I finally took the time to properly read Rusty's
KE> "gzip-rsyncable" patch[1] while writing this mail, I discovered
KE> that it appears to use this same general technique, although the
KE> heuristic he has chosen is "the sum of the previous 4096 bytes
KE> of source text is divisible by 4096". I think my heuristic
KE> could allow the gzips to sync up faster (after 12 identical
KE> bytes of source text following a change, compared to 4096),
KE> whilst still having the same compression ratio hit (resets every
KE> 4096 bytes, on average), but I've only come up with this today
KE> so I haven't done as much thinking on the subject as Rusty -
KE> likely there is some "gotcha" that I haven't seen.
That is interesting. Thank you very much for presenting that idea so
clearly. As for your suggestion, if you don't mind speculation from
someone who has never read the patch, it could be that he chose the mod
4096 scheme so that it would handle different kinds of data better.
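If I understand the two descriptions correctly, the heuristics might be
sketched roughly like this (my own reconstruction from the descriptions
above, not code from either patch; the window size, bit pattern, and
function names are all assumptions):

```python
def sum_resets(data, window=4096):
    """Rusty's heuristic, as described: reset wherever the sum of the
    previous `window` bytes is divisible by `window`.  A rolling sum
    keeps the cost at O(1) per byte."""
    points, s = [], 0
    for i, b in enumerate(data):
        s += b
        if i >= window:
            s -= data[i - window]      # slide the window forward
        if i >= window - 1 and s % window == 0:
            points.append(i)
    return points

def lsb_resets(data, pattern=0b101111010101, nbits=12):
    """Kevin's heuristic, as described: reset wherever the least
    significant bits of the previous `nbits` bytes spell `pattern`."""
    points, bits = [], 0
    mask = (1 << nbits) - 1
    for i, b in enumerate(data):
        bits = ((bits << 1) | (b & 1)) & mask
        if i >= nbits - 1 and bits == pattern:
            points.append(i)
    return points
```

On random input both should fire about once every 4096 bytes: the
rolling sum mod 4096 is close to uniform, and a fixed 12-bit pattern
matches with probability 2^-12.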
For instance, your "the least significant bits of the previous 12
bytes are 101111010101" might be better for random data, but also
might never generate a reset for files with long stretches of
identical bytes, perhaps with a few variations thrown in. His would
trigger a reset after any block of 4096 identical characters.
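To make that degenerate case concrete (again just a sketch under the
same assumptions about window size and pattern): on a run of one
repeated byte value v, every 4096-byte window sums to 4096*v, which is
always divisible by 4096, while the 12 trailing least-significant bits
are all identical and so can never match a mixed pattern.

```python
v = 7
run = bytes([v]) * 8192

# Rusty's test: any 4096-byte window sums to 4096 * v, so a reset
# fires at every offset in the run.
window_sum = sum(run[:4096])
assert window_sum == 4096 * v and window_sum % 4096 == 0

# Kevin's test: the 12 trailing least-significant bits are all equal
# (here 111111111111), so a mixed pattern such as 101111010101 can
# never match anywhere in the run.
trailing_lsbs = 0
for b in run[:12]:
    trailing_lsbs = (trailing_lsbs << 1) | (b & 1)
assert trailing_lsbs == 0b111111111111
assert trailing_lsbs != 0b101111010101
```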
Anyway, that is just one example; the mod 4096 scheme seems to be
"more random" and less data-dependent than yours, though I don't have a
good way of stating this precisely, or of proving that his scheme
maximizes this property.
--
Ben Escoto