[cifs-protocol] [MS-XCA] LZ77+ Huffman questions EOF/256 and decoded file length
Douglas Bagnall
douglas.bagnall at catalyst.net.nz
Fri Oct 14 04:23:17 UTC 2022
hi Dochelp and others,
I have questions about the role of the LZ77 + Huffman EOF symbol.
The 0x100 byte encodes a match (of distance 1, length 3) and is also used as an
EOF marker (the equivalence is made explicit in the section 3.2 example, and
implied elsewhere).
I have a few questions around how this works, and a suggestion that transmitting
the original length is an essential part of the algorithm that is not described.
1. is the 256 symbol ever used as a match? I see nothing in MS-XCA to suggest
that it is not, but I have seen implementations that assume it won't be (i.e.
the only 256 will be the EOF).
2. In "2.1.4.3 Final Encoding Phase", 256 is just an "example" of and
end-of-data marker that "certain implementations" require:
> Implementations of the decompression algorithm may expect an extra symbol to mark the end of the
> data. For example, certain implementations fail during decompression if the Huffman symbol 256 is
> not found after the actual data. For this reason, the encoding algorithm SHOULD append this symbol
> and increment the count of symbol 256 before the Huffman codes are constructed.
The first two sentences seem to suggest there might be other choices for the EOF
symbol, but I take it this is accidental. It's always 256, right?
I also read in this that some decompression implementations ignore (i.e. don't
insist upon) the EOF, noticing the end of transmission by other means. Is that a
valid reading?
3. The decompression algorithm in "2.2.4 Processing" contains this clause:
Else If HuffmanSymbol == 256 and
the entire input buffer has been read and
the expected decompressed size has been written to the output buffer
Decompression is complete. Return with success.
which *seems* clear enough. Unfortunately the exact end of the input buffer is a
little indistinct (I'll show what I mean shortly), and the 256 symbol is perhaps
optional, so the only definite end condition is the size of the output buffer.
I guess the decompressed length is transmitted out-of-band in protocol specific
ways, but it is essential for the algorithm to work.
OK, here's what I mean by "reading the entire input buffer" being imprecise.
Consider the second example in section 3.2, which decompresses into 100
repetitions of "abc". The first 256 bytes set up the Huffman table, which looks
like (decoded by hand, forgive mistakes):
c: 00
EOF: 01
287: 10
a: 110
b: 111
Then we have seven bytes:
a8 dc 00 00 ff 26 01
From 2.2.4 we read the first 32 bits as two little-endian 16 bit numbers, so
we're looking at the bits:
110-111-00 10-1-01-000 00000000 00000000
a b c 287 | EOF
|
this is the 287 distance low bit.
287 stands for "a long match at distance 3", where "long" means read the
distance from subsequent bytes; first it takes the 0xff, which means read two
more, so it takes 0x0126, which is the 297 repetitions we need, and which ends
the input stream, in one sense. But of the 32 bits we read in, 19 remain
unexamined. We don't look at them, so how do we know they don't, for example, go:
110-111-00 10-1-01-00-0 0-10-1-01-00000 00000000
a b c 287 ccc c c 287 EOF
meaning that the supposed EOF was not EOF at all but a match of (1, 3),
repeating 'c's, then there could be other symbols, then an EOF later on. How do
we know it doesn't do that? because we know the expected output length.
I have more LZ77 + Huffman questions coming soon.
cheers,
Douglas
More information about the cifs-protocol
mailing list