[cifs-protocol] [MS-XCA] LZ77+ Huffman questions EOF/256 and decoded file length

Fri Oct 14 04:23:17 UTC 2022

hi Dochelp and others,

I have questions about the role of the LZ77 + Huffman EOF symbol.

The 0x100 byte encodes a match (of distance 1, length 3) and is also used as an 
EOF marker (the equivalence is made explicit in the section 3.2 example, and 
implied elsewhere).

I have a few questions around how this works, and a suggestion that transmitting 
the original length is an essential part of the algorithm that is not described.

1. is the 256 symbol ever used as a match? I see nothing in MS-XCA to suggest 
that it is not, but I have seen implementations that assume it won't be (i.e. 
the only 256 will be the EOF).

2. In "2.1.4.3 Final Encoding Phase", 256 is just an "example" of and 
end-of-data marker that "certain implementations" require:

> Implementations of the decompression algorithm may expect an extra symbol to mark the end of the
> data. For example, certain implementations fail during decompression if the Huffman symbol 256 is
> not found after the actual data. For this reason, the encoding algorithm SHOULD append this symbol
> and increment the count of symbol 256 before the Huffman codes are constructed.

The first two sentences seem to suggest there might be other choices for the EOF 
symbol, but I take it this is accidental. It's always 256, right?

I also read in this that some decompression implementations ignore (i.e. don't 
insist upon) the EOF, noticing the end of transmission by other means. Is that a 
valid reading?

3. The decompression algorithm in "2.2.4 Processing" contains this clause:

     Else If HuffmanSymbol == 256 and
             the entire input buffer has been read and
             the expected decompressed size has been written to the output buffer
         Decompression is complete. Return with success.

which *seems* clear enough. Unfortunately the exact end of the input buffer is a 
little indistinct (I'll show what I mean shortly), and the 256 symbol is perhaps 
optional, so the only definite end condition is the size of the output buffer.

I guess the decompressed length is transmitted out-of-band in protocol specific 
ways, but it is essential for the algorithm to work.

OK, here's what I mean by "reading the entire input buffer" being imprecise. 
Consider the second example in section 3.2, which decompresses into 100 
repetitions of "abc". The first 256 bytes set up the Huffman table, which looks 
like (decoded by hand, forgive mistakes):

  c:   00
  EOF: 01
  287: 10
  a:   110
  b:   111

Then we have seven bytes:

      a8 dc 00 00 ff 26 01

 From 2.2.4 we read the first 32 bits as two little-endian 16 bit numbers, so 
we're looking at the bits:

    110-111-00  10-1-01-000   00000000   00000000
     a  b   c  287 | EOF
                   |
                   this is the 287 distance low bit.

287 stands for "a long match at distance 3", where "long" means read the 
distance from subsequent bytes; first it takes the 0xff, which means read two 
more, so it takes 0x0126, which is the 297 repetitions we need, and which ends 
the input stream, in one sense. But of the 32 bits we read in, 19 remain 
unexamined. We don't look at them, so how do we know they don't, for example, go:

   110-111-00  10-1-01-00-0  0-10-1-01-00000   00000000
    a  b   c  287  ccc c    c 287   EOF

meaning that the supposed EOF was not EOF at all but a match of (1, 3), 
repeating 'c's, then there could be other symbols, then an EOF later on. How do 
we know it doesn't do that? because we know the expected output length.

I have more LZ77 + Huffman questions coming soon.

cheers,
Douglas