[cifs-protocol] [MS-XCA] LZ77+ Huffman questions EOF/256 and decoded file length - - TrackingID#2210140040005892

Mon Nov 14 18:32:17 UTC 2022

Hi Douglas:
Here is the answer to your following question:

Q. 1. is the 256 symbol ever used as a match? I see nothing in MS-XCA to suggest that it is not, but I have seen implementations that assume it won't be (i.e. 
the only 256 will be the EOF).

A: The symbol 256 is treated as EOF at the following points. Other than these, there is no special treatment:
1. Windows insert symbol 256 at the end of input buffer when compressing
2. When decompressing and Windows does not encounter EOF symbol and output position is at the output buffer end, Windows finish the decompression and does not generate an error for not finding EOF. This underscores the importance of providing the size of the output  that is exactly equal to what was the size of the input when compressing.

Please let me know if this does not answer your question.

Regards,
Obaid Farooqi
Escalation Engineer | Microsoft

-----Original Message-----
From: Douglas Bagnall <douglas.bagnall at catalyst.net.nz> 
Sent: Thursday, October 13, 2022 11:23 PM
To: Interoperability Documentation Help <dochelp at microsoft.com>; cifs-protocol at lists.samba.org; Aurélien Aptel <aaptel at suse.com>; Samuel Cabrero (Samba) <scabrero at samba.org>
Subject: [EXTERNAL] [MS-XCA] LZ77+ Huffman questions EOF/256 and decoded file length

hi Dochelp and others,

I have questions about the role of the LZ77 + Huffman EOF symbol.

The 0x100 byte encodes a match (of distance 1, length 3) and is also used as an EOF marker (the equivalence is made explicit in the section 3.2 example, and implied elsewhere).

I have a few questions around how this works, and a suggestion that transmitting the original length is an essential part of the algorithm that is not described.

1. is the 256 symbol ever used as a match? I see nothing in MS-XCA to suggest that it is not, but I have seen implementations that assume it won't be (i.e. 
the only 256 will be the EOF).

2. In "2.1.4.3 Final Encoding Phase", 256 is just an "example" of and end-of-data marker that "certain implementations" require:

> Implementations of the decompression algorithm may expect an extra 
> symbol to mark the end of the data. For example, certain 
> implementations fail during decompression if the Huffman symbol 256 is 
> not found after the actual data. For this reason, the encoding algorithm SHOULD append this symbol and increment the count of symbol 256 before the Huffman codes are constructed.

The first two sentences seem to suggest there might be other choices for the EOF symbol, but I take it this is accidental. It's always 256, right?

I also read in this that some decompression implementations ignore (i.e. don't insist upon) the EOF, noticing the end of transmission by other means. Is that a valid reading?

3. The decompression algorithm in "2.2.4 Processing" contains this clause:

     Else If HuffmanSymbol == 256 and
             the entire input buffer has been read and
             the expected decompressed size has been written to the output buffer
         Decompression is complete. Return with success.

which *seems* clear enough. Unfortunately the exact end of the input buffer is a little indistinct (I'll show what I mean shortly), and the 256 symbol is perhaps optional, so the only definite end condition is the size of the output buffer.

I guess the decompressed length is transmitted out-of-band in protocol specific ways, but it is essential for the algorithm to work.

OK, here's what I mean by "reading the entire input buffer" being imprecise. 
Consider the second example in section 3.2, which decompresses into 100 repetitions of "abc". The first 256 bytes set up the Huffman table, which looks like (decoded by hand, forgive mistakes):

  c:   00
  EOF: 01
  287: 10
  a:   110
  b:   111

Then we have seven bytes:

      a8 dc 00 00 ff 26 01

 From 2.2.4 we read the first 32 bits as two little-endian 16 bit numbers, so we're looking at the bits:

    110-111-00  10-1-01-000   00000000   00000000
     a  b   c  287 | EOF
                   |
                   this is the 287 distance low bit.

287 stands for "a long match at distance 3", where "long" means read the distance from subsequent bytes; first it takes the 0xff, which means read two more, so it takes 0x0126, which is the 297 repetitions we need, and which ends the input stream, in one sense. But of the 32 bits we read in, 19 remain unexamined. We don't look at them, so how do we know they don't, for example, go:

   110-111-00  10-1-01-00-0  0-10-1-01-00000   00000000
    a  b   c  287  ccc c    c 287   EOF

meaning that the supposed EOF was not EOF at all but a match of (1, 3), repeating 'c's, then there could be other symbols, then an EOF later on. How do we know it doesn't do that? because we know the expected output length.

I have more LZ77 + Huffman questions coming soon.

cheers,
Douglas