[cifs-protocol] [EXTERNAL] [MS-XCA] LZ77+ Huffman: questions about blocks - TrackingID#2210140040006030

Douglas Bagnall douglas.bagnall at catalyst.net.nz
Wed Nov 9 22:51:06 UTC 2022


On 9/11/22 13:24, Douglas Bagnall via cifs-protocol wrote:
> hi Jeff,
> 
> That raises the question of how it encodes a match that long.
> 
> As far as the MS-XCA is concerned the longest possible match length is 65,538.
> 
> If the 2-byte length is not enough, does it go on to use a 4-byte length?

To address my own question I made a file with 67108886 bytes of 
"abcdefghijklmnopqrstuvwxyz". That's 22 bytes more than binary 64M (2^26). The 
SHA1 of this data is 204dabbdfce9ed63a99c9d777aae5dac8fd8ae87. Using the Windows 
Compression API, I got 283 bytes:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000030  50 55 55 55 55 55 55 55  55 55 55 55 45 04 00 00  |PUUUUUUUUUUUE...|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000080  04 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000a0  00 00 00 00 00 00 00 40  00 00 00 00 00 00 00 00  |....... at ........|
000000b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000100  54 42 35 b6 84 cf 3a 65  d7 56 75 c6 77 be 01 df  |TB5...:e.Vu.w...|
00000110  20 3a 00 00 ff 00 00 f9  ff ff 03                 | :.........|
0000011b

The "PUUUUUUUUUUUE" line is the Huffman codes for the letters, the 04 at byte 
0x80 is the EOF/256 (all as expected). The 40 at byte 0xa7 means there's a code 
for 0x14f, which is saying a match ('1__') at a distance with 4 less significant 
bits ('_4_'), for longer than 15 + 3 ('__f'). The distance is going to be 26, so 
the distance bits are going to be 26-16 == 1010 binary.

The 4 bit codes will be
0000  y
0001  z
0010  EOF
0011  match

The 5 bit codes, well, we don't need to work them out because we know they are 
'a' to 'x' therefore the first 24 codes of 5 bits == 120 bits. After those come 
the 4 bit 'y' and 'z', which is really handy because it ends on a nice round 
128. And looking at the first 128 bits,

    54 42 35 b6 84 cf 3a 65  d7 56 75 c6 77 be 01 df

it indeed looks like the last 2 bytes, read as the little-endian 16 bit 0xdf01,
contain the last of the alphabet ('x', being the last 5, must be 11111, before 
that 'w' must be 11110, so we expect [11] 110-11111-0000-0001, which is df01.

That means all the interesting stuff is in the last line:

      20 3a 00 00 ff 00 00 f9  ff ff 03

0x3a20 is 0011-1010-0010-0000, which is the match code, followed by the match 
distance, followed by the EOF, followed by padding zeros. The next two bytes are 
also padding zeros. Then the 0xff is the byte for the match length. Because it 
is 0xff, two more bytes will be read as a 16-bit length. These are zero, so the 
16 bit length is zero.

Now we're at the undocumented bit. Somehow we have to make "f9 ff ff 03" mean 
"67108886 - 26". If we read it as little-endian 32 bit, we get 0x03fffff9, which 
is 67108857. And if, as with the other lengths, we add 3, we get 67108860, which 
is 26 less than 67108886, the answer we want.

Is that how it really works? Should it be mentioned in MS-XCA?

Specifically, is a zero 16 bit length the trigger to look for a 32 bit length?

Note: although you found a boundary at 64M, I did not. The most likely 
explanation is that (due to a personal inability to compile and link using 
ntifs.h/ntoskrnl) I am using the userspace Compression API which drives the 
underlying  Rtl* functions in its own way, perhaps setting this mega-block size 
differently. Here they mention a 1GB "internal block size":

https://learn.microsoft.com/en-us/windows/win32/cmpapi/using-the-compression-api

which is maybe the same thing that is set to 64MB in MS-XCA code?

cheers,
Douglas



> Are you able to send me the compressed file?
> 
> thanks,
> Douglas
> 
> 
> On 9/11/22 12:44, Jeff McCashland (He/him) wrote:
>> Hi Douglas,
>>
>> I've found something interesting while researching this issue that might help.
>>
>> In the initial LZ77 encoding phase, matches are searched for in 64k blocks, as 
>> documented. However, when determining the length of the match, Windows will 
>> keep searching as long as the match continues, even if it continues through 
>> multiple 64k blocks, up to a total of 64 MB. I created a file > 64MB with 
>> lowercase a-z repeated so that the first match actually goes to the end of the 
>> 64 MB 'super-block'. The length of the match worked out to just under 64 MB, 
>> and the next pass started with the remainder of the file after 64MB.
>>
>> I hope that helps to explain some of the oddities you're seeing.
>>
>> Best regards,
>> Jeff McCashland (He/him) | Senior Escalation Engineer | Microsoft Protocol 
>> Open Specifications Team
>> Phone: +1 (425) 703-8300 x38300 | Hours: 9am-5pm | Time zone: (UTC-08:00) 
>> Pacific Time (US and Canada)
>> Local country phone number found here: 
>> http://support.microsoft.com/globalenglish | Extension 1138300
>>
>> -----Original Message-----
>> From: Jeff McCashland (He/him)
>> Sent: Thursday, October 27, 2022 9:57 AM
>> To: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>; 
>> cifs-protocol at lists.samba.org; Samuel Cabrero (Samba) <scabrero at samba.org>
>> Cc: Microsoft Support <supportmail at microsoft.com>
>> Subject: RE: [EXTERNAL] [MS-XCA] LZ77+ Huffman: questions about blocks - 
>> TrackingID#2210140040006030
>>
>> Hi Douglas,
>>
>> Thank you for the fast response. I will continue digging into this.
>>
>> Best regards,
>> Jeff McCashland (He/him) | Senior Escalation Engineer | Microsoft Protocol 
>> Open Specifications Team
>> Phone: +1 (425) 703-8300 x38300 | Hours: 9am-5pm | Time zone: (UTC-08:00) 
>> Pacific Time (US and Canada) Local country phone number found here: 
>> http://support.microsoft.com/globalenglish | Extension 1138300
>>
>> -----Original Message-----
>> From: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
>> Sent: Wednesday, October 26, 2022 2:10 PM
>> To: Jeff McCashland (He/him) <jeffm at microsoft.com>; 
>> cifs-protocol at lists.samba.org; Samuel Cabrero (Samba) <scabrero at samba.org>
>> Cc: Microsoft Support <supportmail at microsoft.com>
>> Subject: Re: [EXTERNAL] [MS-XCA] LZ77+ Huffman: questions about blocks - 
>> TrackingID#2210140040006030
>>
>> hi Jeff,
>>
>> Thanks. Yes, I think you're understanding correctly, and that is a valid 
>> answer, and I *would* have happily accepted it, but in the meantime I have had 
>> the misfortune of re-reading MS-XCA again and again, and I now believe it 
>> contradicts this view.
>>
>> In 2.1.4.3 there is:
>>
>>     The following pseudocode demonstrates the encoding method.
>>
>>          Write the 256-byte table of symbol bit lengths
>>          While there are more literals or matches to encode
>>             [[write bits per the algorithm, not in question here]]
>>          WriteBits(SymbolLength[256], SymbolCode[256])
>>          FlushBits()
>>
>> This appears to be encoding a single block (there's one 256-byte table), and 
>> it ends with the FlushBits(), which is essentially the "ignore ghi..." in my 
>> example. However it also has a "WriteBits(SymbolLength[256], 
>> SymbolCode[256])", which I understand should only happen at the end of the 
>> last block.
>>
>> I think it would be accurate to say this pseudocode "demonstrates the encoding 
>> method for a message of 65536 or fewer bytes", but is unclear for multi-block 
>> messages.
>>
>>
>> And in section 2.2.4 the main decompression pseudocode loop starts like:
>>
>>     Loop until a decompression terminating condition
>>         Build the decoding table
>>         CurrentPosition = 256     // start at the end of the Huffman table
>>         NextBits = Read16Bits(InputBuffer + CurrentPosition)
>>         CurrentPosition += 2
>>         NextBits <<= 16
>>         NextBits |= Read16Bits(InputBuffer + CurrentPosition)
>>         CurrentPosition += 2
>>         ExtraBitCount = 16
>>
>>
>> which suggests that the bits "ghi..." are discarded because we are told 
>> implicitly in the text that Read16bits shifts input into a 32 bit register -- 
>> if we call it twice at the beginning of each block, whatever was in the 
>> register has to fall out the other end.
>>
>> cheers,
>> Douglas
>>
>>
>> On 27/10/22 05:48, Jeff McCashland (He/him) wrote:
>>> Hi Douglas,
>>>
>>> As I understand, each 64k block is processed separately. In other words, the 
>>> first 64k block is LZ77 compressed, then the Huffman codes are constructed 
>>> based on symbol frequency in that 64k. If, in your example, DEF ends the 64k 
>>> block, then the subsequent ghi... will be processed with the second 64k block 
>>> and Huffman table, and not dropped.
>>>
>>> Am I understanding your question correctly?
>>>
>>> Best regards,
>>> Jeff McCashland (He/him) | Senior Escalation Engineer | Microsoft
>>> Protocol Open Specifications Team
>>> Phone: +1 (425) 703-8300 x38300 | Hours: 9am-5pm | Time zone:
>>> (UTC-08:00) Pacific Time (US and Canada) Local country phone number
>>> found here:
>>> https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsuppo
>>> rt.microsoft.com%2Fglobalenglish&data=05%7C01%7Cjeffm%40microsoft.
>>> com%7C2eed4b9f9fda44b7458d08dab79674a8%7C72f988bf86f141af91ab2d7cd011d
>>> b47%7C1%7C0%7C638024154089025057%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wL
>>> jAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&
>>> amp;sdata=gToAFtZwgAnmsIJgYkP9aVhup%2Blb5zsDN%2Bajebbzh18%3D&reser
>>> ved=0 | Extension 1138300
>>>
>>> -----Original Message-----
>>> From: Jeff McCashland (He/him)
>>> Sent: Friday, October 14, 2022 1:58 PM
>>> To: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>;
>>> cifs-protocol at lists.samba.org; Samuel Cabrero (Samba)
>>> <scabrero at samba.org>
>>> Cc: Microsoft Support <supportmail at microsoft.com>
>>> Subject: RE: [EXTERNAL] [MS-XCA] LZ77+ Huffman: questions about blocks
>>> - TrackingID#2210140040006030
>>>
>>> [Tom to BCC]
>>>
>>> Hi Douglas,
>>>
>>> I'll research this question and let you know what I learn.
>>>
>>> Best regards,
>>> Jeff McCashland (He/him) | Senior Escalation Engineer | Microsoft
>>> Protocol Open Specifications Team
>>> Phone: +1 (425) 703-8300 x38300 | Hours: 9am-5pm | Time zone:
>>> (UTC-08:00) Pacific Time (US and Canada) Local country phone number
>>> found here:
>>> https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsuppo
>>> rt.microsoft.com%2Fglobalenglish&data=05%7C01%7Cjeffm%40microsoft.
>>> com%7C2eed4b9f9fda44b7458d08dab79674a8%7C72f988bf86f141af91ab2d7cd011d
>>> b47%7C1%7C0%7C638024154089025057%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wL
>>> jAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&
>>> amp;sdata=gToAFtZwgAnmsIJgYkP9aVhup%2Blb5zsDN%2Bajebbzh18%3D&reser
>>> ved=0 | Extension 1138300
>>>
>>> -----Original Message-----
>>> From: Tom Jebo <tomjebo at microsoft.com>
>>> Sent: Friday, October 14, 2022 9:45 AM
>>> To: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>;
>>> cifs-protocol at lists.samba.org; Samuel Cabrero (Samba)
>>> <scabrero at samba.org>
>>> Cc: Microsoft Support <supportmail at microsoft.com>
>>> Subject: RE: [EXTERNAL] [MS-XCA] LZ77+ Huffman: questions about blocks
>>> - TrackingID#2210140040006030
>>>
>>> [dochelp to bcc]
>>> [casemail cc]
>>>
>>> Hi Douglas,
>>>
>>> Thank you for your request. One of the Open Specifications team will respond 
>>> to start working with you. I have created case 2210140040006030 and added the 
>>> number to the subject of this email. Please refer to this case number in 
>>> future communications regarding this issue.
>>>
>>> Best regards,
>>> Tom Jebo
>>> Sr Escalation Engineer
>>> Microsoft Open Specifications
>>>
>>> -----Original Message-----
>>> From: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
>>> Sent: Thursday, October 13, 2022 9:57 PM
>>> To: Interoperability Documentation Help <dochelp at microsoft.com>;
>>> cifs-protocol at lists.samba.org; Samuel Cabrero (Samba)
>>> <scabrero at samba.org>
>>> Subject: [EXTERNAL] [MS-XCA] LZ77+ Huffman: questions about blocks
>>>
>>> hi Dochelp,
>>>
>>>
>>> Does the beginning of the second and subsequent blocks break the bitstream, 
>>> starting again at a byte boundary after the new Huffman table?
>>>
>>> The question is best explained by analogy to the way long lengths are handled 
>>> in matches. Suppose we have a match symbol in the middle of a bitstream, and 
>>> the match is a long one, requiring the reading of an extra byte:
>>>
>>>       ijklmnop  abcDEFgh [distance] qrs...
>>>                    |
>>>                    [match 1, 15]
>>>
>>> Here abc, ghi.. are the sequence of bits in the stream around the match DEF, 
>>> which is read in alternating bytes by little-endian rules, and the distance 
>>> is plonked in the middle of the stream as an individual byte. The stream just 
>>> flows around it, so gh-ijklmnop are interpreted after [distance].
>>>
>>> Now, if DEF instead ended the block:
>>>
>>>       ijklmnop  abcDEFgh [new Huffman table] qrs...
>>>                    |
>>>                    [ends the block (64k)]
>>>
>>>
>>> would the bits gh-jklmnop be interpreted using the new Huffman table, as part 
>>> of the new block, or would those bits be dropped?
>>>
>>> Multi-block examples would of course be helpful.
>>>
>>>
>>> Douglas
>>>
>>>
>>
> 
> 
> _______________________________________________
> cifs-protocol mailing list
> cifs-protocol at lists.samba.org
> https://lists.samba.org/mailman/listinfo/cifs-protocol




More information about the cifs-protocol mailing list