[cifs-protocol] [EXTERNAL] Re: [MS-XCA] LZ77 + Huffman: is sometimes slightly more than 64k encoded in as block? - TrackingID#2211010040007989

Douglas Bagnall douglas.bagnall at catalyst.net.nz
Thu Nov 10 01:17:51 UTC 2022


hi Obaid,

Thanks!

As this thread has converged to a large extent with #2210140040006030 
(https://lists.samba.org/archive/cifs-protocol/2022-November/003829.html), we 
should probably shut down one or the other.

I have not observed the 1<<26 barrier, though Jeff also mentions it; in any case 
I'm glad I can ignore it for the protocol stuff.

Douglas

On 10/11/22 11:13, Obaid Farooqi wrote:
> Hi Douglas:
> The 64k block size for XPRESS_HUFF only applies when the longest match is shorter than 64k. If a match is longer than 64k, XPRESS_HUFF, during the LZ phase, keeps going until either the match ends or the input ends. Since the maximum input size is an unsigned long (0xffffffff), a match of up to 0xffffffff - 3 is allowed and will still be compressed as one block.
> 
> The 64k block size comes into the picture when the LZ phase breaks out of matching any given match. After that, the Huffman frequencies are updated. At this stage, the block size is checked against the 64k limit.
> 
> Another thing about the Compress API that you are using is that it also does its own chunking. The blocks it creates are of size 1<<26 (0x4000000, i.e. 67,108,864) bytes, then it calls RtlCompressBuffer, which does the 64k chunking (i.e., if the maximum match is less than 64k).
> 
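The 1<<26 figure is easier to check in code. This is only an arithmetic sketch: it assumes the wrapper slices its input at fixed 1<<26-byte boundaries, which the message implies but does not spell out, and `chunk_sizes` is a hypothetical helper, not part of any Windows API. The 509,038,866-byte size is borrowed from the test.txt example in this message purely as a large number.

```python
CHUNK = 1 << 26   # 67,108,864 bytes, i.e. 0x4000000


def chunk_sizes(total, chunk=CHUNK):
    """Sizes of the pieces when `total` bytes are cut at fixed boundaries.

    Hypothetical helper: it only illustrates the arithmetic, not the
    real behaviour of the Compress API.
    """
    full, rest = divmod(total, chunk)
    return [chunk] * full + ([rest] if rest else [])


sizes = chunk_sizes(509_038_866)
print(hex(CHUNK), len(sizes), sizes[-1])
```

Under that assumption a file of this size would be handed to RtlCompressBuffer in eight pieces, the last one shorter than the rest.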
> Here is an example
> Note the size of test.txt and its compressed version, test.compressed:
> 11/09/2022  02:46 PM                       267 test.compressed
> 11/09/2022  11:20 AM       509,038,866 test.txt
> 
> test.txt contains "abc" repeated.
> 
> Here is a hex dump of test.compressed:
> 
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 30 23 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> A8 DC 00 00 FF 00 00 0C 51 57 1E
> 
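The dump above can be checked by hand or with a few lines of code. What follows is a minimal sketch of a single-block tokeniser written from the MS-XCA decoding pseudocode, not Microsoft's implementation. It deliberately ignores the 64k block boundary and simply reads until the EOF symbol, and it assumes the 32-bit length form that the dump appears to use (a 16-bit length of zero followed by four more bytes). The name `tokens` is made up for this illustration. It yields tokens instead of expanding the data, so nothing close to 509 MB is allocated.

```python
import struct


def tokens(block):
    """Yield the LZ77 tokens of one XPRESS_HUFF block (table + bitstream)."""
    # 256-byte table: two 4-bit code lengths per byte, low nibble first.
    lengths = [n for b in block[:256] for n in (b & 15, b >> 4)]

    # Canonical Huffman codes: sort by (length, symbol) and count upwards.
    codes, code, prev = {}, 0, 0
    for length, sym in sorted((l, s) for s, l in enumerate(lengths) if l):
        code <<= length - prev
        codes[(length, code)] = sym
        code, prev = code + 1, length

    pos = 256

    def read(fmt):
        nonlocal pos
        (value,) = struct.unpack_from(fmt, block, pos)
        pos += struct.calcsize(fmt)
        return value

    # Bit window, refilled 16 bits (little-endian) at a time.
    window = (read("<H") << 16) | read("<H")
    valid = 32

    def take(n):
        nonlocal window, valid
        valid -= n
        value = window >> valid
        window &= (1 << valid) - 1
        if valid < 16:
            # Past the end of the input, pad with zero bits.
            word = read("<H") if pos + 2 <= len(block) else 0
            window = (window << 16) | word
            valid += 16
        return value

    while True:
        length = code = 0
        while (length, code) not in codes:
            if length == 15:
                raise ValueError("no Huffman code matches the bitstream")
            code = (code << 1) | take(1)
            length += 1
        sym = codes[(length, code)]
        if sym < 256:
            yield ("literal", sym)
        elif sym == 256 and pos >= len(block):
            yield ("eof",)
            return
        else:
            distance_bits, n = divmod(sym - 256, 16)
            if n == 15:                  # length continues in extra bytes
                n = read("<B")
                if n == 255:
                    n = read("<H")
                    if n == 0:           # 32-bit form for very long matches
                        n = read("<I")
                    if n < 15:
                        raise ValueError("invalid match length")
                    n -= 15
                n += 15
            n += 3
            distance = (1 << distance_bits) + take(distance_bits)
            yield ("match", distance, n)


# Rebuild test.compressed from the hex dump: only four table bytes are set.
table = bytearray(256)
table[0x30], table[0x31] = 0x30, 0x23   # 'a': 3 bits, 'b': 3 bits, 'c': 2 bits
table[0x80], table[0x8F] = 0x02, 0x20   # 0x100 (EOF): 2 bits, 0x11f: 2 bits
stream = bytes.fromhex("A8 DC 00 00 FF 00 00 0C 51 57 1E")

result = list(tokens(bytes(table) + stream))
total = sum(t[2] if t[0] == "match" else 1 for t in result if t[0] != "eof")
print(result)
print(total)
```

Read this way, the block is the three literals "abc", one match at distance 3 with length 0x1E57510C + 3 = 509,038,863, then EOF, which adds up to the 509,038,866 bytes of test.txt.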
> On the other hand, look at the following sizes:
> 
> 11/03/2022  01:37 AM           250,806 ms-xca
> 11/09/2022  11:30 AM           244,083 ms-xca.compressed
> 
> Since ms-xca does not have long matches, the compressed version is not much smaller than the original file.
> 
> I used RtlCompressBuffer directly for the two examples above.
> 
> I have filed a bug against MS-XCA to include this nuance in it.
> 
> Please let me know if this does not answer your question.
> 
> 
> Regards,
> Obaid Farooqi
> Escalation Engineer | Microsoft
> 
> -----Original Message-----
> From: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
> Sent: Wednesday, November 2, 2022 3:06 PM
> To: Obaid Farooqi <obaidf at microsoft.com>
> Cc: cifs-protocol at lists.samba.org
> Subject: [EXTERNAL] Re: [MS-XCA] LZ77 + Huffman: is sometimes slightly more than 64k encoded in as block? - TrackingID#2211010040007989
> 
> Hi Obaid,
> 
> I added the compressed and uncompressed files.
> 
> Note that the Compress API does not add the 28 byte header with the COMPRESS_RAW flag, which I have been using.
> 
> cheers,
> Douglas
> 
> 
> On 3/11/22 06:23, Obaid Farooqi wrote:
>> [Kristian to Bcc]
>> Hi Douglas:
>> I will help you with this issue and will be in touch as soon as possible.
>> Is it possible to send me the file you're compressing? If it is, please upload the file to the following link:
>>
>>
>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsupp
>> ort.microsoft.com%2Ffiles%3Fworkspace%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJSUz
>> I1NiJ9.eyJ3c2lkIjoiOWQ5ZjVlMmUtYjEyMS00YTdiLTk2ZmMtMTUyMDYxNDdkYzFkIiw
>> ic3IiOiIyMjExMDEwMDQwMDA3OTg5IiwiYXBwaWQiOiI0ZTc2ODkxZC04NDUwLTRlNWUtY
>> mUzOC1lYTNiZDZlZjIxZTUiLCJzdiI6InYxIiwicnMiOiJFeHRlcm5hbCIsInd0aWQiOiI
>> yMzYwODJlMS02ODdhLTQ1NWYtYTRkZi1hNTM4ZWJhYzM0OWUiLCJpc3MiOiJodHRwczovL
>> 2FwaS5kdG1uZWJ1bGEubWljcm9zb2Z0LmNvbSIsImF1ZCI6Imh0dHA6Ly9zbWMiLCJleHA
>> iOjE2NzUxODUyODgsIm5iZiI6MTY2NzQwOTI4OH0.m0wKAbmuSvH-DNQCYnf9nB19YX3P_
>> B_OEwvYuia3_4IHN5tnEW9qs2KHp2Cru3V_AstjaKiUbKzgCuKhAtJwWjf6I7NpbBL2XUC
>> 6vzPSIclN_5B-ab0q67ZpOgr0Z-324Q79o7na8Iz5wtFOZCQMJkffn4zpMlArTITY77ArL
>> P-j3PTluUktmnd-u04celCsBPhYsyI76iUzWirA0xCX5kSXgMGi6dkWo5UHTSWgJJThg8f
>> rH4Ti6buz_6cS_8bwm1O_VSikLJ1k0nrDMc7l9QskL-BXW3dNwtkx_Y5rPZA2-anjWiCJ7
>> h1Z5qKoT7D9CAZZ812QvTW-7IWBRhBr_Q%26wid%3D9d9f5e2e-b121-4a7b-96fc-1520
>> 6147dc1d&data=05%7C01%7Cobaidf%40microsoft.com%7C5c84e9da370e437e3
>> c6008dabd0da75a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638030163
>> 594319800%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIi
>> LCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=W6Lh8LQCndh5F
>> 9JD8LEz79tcQEhBUPQaJ0QvoLHbnSk%3D&reserved=0
>>
>> Username: 2211010040007989_noemail at dtmxfer.onmicrosoft.com
>> Password: E at tM$$Sd
>>
>> The Compress/Decompress API is a wrapper around RtlCompressBuffer/RtlDecompressBuffer. It adds a 28-byte header which has info like the algorithm and engine used, the size of the original data, a checksum, etc. It also checks that it does not compress if the compressed size would be greater than the original, which you experienced with the 300-byte string from MS-XCA. After the first 28 bytes, the output of Compress and RtlCompressBuffer should be the same if Compress decides to compress.
>>
>> Regards,
>> Obaid Farooqi
>> Escalation Engineer | Microsoft
>>
>> -----Original Message-----
>> From: Kristian Smith <Kristian.Smith at microsoft.com>
>> Sent: Tuesday, November 1, 2022 5:33 PM
>> To: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
>> Cc: cifs-protocol at lists.samba.org
>> Subject: [MS-XCA] LZ77 + Huffman: is sometimes slightly more than 64k
>> encoded in as block? - TrackingID#2211010040007989
>>
>> [DocHelp to Bcc]
>>
>> Hi Douglas,
>>
>> Thank you for your request. The case number 2211010040007989 has been created for this inquiry. One of our team members will follow-up with you soon.
>>
>> Regards,
>>
>> Kristian Smith
>> Support Escalation Engineer
>> Windows Open Spec Protocols
>> Office: (425) 421-4442
>> kristian.smith at microsoft.com
>>
>>
>> -----Original Message-----
>> From: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
>> Sent: Tuesday, November 1, 2022 1:56 PM
>> To: cifs-protocol at lists.samba.org; Interoperability Documentation Help
>> <dochelp at microsoft.com>
>> Subject: [EXTERNAL] [MS-XCA] LZ77 + Huffman: is sometimes slightly more than 64k encoded in as block?
>>
>> hi Dochelp,
>>
>> Is it ever the case that slightly more than 65536 bytes are encoded as a single block (i.e., using one Huffman table)?
>>
>> I ask because I observe this behaviour with the user mode Windows Compression API, which I know is not covered by MS-XCA, but which purports to use the same algorithm.
>>
>> As a specific example, when compressing a string of 65537 (i.e. 64k + 1) zeros, I get the following result:
>>
>> The Huffman table is all zeros except bytes 0, 128, and 135, which are 0x02, 0x02, and 0x10 respectively.
>>
>> symbol  length  code       meaning
>> 0x00      2      10        literal zero
>> 0x100     2      11        EOF
>> 0x10f     1      0         match 1 back, length TBD (>17)
>>
>> The remaining bytes are 00 98 00 00 ff fd ff.
>> The 0x98 is 10-0-11-000, encoding literal zero, the 0x10f match, then EOF.
>>
>> The length for the match resolves to 0xfffd + 3, which is exactly 0x10000, or 65536.
>>
>> That all works very nicely, writing one zero, then copying it 65536 times for the result we want, but it breaks the rule that data is processed in 64k chunks.
>>
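The table and length arithmetic above can be reproduced with a short sketch. It assumes the usual MS-XCA conventions: two 4-bit code lengths per table byte with the low nibble first, canonical codes assigned in (length, symbol) order, and a length nibble of 15 escaping first to an extra byte and then, if that byte is 255, to a 16-bit little-endian value. The variable names are illustrative only.

```python
# Table bytes quoted above; every other byte of the 256-byte table is zero.
table = bytearray(256)
table[0], table[128], table[135] = 0x02, 0x02, 0x10

# Two 4-bit code lengths per byte, low nibble first.
lengths = [n for b in table for n in (b & 15, b >> 4)]

# Canonical codes: sort symbols by (length, symbol) and count upwards.
codes, code, prev = {}, 0, 0
for length, sym in sorted((l, s) for s, l in enumerate(lengths) if l):
    code <<= length - prev
    codes[sym] = format(code, "0{}b".format(length))
    code, prev = code + 1, length

# Symbol 0x10f has a low nibble of 15, so the length continues in the
# bytes that follow the first two 16-bit words of the bitstream.
extra = bytes.fromhex("ff fd ff")
assert extra[0] == 255                      # escape to a 16-bit length
match_length = int.from_bytes(extra[1:3], "little") + 3
total = 1 + match_length                    # one literal zero, then the match

print(codes)
print(hex(match_length), total)
```

This gives the codes 0 for symbol 0x10f, 10 for the literal zero and 11 for EOF, and a match length of 0x10000, so the single block decodes to 65537 bytes.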
>> As I read it, the way MS-XCA would handle this is to have two blocks. The first would look very much like the one described above, but with an 0xfc byte in place of 0xfd, indicating a total of 65536 zeros, and no EOF. The second would have only a single zero and EOF.
>>
>> MS-XCA 2.1.4.3 does seem to suggest > 65536 bytes in a block when it says:
>>
>>> Note that match distances cannot be larger than 65,535, and match lengths cannot be longer than 65,538.
>>
>> If the block is always 65536 or less, why mention lengths of 65,538?
>>
>> The follow up question is going to be: how can a decoder know when the block length is greater than 64k?
>>
>> cheers,
>> Douglas
>>
> 



