[cifs-protocol] [MS-XCA] LZ77 + Huffman: is sometimes slightly more than 64k encoded in as block? - TrackingID#2211010040007989

Wed Nov 2 17:23:25 UTC 2022

[Kristian to Bcc]
Hi Douglas:
I will help you with this issue and will be in touch as soon as possible.
Is it possible to send me the file you're compressing? If so, please upload it to the following link:

https://support.microsoft.com/files?workspace=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ3c2lkIjoiOWQ5ZjVlMmUtYjEyMS00YTdiLTk2ZmMtMTUyMDYxNDdkYzFkIiwic3IiOiIyMjExMDEwMDQwMDA3OTg5IiwiYXBwaWQiOiI0ZTc2ODkxZC04NDUwLTRlNWUtYmUzOC1lYTNiZDZlZjIxZTUiLCJzdiI6InYxIiwicnMiOiJFeHRlcm5hbCIsInd0aWQiOiIyMzYwODJlMS02ODdhLTQ1NWYtYTRkZi1hNTM4ZWJhYzM0OWUiLCJpc3MiOiJodHRwczovL2FwaS5kdG1uZWJ1bGEubWljcm9zb2Z0LmNvbSIsImF1ZCI6Imh0dHA6Ly9zbWMiLCJleHAiOjE2NzUxODUyODgsIm5iZiI6MTY2NzQwOTI4OH0.m0wKAbmuSvH-DNQCYnf9nB19YX3P_B_OEwvYuia3_4IHN5tnEW9qs2KHp2Cru3V_AstjaKiUbKzgCuKhAtJwWjf6I7NpbBL2XUC6vzPSIclN_5B-ab0q67ZpOgr0Z-324Q79o7na8Iz5wtFOZCQMJkffn4zpMlArTITY77ArLP-j3PTluUktmnd-u04celCsBPhYsyI76iUzWirA0xCX5kSXgMGi6dkWo5UHTSWgJJThg8frH4Ti6buz_6cS_8bwm1O_VSikLJ1k0nrDMc7l9QskL-BXW3dNwtkx_Y5rPZA2-anjWiCJ7h1Z5qKoT7D9CAZZ812QvTW-7IWBRhBr_Q&wid=9d9f5e2e-b121-4a7b-96fc-15206147dc1d

Username: 2211010040007989_noemail at dtmxfer.onmicrosoft.com
Password: E at tM$$Sd

The Compress/Decompress API is a wrapper around RtlCompressBuffer/RtlDecompressBuffer. It adds a 28-byte header containing information such as the algorithm and engine used, the size of the original data, a checksum, etc. It also checks whether the compressed size would be greater than the original and skips compression in that case, which is what you experienced with the 300-byte string from MS-XCA. After the first 28 bytes, the output of Compress and RtlCompressBuffer should be the same if Compress decides to compress.
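
As a rough illustration of the point above, here is a minimal Python sketch that separates the 28-byte header from the rest of a Compress output buffer, so the remainder can be compared against RtlCompressBuffer output. Only the header size comes from this message; the function name is hypothetical, and the header fields are deliberately left unparsed because their layout is not given here.

```python
HEADER_SIZE = 28  # header size stated in this thread; field layout is not described


def split_compress_output(buf: bytes) -> tuple[bytes, bytes]:
    """Return (header, payload) for a buffer produced by the Compress API.

    Hypothetical helper: it only slices the buffer and does not interpret
    the header (algorithm, engine, original size, checksum, ...).
    """
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer is shorter than the 28-byte header")
    return buf[:HEADER_SIZE], buf[HEADER_SIZE:]
```

If Compress chose not to compress, the payload would presumably not be an RtlCompressBuffer stream at all, so a byte-for-byte comparison only makes sense when compression actually happened.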

Regards,
Obaid Farooqi
Escalation Engineer | Microsoft

-----Original Message-----
From: Kristian Smith <Kristian.Smith at microsoft.com> 
Sent: Tuesday, November 1, 2022 5:33 PM
To: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
Cc: cifs-protocol at lists.samba.org
Subject: [MS-XCA] LZ77 + Huffman: is sometimes slightly more than 64k encoded in as block? - TrackingID#2211010040007989

[DocHelp to Bcc]

Hi Douglas,

Thank you for your request. The case number 2211010040007989 has been created for this inquiry. One of our team members will follow up with you soon.

Regards,

Kristian Smith
Support Escalation Engineer
Windows Open Spec Protocols
Office: (425) 421-4442
kristian.smith at microsoft.com 

-----Original Message-----
From: Douglas Bagnall <douglas.bagnall at catalyst.net.nz> 
Sent: Tuesday, November 1, 2022 1:56 PM
To: cifs-protocol at lists.samba.org; Interoperability Documentation Help <dochelp at microsoft.com>
Subject: [EXTERNAL] [MS-XCA] LZ77 + Huffman: is sometimes slightly more than 64k encoded in as block?

hi Dochelp,

Is it ever the case that slightly more than 65536 bytes are encoded as a single block (i.e., using one Huffman table)?

I ask because I observe this behaviour with the user mode Windows Compression API, which I know is not covered by MS-XCA, but which purports to use the same algorithm.

As a specific example, when compressing a string of 65537 (i.e. 64k + 1) zeros, I get the following result:

The Huffman table is all zeros except bytes 0, 128, and 135, which are 0x02, 0x02, and 0x10 respectively.

symbol  code length (bits)  Huffman code  meaning
0x00    2                   10            literal zero
0x100   2                   11            EOF
0x10f   1                   0             match 1 back, length TBD (>17)

The remaining bytes are 00 98 00 00 ff fd ff.
The 0x98 is 10-0-11-000, encoding literal zero, the 0x10f match, then EOF.
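
The table and bit pattern described above can be checked with a short Python sketch. It assumes the usual MS-XCA conventions: each table byte holds two 4-bit code lengths (low nibble for symbol 2i, high nibble for symbol 2i+1), codes are assigned canonically in order of (length, symbol), and the bitstream is read as little-endian 16-bit words, most significant bit first. The function names are illustrative only, and this is not a full decoder: a real one also pulls the match's extra length bytes from the byte stream.

```python
def nibble_lengths(table: bytes) -> dict[int, int]:
    """Unpack a 256-byte table into {symbol: code length}, skipping unused symbols."""
    lengths = {}
    for i, byte in enumerate(table):
        low, high = byte & 0x0F, byte >> 4  # low nibble: symbol 2i, high: symbol 2i+1
        if low:
            lengths[2 * i] = low
        if high:
            lengths[2 * i + 1] = high
    return lengths


def canonical_codes(lengths: dict[int, int]) -> dict[int, str]:
    """Assign canonical Huffman codes, shortest first, ties broken by symbol value."""
    codes = {}
    code = 0
    prev_len = 0
    for symbol, n in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= n - prev_len
        codes[symbol] = format(code, f"0{n}b")
        code += 1
        prev_len = n
    return codes


def decode_symbols(bits: str, codes: dict[int, str], count: int) -> list[int]:
    """Decode up to `count` symbols from a string of '0'/'1' characters."""
    by_code = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:
            out.append(by_code[current])
            current = ""
            if len(out) == count:
                break
    return out


# The table from the example: all zeros except bytes 0, 128 and 135.
table = bytearray(256)
table[0], table[128], table[135] = 0x02, 0x02, 0x10
codes = canonical_codes(nibble_lengths(bytes(table)))

# The first 16-bit word after the table is the byte pair 00 98, i.e. 0x9800.
first_word = int.from_bytes(bytes([0x00, 0x98]), "little")
symbols = decode_symbols(format(first_word, "016b"), codes, 3)
```

Under those assumptions, `codes` maps 0x10f to "0", 0x00 to "10" and 0x100 to "11", and `symbols` comes out as literal zero, the 0x10f match, then EOF, in agreement with the reading above.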

The length for the match resolves to 0xfffd + 3, which is exactly 0x10000, or 65536.
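
The length arithmetic can be sketched as follows, assuming the MS-XCA extension rule as I understand it: a length nibble of 15 is followed by one extra byte, and an extra byte of 255 is followed by a little-endian 16-bit value that replaces the running length, with 3 added at the end in every case. The function name is hypothetical, and any wider length form is left out of this sketch.

```python
def decode_match_length(nibble: int, extra: bytes) -> tuple[int, int]:
    """Return (match length, number of extra bytes consumed).

    `nibble` is the low 4 bits of the match symbol; `extra` holds the bytes
    that follow in the byte stream. Sketch only: handles the one-byte and
    16-bit extensions.
    """
    length = nibble
    used = 0
    if nibble == 15:
        length = 15 + extra[0]
        used = 1
        if extra[0] == 255:
            length = int.from_bytes(extra[1:3], "little")
            used = 3
            if length < 15:
                raise ValueError("16-bit length below 15 is invalid")
    return length + 3, used


# The bytes ff fd ff from the example, following symbol 0x10f (nibble 0xf).
example_length, consumed = decode_match_length(0x0F, bytes([0xFF, 0xFD, 0xFF]))
```

With these rules the example bytes give 0xfffd + 3 = 65536, and the largest value the 16-bit form can express is 0xffff + 3 = 65538.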

That all works very nicely, writing one zero, then copying it 65536 times for the result we want, but it breaks the rule that data is processed in 64k chunks.

As I read it, the way MS-XCA would handle this is to have two blocks. The first would look very much like the one described above, but with an 0xfc byte in place of 0xfd, indicating a total of 65536 zeros, and no EOF. The second would have only a single zero and EOF.
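
To make the two-block reading concrete, here is a small Python sketch of the arithmetic. It assumes a fixed 65536-byte chunk size and the 16-bit length form (stored value = match length minus 3, little-endian); both helper names are made up for illustration.

```python
CHUNK = 65536  # assumed block size: data is processed in 64k chunks


def chunk_sizes(total: int) -> list[int]:
    """Sizes of the blocks a strict 64k chunker would produce for `total` bytes."""
    return [min(CHUNK, total - offset) for offset in range(0, total, CHUNK)]


def length_field_16(match_length: int) -> bytes:
    """Little-endian 16-bit length field for a long match (sketch only).

    Valid only for lengths that need the 16-bit form and fit in it.
    """
    stored = match_length - 3
    if not 15 + 255 <= stored <= 0xFFFF:
        raise ValueError("length does not use the 16-bit form")
    return stored.to_bytes(2, "little")


sizes = chunk_sizes(65537)             # 64k + 1 zeros
one_block = length_field_16(65536)     # observed: 1 literal + 65536-byte match
first_of_two = length_field_16(65535)  # expected: 1 literal + 65535-byte match
```

Here `sizes` is [65536, 1], `one_block` is the byte pair fd ff seen in the Windows output, and `first_of_two` is fc ff, the value the first of two blocks would carry.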

MS-XCA 2.1.4.3 does seem to suggest > 65536 bytes in a block when it says:

> Note that match distances cannot be larger than 65,535, and match lengths
> cannot be longer than 65,538.

If the block is always 65536 or less, why mention lengths of 65,538?

The follow-up question is going to be: how can a decoder know when the block length is greater than 64k?

cheers,
Douglas