[jcifs] UTF-16 vs. UCS-2?
Michael B. Allen
miallen at eskimo.com
Mon Jul 29 15:28:59 EST 2002
On Sun, 28 Jul 2002 22:51:57 -0500
"Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:
> Mike, et. al.,
>
> Can someone explain the difference between UCS-2 and UTF-16? The current
> SNIA doc says that the Unicode bit indicates the use of UTF-16, but I've
> been told that it should say that UCS-2LE is being used.
>
> ...and I have no real idea what that means.
UCS-2 and UTF-16 are both Unicode character encodings. UCS-2 is simply a
16-bit short holding the Unicode value. UTF-16 is a little different in that
it has an extension mechanism: the 16-bit values 0xD800 through 0xDFFF are
reserved as "surrogates", so that characters above 0xFFFF are represented
by surrogate pairs (4 bytes for each such char instead of 2).
http://czyborra.com/utf/#UTF-16
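To make the surrogate arithmetic concrete, here is a rough sketch in Python 3. The helper name to_surrogates is made up for illustration, and the example character (U+1D11E) is an arbitrary pick; nothing here is taken from CIFS or jcifs.

```python
import struct

def to_surrogates(cp):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need surrogates")
    v = cp - 0x10000             # leaves a 20-bit value
    high = 0xD800 | (v >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits go in the low surrogate
    return high, low

high, low = to_surrogates(0x1D11E)   # U+1D11E, MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))           # 0xd834 0xdd1e

# Cross-check against Python's own UTF-16LE encoder: two 16-bit units, 4 bytes.
assert struct.pack("<HH", high, low) == "\U0001D11E".encode("utf-16-le")
```

A character at or below 0xFFFF never takes this path, so for that range UCS-2 and UTF-16 produce the same two bytes.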
UTF-8 uses a very similar mechanism but it's actually a little cleaner.
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
And because both encodings are at least two bytes they can be big endian or
little endian. The proper identifiers are UCS-2LE, UCS-2BE, UCS-2,
UTF-16LE, UTF-16BE, and UTF-16. The ones without the LE/BE are host
endian'd. Java is UTF-16 with byte order marks. NFSv4 is UTF-8.
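The byte-order variants are easy to see with Python 3's built-in codecs, used here only as a stand-in. Note that Python has no "ucs-2" codec, so this sketch sticks to the UTF-16 names; for a character below 0xFFFF the bytes would be the same either way.

```python
text = "A"  # U+0041

little = text.encode("utf-16-le")    # low byte first, no byte order mark
big = text.encode("utf-16-be")       # high byte first, no byte order mark
with_bom = text.encode("utf-16")     # byte order mark, then host byte order

def hexdump(data):
    """Render bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)

print("utf-16-le:", hexdump(little))    # 41 00
print("utf-16-be:", hexdump(big))       # 00 41
print("utf-16:   ", hexdump(with_bom))  # depends on the host

# When decoding plain "utf-16", the byte order mark tells the decoder
# which order follows, so both of these come back as "A".
assert b"\xff\xfeA\x00".decode("utf-16") == "A"
assert b"\xfe\xff\x00A".decode("utf-16") == "A"
```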
Windows NT is largely UTF-16LE which has led a few people to believe by
induction that CIFS is UTF-16LE. Samba folk say CIFS is UCS-2LE. I've never
heard of any evidence in either direction. It could be that the UCS-2LE
crowd is getting their information from the inside so they can't present
"evidence".
None of this matters, of course. CIFS will never need to represent
characters above 0xFFFF, and if it tried it would go BSOD or
reboot faster than you could say "Passport". See these in your httpd
error_log?:
_vti_bin/..%5c../..%5c../..%5c../winnt/system32/cmd.exe
/..%5c../..%5c../..%5c/..Á^\../..Á^\../..Á^\../winnt/system32/cmd.exe
These are overlong UTF-8 sequences that cause IIS to drop its pants. I
know of one command that causes NT to BSOD and another that causes NT to
reboot and they both have to do with using Unicode.
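For anyone wondering why an overlong sequence is a problem, here is a small Python 3 sketch of the decoding arithmetic. The function names are invented for illustration, this is not a description of what IIS actually does internally, and reading the "Á^\" in the log line above as the bytes 0xC1 0x1C is my assumption.

```python
def naive_decode_2byte(b0, b1):
    """Decode a 2-byte UTF-8 sequence with no validity checks at all."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)

def strict_rejects(data):
    """Return True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

# 0xC0 0xAF is an overlong spelling of '/' (0x2F).
assert naive_decode_2byte(0xC0, 0xAF) == ord("/")
# 0xC1 0x9C is an overlong spelling of '\' (0x5C).
assert naive_decode_2byte(0xC1, 0x9C) == ord("\\")
# 0xC1 0x1C is not even well-formed, yet the same arithmetic also yields '\'.
assert naive_decode_2byte(0xC1, 0x1C) == ord("\\")

# A conforming decoder refuses all of these outright.
assert strict_rejects(bytes([0xC0, 0xAF]))
assert strict_rejects(bytes([0xC1, 0x9C]))
assert strict_rejects(bytes([0xC1, 0x1C]))
```

The point is that a path check looking for a literal "/" or "\" byte never sees one, while a sloppy decoder run afterwards produces it anyway.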
--
A program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes the potential for it to be applied to tasks that are
conceptually similar and more importantly to tasks that have not
yet been conceived.