[jcifs] UTF-16 vs. UCS-2?

Mon Jul 29 15:28:59 EST 2002

On Sun, 28 Jul 2002 22:51:57 -0500
"Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:

> Mike, et al.,
> 
> Can someone explain the difference between UCS-2 and UTF-16?   The current 
> SNIA doc says that the Unicode bit indicates the use of UTF-16, but I've 
> been told that it should say that UCS-2LE is being used.
> 
> ...and I have no real idea what that means.

UCS-2 and UTF-16 are both Unicode character encodings. UCS-2 is simply a
16-bit value (a short) holding the Unicode code point, so it cannot go past
0xFFFF. UTF-16 is a little different in that it has an extension mechanism:
the range 0xD800-0xDFFF is reserved for "surrogates" so that characters
above 0xFFFF are represented by surrogate pairs (4 bytes for each char
instead of 2).

  http://czyborra.com/utf/#UTF-16
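
To make the surrogate mechanism concrete, here is a minimal Java sketch of
how one code point turns into one or two 16-bit units. The class and method
names are made up for illustration; the JDK already does this for you in
Character.toChars().

```java
class SurrogateSketch {
    // Encode one Unicode code point as UTF-16 code units (one or two chars).
    // Hypothetical helper for illustration only.
    static char[] toUtf16(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        if (codePoint < 0x10000) {
            // Fits in 16 bits: identical to what UCS-2 would store.
            return new char[] { (char) codePoint };
        }
        int v = codePoint - 0x10000;                // 20 bits remain
        char high = (char) (0xD800 | (v >> 10));    // top 10 bits
        char low  = (char) (0xDC00 | (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above 0xFFFF.
        for (char c : toUtf16(0x1D11E)) {
            System.out.printf("%04X ", (int) c);    // prints D834 DD1E
        }
        System.out.println();
    }
}
```

A UCS-2 consumer that sees those two units has no way to know they belong
together, which is the practical difference between the two encodings.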

UTF-8 uses a very similar mechanism but it's actually a little cleaner.

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

And because both encodings use units of two bytes they can be big endian or
little endian. The proper identifiers are UCS-2LE, UCS-2BE, UCS-2,
UTF-16LE, UTF-16BE, and UTF-16. The ones without the LE/BE are host
endian'd. Java is UTF-16 with byte order marks. NFSv4 is UTF-8.
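
A quick way to see the byte order variants is to encode the same string
with the JDK's standard charsets. This is only a sketch (the class name and
the hex() helper are made up); note that in Java specifically, encoding
with plain "UTF-16" writes a big-endian byte order mark first.

```java
import java.nio.charset.StandardCharsets;

class Utf16ByteOrderSketch {
    // Hypothetical helper: render bytes as space-separated hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "A"; // U+0041
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```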

Windows NT is largely UTF-16LE, which has led a few people to believe by
induction that CIFS is UTF-16LE. Samba folk say CIFS is UCS-2LE. I've never
heard of any evidence in either direction. It could be that the UCS-2LE
crowd is getting their information from the inside so they can't present
"evidence".

None of this matters, of course. CIFS will never need to represent
characters above 0xFFFF and if it tried it would go BSOD or reboot faster
than you could say "Passport". See these in your httpd error_log?

  _vti_bin/..%5c../..%5c../..%5c../winnt/system32/cmd.exe
  /..%5c../..%5c../..%5c/..Á^\../..Á^\../..Á^\../winnt/system32/cmd.exe

These are overlong UTF-8 sequences that cause IIS to drop its pants. I
know of one command that causes NT to BSOD and another that causes NT to
reboot and they both have to do with using Unicode.
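
For anyone wondering what "overlong" means in practice, here is a rough
Java sketch. It uses the textbook overlong pair 0xC0 0xAF (an illegal
two-byte spelling of '/'), not the exact bytes from the log lines above,
and the class and method names are made up. A naive decoder happily turns
the pair into a slash; a strict decoder is supposed to reject it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class OverlongUtf8Sketch {
    // Naive two-byte decode: never checks that two bytes were really needed.
    static int naiveDecode2(int b0, int b1) {
        return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    }

    // A two-byte sequence is overlong when its value would fit in one byte.
    static boolean isOverlong2(int b0, int b1) {
        return naiveDecode2(b0, b1) < 0x80;
    }

    // Strict decode with the JDK's UTF-8 decoder; malformed input is reported.
    static boolean strictAccepts(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] overlongSlash = { (byte) 0xC0, (byte) 0xAF };
        System.out.println((char) naiveDecode2(0xC0, 0xAF)); // prints /
        System.out.println(isOverlong2(0xC0, 0xAF));         // true
        System.out.println(strictAccepts(overlongSlash));    // false
    }
}
```

The trouble starts when a path check runs on the raw bytes and a lenient
decode happens afterwards, so the "../" only appears after the check.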

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and more importantly to tasks that have not
yet been conceived.