[jcifs] UTF-16 vs. UCS-2?

Tue Jul 30 04:14:19 EST 2002

Thanks!

A lot of very good info, and a lot of stuff I didn't quite 'get' until
your message.

I did read (on the unicode.org site, I think) that for the BMP (Basic 
Multilingual Plane) there really is no difference between UCS-2LE and 
UTF-16LE, which jibes with what you have below.  There was a question
as to whether the SNIA doc should say UTF-16LE or UCS-2LE.  From what
I see in your message, it probably doesn't matter.
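
A quick way to check the BMP claim, as a minimal sketch in Python. The
helper ucs2le_encode is hypothetical (written here only for illustration,
not part of jCIFS or Samba); it emits one fixed-width 16-bit little-endian
unit per character and is compared against Python's built-in "utf-16-le"
codec:

```python
def ucs2le_encode(text):
    """Encode text as fixed-width 16-bit little-endian units (UCS-2LE).

    Hypothetical helper for illustration; it rejects anything a single
    16-bit unit cannot represent as a character.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"U+{cp:04X} is not representable in UCS-2")
        out += cp.to_bytes(2, "little")
    return bytes(out)


sample = "smb \u00e9\u4e2d"  # ASCII, Latin-1 and CJK characters, all in the BMP
same = ucs2le_encode(sample) == sample.encode("utf-16-le")
print(same)
```

The two encodings only diverge once a character above 0xFFFF shows up,
which the helper refuses and UTF-16LE turns into a surrogate pair.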

Thanks again!

Chris -)-----

On Mon, Jul 29, 2002 at 01:28:59AM -0400, Michael B. Allen wrote:
> On Sun, 28 Jul 2002 22:51:57 -0500
> "Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:
> 
> > Mike, et al.,
> > 
> > Can someone explain the difference between UCS-2 and UTF-16?   The current 
> > SNIA doc says that the Unicode bit indicates the use of UTF-16, but I've 
> > been told that it should say that UCS-2LE is being used.
> > 
> > ...and I have no real idea what that means.
> 
> UCS-2 and UTF-16 are both Unicode character encodings. UCS-2 is simply a
> 16-bit value (a short) holding the Unicode code point. UTF-16 is a little
> different in that it has an extension mechanism: a range of 16-bit values
> is reserved for "surrogates", so that characters above 0xFFFF are
> represented by surrogate pairs (4 bytes for each char instead of 2).
> 
>   http://czyborra.com/utf/#UTF-16
> 
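
The surrogate mechanism described above can be sketched in a few lines of
Python. The function name to_surrogate_pair is made up for this example;
the arithmetic (subtract 0x10000, split the remaining 20 bits into two
10-bit halves, add them to 0xD800 and 0xDC00) is the standard UTF-16 rule:

```python
def to_surrogate_pair(cp):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    v = cp - 0x10000              # 20 bits remain after removing the offset
    high = 0xD800 | (v >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
pair_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")

# Cross-check against Python's own UTF-16BE encoder.
matches = pair_bytes == "\U0001D11E".encode("utf-16-be")
print(hex(high), hex(low), matches)
```
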
> UTF-8 uses a very similar mechanism but it's actually a little cleaner.
> 
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
> 
> And because both encodings use units of at least two bytes, they can be big
> endian or little endian. The proper identifiers are UCS-2LE, UCS-2BE, UCS-2,
> UTF-16LE, UTF-16BE, and UTF-16. The ones without the LE/BE use host byte
> order. Java uses UTF-16 with byte order marks. NFSv4 is UTF-8.
> 
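
To make the byte-order variants concrete, here is a small sketch using
Python's standard codecs. It shows how Python's "utf-16-le", "utf-16-be"
and "utf-16" codecs behave; it is not a statement about what CIFS or Java
put on the wire:

```python
import codecs
import sys

text = "A"  # U+0041

little = text.encode("utf-16-le")   # low byte first, no byte order mark
big = text.encode("utf-16-be")      # high byte first, no byte order mark

# Python's plain "utf-16" codec writes a BOM followed by host-order data.
native = little if sys.byteorder == "little" else big
with_bom = text.encode("utf-16")
bom_then_native = with_bom == codecs.BOM_UTF16 + native

# On decode, the BOM selects the byte order regardless of the host.
from_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")
print(little, big, bom_then_native, from_be, from_le)
```
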
> Windows NT is largely UTF-16LE, which has led a few people to believe by
> induction that CIFS is UTF-16LE. Samba folk say CIFS is UCS-2LE. I've never
> heard of any evidence in either direction. It could be that the UCS-2LE
> crowd is getting their information from the inside so they can't present
> "evidence".
> 
> None of this matters of course. CIFS will never need to represent
> characters above 0xFFFF, and if it tried it would go BSOD or reboot
> faster than you could say "Passport". See these in your httpd
> error_log?:
> 
>   _vti_bin/..%5c../..%5c../..%5c../winnt/system32/cmd.exe
>   /..%5c../..%5c../..%5c/..Á^\../..Á^\../..Á^\../winnt/system32/cmd.exe
> 
> These are overlong UTF-8 sequences that cause IIS to drop its pants. I
> know of one command that causes NT to BSOD and another that causes NT to
> reboot, and they both have to do with using Unicode.
> 
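
For anyone wondering why overlong sequences are a problem, here is a
minimal sketch in Python. Both function names are hypothetical. The naive
decoder just masks and shifts the bits of a 2-byte sequence without
checking that the result actually needed two bytes, which is the kind of
shortcut that lets a path separator slip past a filter; it is not meant to
reproduce what IIS actually did internally:

```python
def naive_decode_2byte(seq):
    """Decode a 2-byte UTF-8 sequence with no overlong check (unsafe)."""
    return chr(((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F))


def is_valid_utf8(seq):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong_slash = b"\xc0\xaf"      # overlong form of "/" (U+002F)
overlong_backslash = b"\xc1\x9c"  # overlong form of "\" (U+005C)

print(naive_decode_2byte(overlong_slash))      # the naive decoder yields "/"
print(naive_decode_2byte(overlong_backslash))  # the naive decoder yields "\"
print(is_valid_utf8(overlong_slash))           # a strict decoder rejects it
```

A decoder that insists on the shortest form for every character never
produces "/" or "\" from anything other than the single bytes 0x2F and
0x5C, so checks done on the raw bytes stay meaningful.
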
> -- 
> A  program should be written to model the concepts of the task it
> performs rather than the physical world or a process because this
> maximizes  the  potential  for it to be applied to tasks that are
> conceptually  similar and more importantly to tasks that have not
> yet been conceived. 

-- 
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org