[jcifs] UTF-16 vs. UCS-2?
Christopher R. Hertel
crh at ubiqx.mn.org
Tue Jul 30 04:14:19 EST 2002
Thanks!
A lot of very good info, and a lot of stuff I didn't quite 'get' until
your message.
I did read (on the unicode.org site, I think) that for the BMP (Basic
Multilingual Plane) there really is no difference between UCS-2LE and
UTF-16LE, which jibes with what you have below. There was a question
as to whether the SNIA doc should say UTF-16LE or UCS-2LE. From what
I see in your message, it probably doesn't matter.
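A minimal sketch of that BMP equivalence, assuming Python 3. Python ships no UCS-2 codec, so ucs2le_encode below is a hypothetical helper written only for this illustration:

```python
import struct

def ucs2le_encode(text):
    """Encode text as UCS-2LE: one little-endian 16-bit unit per character.

    Hypothetical helper for illustration; not part of any library.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no surrogate mechanism, so nothing above the BMP fits.
            raise ValueError("U+%04X is outside the BMP" % cp)
        out += struct.pack("<H", cp)
    return bytes(out)

name = "r\u00e9sum\u00e9.txt"  # a BMP-only file name
# For BMP-only text the two encodings produce identical bytes.
same = ucs2le_encode(name) == name.encode("utf-16-le")
```

The two only diverge once a character above 0xFFFF shows up: UTF-16LE emits a surrogate pair, while the helper above refuses the input.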
Thanks again!
Chris -)-----
On Mon, Jul 29, 2002 at 01:28:59AM -0400, Michael B. Allen wrote:
> On Sun, 28 Jul 2002 22:51:57 -0500
> "Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:
>
> > Mike, et al.,
> >
> > Can someone explain the difference between UCS-2 and UTF-16? The current
> > SNIA doc says that the Unicode bit indicates the use of UTF-16, but I've
> > been told that it should say that UCS-2LE is being used.
> >
> > ...and I have no real idea what that means.
>
> UCS-2 and UTF-16 are both Unicode character encodings. UCS-2 is simply a
> 16-bit value (a short) holding the Unicode code point. UTF-16 differs in
> that it has an extension mechanism: a range of 16-bit values (0xD800 to
> 0xDFFF) is reserved for "surrogates", so that characters above 0xFFFF are
> represented by surrogate pairs (4 bytes for each char instead of 2).
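The surrogate arithmetic can be sketched as follows, assuming Python 3. The function name to_surrogates is made up for this example; the constants are the standard UTF-16 ones:

```python
def to_surrogates(cp):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    v = cp - 0x10000              # leaves a 20-bit value
    high = 0xD800 | (v >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) becomes the pair D834 DD1E.
pair = to_surrogates(0x1D11E)
```

Each half is an ordinary 16-bit unit, which is why the pair costs 4 bytes on the wire.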
>
> http://czyborra.com/utf/#UTF-16
>
> UTF-8 uses a very similar mechanism but it's actually a little cleaner.
>
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>
> And because both encodings are at least two bytes they can be big endian or
> little endian. The proper identifiers are UCS-2LE, UCS-2BE, UCS-2,
> UTF-16LE, UTF-16BE, and UTF-16. The ones without the LE/BE are host
> endian'd. Java uses UTF-16 with byte order marks. NFSv4 is UTF-8.
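A small sketch of the byte-order variants, assuming Python 3 and its built-in codec names (utf-16-le, utf-16-be, utf-16). The bare utf-16 codec prepends a byte order mark when encoding and uses it to pick the order when decoding:

```python
import codecs

text = "A"  # U+0041

le = text.encode("utf-16-le")      # low byte first, no BOM
be = text.encode("utf-16-be")      # high byte first, no BOM
with_bom = text.encode("utf-16")   # BOM, then the platform's native order

# The BOM tells the decoder which order follows, so both of these round-trip.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
```

Without the BOM the receiver has to know the byte order in advance, which is why a wire protocol needs to name the LE or BE variant explicitly.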
>
> Windows NT is largely UTF-16LE, which has led a few people to believe by
> induction that CIFS is UTF-16LE. Samba folk say CIFS is UCS-2LE. I've never
> heard of any evidence in either direction. It could be that the UCS-2LE
> crowd is getting their information from the inside so they can't present
> "evidence".
>
> None of this matters of course. CIFS will never need to represent
> characters above 0xFFFF and if it tried it would go BSOD or
> reboot faster than you could say "Passport". See these in your httpd
> error_log?:
>
> _vti_bin/..%5c../..%5c../..%5c../winnt/system32/cmd.exe
> /..%5c../..%5c../..%5c/..Á^\../..Á^\../..Á^\../winnt/system32/cmd.exe
>
> These are overlong UTF-8 sequences that cause IIS to drop its pants. I
> know of one command that causes NT to BSOD and another that causes NT to
> reboot and they both have to do with using Unicode.
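As a rough illustration of what "overlong" means, assuming Python 3: the bytes C0 AF are a two-byte spelling of "/" (0x2F), which a strict UTF-8 decoder has to reject because the one-byte form exists. The helper name is_strict_utf8 is invented for this sketch and says nothing about how IIS is implemented:

```python
def is_strict_utf8(data):
    """Return True only if data is well-formed UTF-8 (no overlong forms)."""
    try:
        data.decode("utf-8")   # Python's decoder is strict by default
        return True
    except UnicodeDecodeError:
        return False

plain = b"../"                    # ordinary one-byte slash
overlong_slash = b"..\xc0\xaf"    # overlong two-byte encoding of "/"
overlong_backslash = b"..\xc1\x9c"  # overlong two-byte encoding of "\"

plain_ok = is_strict_utf8(plain)
slash_ok = is_strict_utf8(overlong_slash)
backslash_ok = is_strict_utf8(overlong_backslash)
```

A lenient decoder that accepts such bytes ends up with a path separator that an earlier check on the raw bytes never saw, which is the general shape of the traversal requests quoted above.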
>
> --
> A program should be written to model the concepts of the task it
> performs rather than the physical world or a process because this
> maximizes the potential for it to be applied to tasks that are
> conceptually similar and more importantly to tasks that have not
> yet been conceived.
--
Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org