CIFS vs. NFS and other filesystems (was Client for Samba Networks)
Jeremy Allison
jra at samba.org
Tue Dec 18 11:01:05 GMT 2001
On Tue, Dec 18, 2001 at 12:52:22PM -0600, Steven French wrote:
>
> "Unicode was originally designed as a pure 16-bit encoding, aimed at
> representing all modern scripts. (Ancient scripts were to be
> represented with private-use characters.) Over time, and especially
> after the addition of over 14,500 composite characters for
> compatibility with legacy sets, it became clear that 16-bits were not
> sufficient for the user community. Out of this arose UTF-16.
>
>
> UTF-16 allows access to 63K characters as single Unicode 16-bit units.
> It can access an additional 1M characters by a mechanism known as
> surrogate pairs. Two ranges of Unicode code values are reserved for
> the high (first) and low (second) values of these pairs. Highs are
> from 0xD800 to 0xDBFF, and lows from 0xDC00 to 0xDFFF. In Unicode 3.0,
> there are no assigned surrogate pairs. Since the most common
> characters have already been encoded in the first 64K values, the
> characters requiring surrogate pairs will be relatively rare (see
> below)."
Multi-length characters suck. Period. They're hard to program, lead to
buffer overruns, and are hard to traverse in reverse...
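To make the reverse-traversal complaint concrete, here is a minimal sketch of what stepping backwards through UTF-16 involves, using the surrogate ranges from the quoted passage. The function names (`is_high`, `is_low`, `decode_pair`, `prev_char_start`) are made up for illustration and are not Samba code; the input is assumed to be a plain list of 16-bit values.

```python
# Surrogate ranges as given in the quoted passage.
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF


def is_high(unit):
    """True if the 16-bit unit is a high (first) surrogate."""
    return HIGH_FIRST <= unit <= HIGH_LAST


def is_low(unit):
    """True if the 16-bit unit is a low (second) surrogate."""
    return LOW_FIRST <= unit <= LOW_LAST


def decode_pair(high, low):
    """Combine a high/low surrogate pair into a single code point."""
    if not (is_high(high) and is_low(low)):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_FIRST) << 10) + (low - LOW_FIRST)


def prev_char_start(units, i):
    """Return the index where the character ending just before index i starts.

    With a fixed-width encoding this would simply be i - 1.
    """
    if i <= 0:
        raise IndexError("no character before index 0")
    j = i - 1
    # A low surrogate preceded by a high surrogate is a two-unit character,
    # so the step back has to cover both units.
    if j > 0 and is_low(units[j]) and is_high(units[j - 1]):
        j -= 1
    return j


# Hypothetical sample data: "A", one surrogate pair, "B".
units = [0x0041, 0xD83D, 0xDE00, 0x0042]
```

Walking back from the end of `units`, `prev_char_start` has to inspect two units at every step just to decide whether to move by one or by two. Forgetting that check is exactly the kind of slip that splits a pair or runs off a buffer.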
UCS-2 is at least a fixed-length encoding. That's why I was so cross
with Apple for adding a multi-length UCS-2 encoding to their version of
Samba :-(.
If 16 bits isn't enough, then go to 32 bits. 2^32 characters should be
enough so that every weird-ass language (including Klingon) that needs
composed character pairs can select a unique codepoint for each pairing...
Human languages just aren't that complex.
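As a rough illustration of the fixed-width argument, the sketch below compares how many code units the same string takes in UTF-16 and in a 32-bit encoding, and shows that the 32-bit form can be indexed by plain offset arithmetic. It uses Python's standard `utf-16-le` and `utf-32-le` codecs; the helper names `unit_counts` and `nth_char_utf32` are hypothetical.

```python
def unit_counts(text):
    """Return (number of 16-bit units, number of 32-bit units) for text."""
    utf16_units = len(text.encode("utf-16-le")) // 2
    utf32_units = len(text.encode("utf-32-le")) // 4
    return utf16_units, utf32_units


def nth_char_utf32(data, n):
    """Fetch character n from UTF-32-LE bytes by direct offset.

    No scanning is needed because every code point occupies four bytes.
    """
    return data[4 * n:4 * n + 4].decode("utf-32-le")


# Hypothetical sample: "A", one code point above U+FFFF, "B".
sample = "A\U0001F600B"
```

For `sample`, the UTF-16 form needs one more unit than there are code points, while the 32-bit form has exactly one unit per code point, so the last character sits at a known offset.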
Jeremy (variable-length-characters must die) Allison.