CIFS vs. NFS and other filesystems (was Client for Samba Networks)

Jeremy Allison jra at samba.org
Tue Dec 18 11:01:05 GMT 2001


On Tue, Dec 18, 2001 at 12:52:22PM -0600, Steven French wrote:
>                                                                                
>         "Unicode was originally designed as a pure 16-bit encoding, aimed at   
>         representing all modern scripts. (Ancient scripts were to be           
>         represented with private-use characters.) Over time, and especially    
>         after the addition of over 14,500 composite characters for             
>         compatibility with legacy sets, it became clear that 16-bits were not  
>         sufficient for the user community. Out of this arose UTF-16.           
>                                                                                
>                                                                                
>         UTF-16 allows access to 63K characters as single Unicode 16-bit units. 
>         It can access an additional 1M characters by a mechanism known as      
>         surrogate pairs. Two ranges of Unicode code values are reserved for    
>         the high (first) and low (second) values of these pairs. Highs are     
>         from 0xD800 to 0xDBFF, and lows from 0xDC00 to 0xDFFF. In Unicode 3.0, 
>         there are no assigned surrogate pairs. Since the most common           
>         characters have already been encoded in the first 64K values, the      
>         characters requiring surrogate pairs will be relatively rare (see      
>         below)."                                                               
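
To make the surrogate-pair mechanism quoted above concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration; only the ranges (highs 0xD800 to 0xDBFF, lows 0xDC00 to 0xDFFF) come from the quoted text, and the 0x10000 offset is the standard UTF-16 rule.

```python
# Sketch only: hypothetical helper names, standard UTF-16 arithmetic.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (first) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (second) surrogates


def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000            # 20-bit value
    high = HIGH_START + (offset >> 10)       # top 10 bits
    low = LOW_START + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) pair into the original code point."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)
```

Each range holds 1024 values, so the pairs cover 1024 * 1024 = 1,048,576 code points, which is the "additional 1M characters" in the quote.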

Multi-length characters suck. Period. Hard to program, lead to buffer
overruns, hard to traverse in reverse....
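
A rough sketch of what "hard to traverse in reverse" means for UTF-16, assuming the string is held as a list of 16-bit code units. The helper names are hypothetical, not anything from Samba. With a fixed-length encoding, stepping back one character is just index - 1; here every step has to look at neighbouring units first.

```python
import struct

# Sketch only: hypothetical helpers, not Samba code.


def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def prev_char_start(units, index):
    """Return the start index of the character ending just before index."""
    if index <= 0:
        raise IndexError("already at the start of the buffer")
    i = index - 1
    # If we landed on a low surrogate preceded by a high surrogate,
    # the character is two units wide, so step back once more.
    if (0xDC00 <= units[i] <= 0xDFFF and i > 0
            and 0xD800 <= units[i - 1] <= 0xDBFF):
        i -= 1
    return i
```

For the three-character string "a", U+1D11E, "b" the unit list has four entries, so sizing a buffer by character count rather than unit count comes up one unit short, which is the kind of mistake that leads to overruns.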

UCS-2 is at least a fixed-length encoding. That's why I was so cross
with Apple for adding a multi-length UCS-2 encoding in their version of
Samba :-(.

If 16 bits isn't enough, then go to 32 bits. 2^32 characters should be
enough so that every weird-ass language (including Klingon) that needs
composed character pairs can select a unique codepoint for each pairing...
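
As a sketch of the fixed-width argument, assuming a 32-bit little-endian encoding (UTF-32LE here; the helper name is hypothetical): every character occupies exactly four bytes, so the n-th character sits at byte offset 4 * n and no scanning is needed.

```python
import struct

# Sketch only: hypothetical helper, assumes well-formed UTF-32LE input.


def utf32_char_at(data, n):
    """Return the n-th character of UTF-32LE bytes by direct indexing."""
    (code_point,) = struct.unpack_from("<I", data, 4 * n)
    return chr(code_point)


def utf32_length(data):
    """Character count is simply the byte count divided by four."""
    return len(data) // 4
```

The same string that needs a surrogate pair in UTF-16 is indexed here exactly like plain ASCII, at the cost of four bytes per character.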

Human languages just aren't that complex.

Jeremy (variable-length-characters must die) Allison.



