CIFS vs. NFS and other filesystems (was Client for Samba Networks)

Tue Dec 18 11:25:02 GMT 2001

On Tuesday 18 December 2001 02:00 pm, Jeremy Allison wrote:
> ...
> Multi-length characters suck. Period. Hard to program, lead to buffer
> overruns, hard to traverse in reverse....

Actually, they aren't that hard to deal with; the main issue with
copies is to avoid copying a fragment of a UTF-8 sequence.  You can
find the first byte of a UTF-8 encoded character by scanning
backwards until the top bit is 0 or the top two bits are 11
(continuation bytes always have the top two bits set to 10).
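
As a minimal sketch of that in C (the helper names utf8_char_start
and utf8_safe_copy are made up for illustration, they are not from
Samba or any library, and the input is assumed to be well-formed
UTF-8), the backwards scan and a copy that never leaves a partial
sequence might look like this:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: step back from s[pos] to the first byte of the
 * character that contains it.  Continuation bytes look like 10xxxxxx,
 * so we stop at a byte that is 0xxxxxxx or 11xxxxxx. */
static size_t utf8_char_start(const char *s, size_t pos)
{
    while (pos > 0 && ((unsigned char)s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}

/* Hypothetical helper: copy src into dst (dstsize bytes, including the
 * terminating NUL) without cutting a UTF-8 sequence in half.
 * Returns the number of bytes copied, not counting the NUL. */
static size_t utf8_safe_copy(char *dst, size_t dstsize, const char *src)
{
    size_t len = strlen(src);

    if (dstsize == 0)
        return 0;

    if (len >= dstsize) {
        /* src[dstsize - 1] is the first byte that does not fit; cut at
         * the start of the character that byte belongs to. */
        len = utf8_char_start(src, dstsize - 1);
    }

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With a 3-byte buffer, copying "h\xC3\xA9llo" this way keeps just "h"
rather than "h" plus a dangling 0xC3 lead byte.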

If you want to do anything like tolower(), etc., then you (obviously)
have to decode the current UTF-8 sequence, but that is trivial
compared to providing the 16+ bit lookup tables (if that's how you do
it) for all of those functions/macros.
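
To show how little work the decode step is, here is a rough sketch
in C.  The function name utf8_decode is hypothetical, and this
version only handles sequences of 1 to 4 bytes; it checks the
continuation bytes but does not reject overlong encodings or
surrogates, which a production decoder should do.

```c
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with at
 * most n bytes available.  Stores the code point in *cp and returns the
 * number of bytes consumed, or 0 if the sequence is invalid or
 * truncated.  Overlong forms and surrogates are NOT rejected here. */
static size_t utf8_decode(const unsigned char *s, size_t n,
                          unsigned long *cp)
{
    size_t len, i;
    unsigned long c;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        len = 2;
        c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        len = 3;
        c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        len = 4;
        c = s[0] & 0x07;
    } else {
        return 0;                         /* stray continuation or bad lead */
    }

    if (len > n)
        return 0;                         /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    *cp = c;
    return len;
}
```

Once you have the code point, case mapping and the like are a table
or function lookup on that value, same as with a fixed-width
encoding.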

What *really* sucks is the 2x (or 4x) overhead you pay with UCS-2
and UCS-4 for mostly-ASCII text - it impacts the amount of memory
that applications need (bloat bloat bloat), the amount of bandwidth
used, all of the standard functions people use, etc.  UTF-8 (and
other variable-length encodings) can also represent Unicode code
points above 64k, which UCS-2 cannot, and that is a big reason why
the IETF is pushing UTF-8 instead of UCS-2 to support Unicode.

-- 
______________________________________________________________________
Michael Sweet, Easy Software Products                  mike at easysw.com
Printing Software for UNIX                       http://www.easysw.com