CIFS vs. NFS and other filesystems (was Client for Samba Networks)

Tue Dec 18 11:31:03 GMT 2001

On Tue, Dec 18, 2001 at 02:24:15PM -0500, Michael Sweet wrote:
> 
> Actually, they aren't that hard to deal with; the main issue with
> copies is to avoid copying a fragment of a UTF-8 sequence.  You can find
> the first byte of a UTF-8 encoded character by scanning backwards for
> a byte whose top bit is 0 or whose top two bits are 11.
> 
> If you want to do anything like tolower(), etc., then you (obviously)
> have to decode the current UTF-8 sequence, but that is trivial
> compared to providing the 16+ bit lookup tables (if that's how you do
> it) for all of those functions/macros.
> 
> What *really* sucks is the 2x (or 4x) overhead you pay with UCS-2
> and UCS-4 - it impacts the amount of memory that applications need
> (bloat bloat bloat), the amount of bandwidth used, all of the
> standard functions people use, etc.  UTF-8 (and other variable-length
> encodings) can support Unicode > 64k, too, which is a big reason
> why the IETF is pushing UTF-8 instead of UCS-2 to support Unicode.
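The backward scan described in the quoted text can be sketched in C roughly as follows. This is only an illustration: it assumes the input is valid UTF-8, and the names utf8_char_start() and utf8_copy() are hypothetical, not taken from Samba or any other library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the index of the first byte of the UTF-8 character that
 * contains s[pos].  Continuation bytes look like 10xxxxxx, so step
 * back while the top two bits are 10; the scan stops on 0xxxxxxx
 * (ASCII) or 11xxxxxx (lead byte). */
static size_t utf8_char_start(const unsigned char *s, size_t pos)
{
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}

/* Copy src into dst (dstsize bytes, including the terminating NUL)
 * without leaving a fragment of a UTF-8 sequence at the end.
 * Returns the number of bytes copied, not counting the NUL. */
static size_t utf8_copy(char *dst, size_t dstsize, const char *src)
{
    size_t len = strlen(src);

    assert(dstsize > 0);
    if (len >= dstsize) {
        /* src[dstsize - 1] is the first byte that does not fit.  If it
         * is a continuation byte, back up to the start of that
         * character and cut there instead. */
        len = utf8_char_start((const unsigned char *)src, dstsize - 1);
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

For example, copying "h\xc3\xa9llo" into a 3-byte buffer keeps only "h", because the two-byte sequence for the second character would otherwise be split.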

I still maintain that it's easy to make mistakes programming with
multi-length characters. I should know, I've made most of them :-).

Now for storage on disk, or transmission over the wire, UTF-8 is great.
But when you read that stuff into program memory for manipulation, then
fixed length is the way to go (IMHO).
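As a rough illustration of that split, the sketch below decodes UTF-8 into an array of fixed-width code points so that character N can be indexed directly. It is a minimal example under stated assumptions: the name utf8_to_ucs4() is hypothetical, only sequences of up to four bytes are handled, and overlong forms and surrogates are not rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into fixed-width code points.
 * Writes at most max entries to out and returns the number written,
 * or (size_t)-1 on a malformed or truncated sequence. */
static size_t utf8_to_ucs4(const char *src, uint32_t *out, size_t max)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s != 0 && n < max) {
        uint32_t cp;
        int extra;   /* number of continuation bytes still expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;  /* stray continuation or invalid lead */
        s++;

        while (extra-- > 0) {
            /* A NUL here fails the check, so we never read past the end. */
            if ((*s & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
        }
        out[n++] = cp;
    }
    return n;
}
```

After decoding, out[i] is the i-th character regardless of how many bytes it took on disk or on the wire, at the cost of four bytes per character in memory.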

Jeremy.