[Draft #2] Samba 3.0 roadmap...idmap storage & central idmap repository

Simo Sorce simo.sorce at xsec.it
Tue Jul 9 13:16:05 GMT 2002


On Tue, 2002-07-09 at 21:32, Michael Sweet wrote:
> Simo Sorce wrote:
> > Hi metze,
> > at the top of the first doc I see you state that all strings should
> > be UTF-8. I heartily disagree; I would rather see all internal
> > strings in new code be UCS-2.
> > UTF-8 has many disadvantages:
> > 1. It requires any RPC request that comes from a client to be
> > converted back and forth (UCS-2->UTF-8->UCS-2).
> 
> Some "conversion" will always be required, not only for byte order
> issues (remember that UCS-2 strings can contain byte-order overrides)
> but for normalization forms that may be required.

That conversion will already be in place anyway for non-little-endian
machines, so there is no extra overhead there.
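
To make the byte-order point concrete, here is a minimal sketch, assuming
the "byte-order overrides" mentioned above are byte-order marks (U+FEFF).
It uses only Python's standard codecs and is an illustration, not Samba
code.

```python
import codecs

# The same text serialized in both byte orders, each prefixed with its BOM.
le = codecs.BOM_UTF16_LE + "Samba".encode("utf-16-le")
be = codecs.BOM_UTF16_BE + "Samba".encode("utf-16-be")

# The generic "utf-16" codec reads the BOM, picks the byte order, and
# drops the BOM from the result.
decoded_le = le.decode("utf-16")
decoded_be = be.decode("utf-16")

# A codec with a fixed byte order does not interpret the BOM: it stays in
# the decoded string as the character U+FEFF.
raw_le = le.decode("utf-16-le")
```

A decoder that always assumes little endian therefore still has to decide
what to do with a leading U+FEFF.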

> Also, some SMB clients are using UTF-16 now (superset of UCS-2 to
> support code points in other Unicode planes) instead of UCS-2.

Which clients?
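
For reference, the difference between UCS-2 and UTF-16 only shows up for
code points outside the Basic Multilingual Plane, which UTF-16 encodes as
a surrogate pair. A small sketch with Python's standard codecs (the sample
characters are arbitrary choices for illustration):

```python
bmp = "\u00e4"          # LATIN SMALL LETTER A WITH DIAERESIS, inside the BMP
astral = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

# Number of 16-bit code units each character needs.
bmp_units = len(bmp.encode("utf-16-le")) // 2
astral_units = len(astral.encode("utf-16-le")) // 2

# Big-endian bytes of the surrogate pair: high surrogate, then low surrogate.
astral_bytes = astral.encode("utf-16-be")
```

Code that treats every 16-bit unit as one character stays correct only as
long as no client sends such a pair.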

> Finally, most UNIX filesystems only support the UTF-8 representation
> of Unicode, so at some point UCS-2/UTF-16 will have to be converted
> to UTF-8 anyways...

Yep, that conversion will live in the new VFS interface with NTFS
semantics, not all over the code.

> > 2. It is difficult to manipulate UTF-8 strings, as they use
> > variable-length multibyte chars, and sometimes an uppercase char has
> > a different length than its lowercase counterpart.
> > ...

Ah, not to mention char and string searches inside a string.
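
One concrete case of the length change on case conversion, as a sketch
using Python's built-in Unicode case mapping (the dotless i is just one
example picked for illustration):

```python
lower = "\u0131"        # LATIN SMALL LETTER DOTLESS I
upper = lower.upper()   # its uppercase mapping is the plain ASCII "I"

# Encoded size in UTF-8: the two cases differ in length.
utf8_lower = len(lower.encode("utf-8"))
utf8_upper = len(upper.encode("utf-8"))

# Encoded size in 16-bit units (UCS-2 range): both cases take two bytes.
ucs2_lower = len(lower.encode("utf-16-le"))
ucs2_upper = len(upper.encode("utf-16-le"))
```

So uppercasing a UTF-8 string in place can change its byte length, while
for this character the 16-bit form keeps the same size.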

> UCS-2 can have different byte orders, and with UTF-16 you also need
> to keep track of the current plane as well, which makes life even
> more fun.

Not in the CIFS world currently: in practice we are forced to be little
endian, and I have never seen an SMB client that uses UTF-16. Moreover,
manipulating UCS-2 strings (null-word terminated) is much easier and
faster than manipulating UTF-8 strings, so I really think that it is the
way to go.
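
A rough sketch of what "null-word terminated" means in practice, assuming
little-endian UCS-2 with no surrogate pairs. The helper names
`ucs2_strlen` and `ucs2_char_at` are hypothetical, made up for this
illustration, and are not Samba functions.

```python
import struct

def ucs2_strlen(buf: bytes) -> int:
    """Count the 16-bit units before the terminating null word."""
    for i in range(0, len(buf) - 1, 2):
        (unit,) = struct.unpack_from("<H", buf, i)
        if unit == 0:
            return i // 2
    raise ValueError("no terminating null word")

def ucs2_char_at(buf: bytes, index: int) -> str:
    """Fetch the character at a given index with one fixed-offset read."""
    (unit,) = struct.unpack_from("<H", buf, 2 * index)
    return chr(unit)

# "smb" plus a-diaeresis, followed by the terminating null word.
wire = "smb\u00e4".encode("utf-16-le") + b"\x00\x00"
length = ucs2_strlen(wire)
last = ucs2_char_at(wire, 3)
```

Note that the buffer contains single zero bytes (the high byte of every
ASCII character), so a byte-wise strlen would stop too early; only a full
zero word ends the string.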

> In addition, no matter what Unicode representation is used, you
> still have to deal with different representations of the "same"
> character (is it a single character "a" with an umlaut, or "a"
> plus a combining umlaut character?, etc.)

If for that problem it does not matter which representation we use, then
it is better to go with the one that eases programming (and easily avoids
lots of errors, especially in character or string searches inside a
string and in uppercasing/lowercasing).
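
The "same character, two representations" issue can be seen with Python's
standard `unicodedata` module. This is only an illustration of the
normalization problem and applies regardless of the encoding chosen:

```python
import unicodedata

precomposed = "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS
combining = "a\u0308"     # "a" followed by COMBINING DIAERESIS

# A plain comparison sees two different strings.
same_raw = precomposed == combining

# After normalizing to a common form they compare equal.
same_nfc = unicodedata.normalize("NFC", combining) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == combining
```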

Simo.

-- 
Simo Sorce - simo.sorce at xsec.it
Xsec s.r.l.
via Durando 10 Ed. G - 20158 - Milano
tel. +39 02 2399 7130 - fax: +39 02 700 442 399