utf8 vs ucs2

Michael Sweet mike at easysw.com
Tue May 22 11:53:42 GMT 2001


Andrew Tridgell wrote:
> ...
> yes, I agree. We still need separate rpcstr_*() functions though, as
> it needs to look at the negotiated format.

Right, I'm just thinking that once you have "decoded" the RPC message
you probably want everything in native word order so you can use the
"standard" UCS2 string functions.  Basically:

    rpcstr_to_native(to, from, len);
    rpcstr_to_wire(to, from, len);

or even just an in-place swap (to and from the same) that is conditional
based on the word order of the client.
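
A minimal sketch of the in-place variant, assuming the negotiated wire order is available as a simple flag. The name ucs2_swap_if_needed() and the wire_is_big_endian parameter are hypothetical, not existing Samba functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an existing Samba function): swap the two bytes
 * of every 16-bit unit in place, but only when the wire order differs from
 * the host order.  len counts 16-bit units, not bytes. */
static void ucs2_swap_if_needed(uint16_t *buf, size_t len, int wire_is_big_endian)
{
    const uint16_t probe = 1;
    /* The host is big-endian if the first byte of the value 1 is zero. */
    int host_is_big_endian = (*(const unsigned char *)&probe == 0);
    size_t i;

    assert(buf != NULL || len == 0);

    if (host_is_big_endian == (wire_is_big_endian != 0))
        return;  /* same order on both sides: nothing to do */

    for (i = 0; i < len; i++)
        buf[i] = (uint16_t)((buf[i] << 8) | (buf[i] >> 8));
}
```

Because the swap is its own inverse, the same call would serve for both the "to native" and "to wire" directions.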

> ...
> My concern with using utf8 long term is that the wildcard code and
> strupper/strlower/strcasecmp are very hard (and probably very slow) in
> utf8. The lack of a uniform character size makes that sort of
> manipulation extremely tedious.

It isn't *too* bad, but there is definitely more CPU overhead than
processing 16-bit chars.  What you end up doing is building the Unicode
value (which can be wider than 16 bits; actually up to 31 bits in the
6-byte form of UTF-8) from each multi-byte sequence in the string, and
then doing the comparison or case shift.  Mapping non-ASCII characters to
upper/lowercase can be a pain (you certainly don't want a 64k LUT),
but if you just want to handle case for ASCII characters then the
standard functions (including strcasecmp) will work without change
on UTF-8 strings, because every byte of a multi-byte sequence has the
high bit set and so never looks like an ASCII letter.
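
To make the "build the Unicode value, then compare" idea concrete, here is a rough sketch. All of the names (utf8_decode, fold_case, utf8_casecmp) are made up for illustration, the decoder handles only sequences of up to 4 bytes and does not reject overlong forms, and fold_case() is an ASCII-only placeholder standing in for a real Unicode mapping:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s into *cp.  Returns the number of bytes
 * consumed, or 0 if the sequence is malformed or truncated. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t n, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);

    if (s[0] < 0x80) { *cp = s[0]; return 1; }          /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; n = 4; }
    else return 0;                                      /* bad lead byte */

    for (i = 1; i < n; i++) {
        /* A NUL terminator also fails this test, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return n;
}

/* Placeholder case fold: ASCII only.  A real one would use a Unicode table. */
static uint32_t fold_case(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
}

/* Case-insensitive compare of two NUL-terminated UTF-8 strings, one code
 * point at a time.  Malformed bytes are compared as raw byte values. */
static int utf8_casecmp(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        uint32_t ca, cb;
        size_t na = utf8_decode(pa, &ca);
        size_t nb = utf8_decode(pb, &cb);

        if (na == 0) { ca = *pa; na = 1; }
        if (nb == 0) { cb = *pb; nb = 1; }

        ca = fold_case(ca);
        cb = fold_case(cb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;  /* both strings ended together */
        pa += na;
        pb += nb;
    }
}
```

The per-character decode is where the extra CPU cost over fixed 16-bit units comes from; the comparison itself is the same either way.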

> ...
> Or maybe someone knows a clever plan for doing
> strupper/strlower/strcasecmp directly in utf8 quickly? Have any other
> projects managed that? If there is a way then that would change things
> dramatically.

I'll play around a bit; the nice thing is that once we have the
upper/lower mapping for Unicode, we can use it for UTF-8 or for UCS2...
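
One possible shape for such a mapping, avoiding a 64k table, is a short list of ranges with a fixed offset. This is only a sketch under that assumption: the struct, the table name and unicode_toupper() are hypothetical, and the table covers just a few contiguous ranges for illustration rather than all of Unicode:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of lowercase code points and the offset to uppercase. */
struct case_range {
    uint32_t first;
    uint32_t last;
    int32_t  delta;
};

/* Illustrative subset only; a complete table would need many more entries
 * plus special cases for characters that do not map by a fixed offset. */
static const struct case_range upper_ranges[] = {
    { 0x0061, 0x007A, -32 },  /* Basic Latin a-z */
    { 0x00E0, 0x00F6, -32 },  /* Latin-1 a-grave to o-diaeresis */
    { 0x00F8, 0x00FE, -32 },  /* Latin-1 o-slash to thorn */
    { 0x03B1, 0x03C1, -32 },  /* Greek alpha to rho */
    { 0x0430, 0x044F, -32 },  /* Cyrillic a to ya */
};

/* Map a Unicode code point to uppercase; unknown code points pass through.
 * The input can come from a UCS2 unit or from a decoded UTF-8 sequence. */
static uint32_t unicode_toupper(uint32_t cp)
{
    size_t i;
    size_t count = sizeof(upper_ranges) / sizeof(upper_ranges[0]);

    assert(count > 0);

    for (i = 0; i < count; i++) {
        if (cp >= upper_ranges[i].first && cp <= upper_ranges[i].last)
            return (uint32_t)((int32_t)cp + upper_ranges[i].delta);
    }
    return cp;
}
```

Since the function works on code points rather than on an encoding, a UCS2 string could pass each 16-bit unit straight in, while a UTF-8 string would decode each sequence first and re-encode the result.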

-- 
______________________________________________________________________
Michael Sweet, Easy Software Products                  mike at easysw.com
Printing Software for UNIX                       http://www.easysw.com



