utf8 vs ucs2

Andrew Tridgell tridge at samba.org
Tue May 22 00:35:24 GMT 2001


> Since byte-swapping is pretty fast, and (IIRC) you are copying the
> RPC structures by hand anyways, it might pay to convert the UCS2
> chars to the native word order when used internally, and then just
> convert back (as needed) when a message is sent back.

Yes, I agree. We still need separate rpcstr_*() functions though, as
they need to look at the negotiated format.
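To make the convert-on-the-way-in, convert-on-the-way-out idea concrete, here is a minimal sketch. It assumes the wire format is little-endian UCS2; the names ucs2_pull() and ucs2_push() are made up for illustration and are not existing Samba functions. Composing each code unit from bytes avoids any dependence on the host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one UCS2 code unit from a little-endian
 * wire buffer into the host's native word order. */
static uint16_t ucs2_from_wire(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical helper: write one native code unit back out in
 * little-endian wire order. */
static void ucs2_to_wire(uint16_t c, unsigned char *p)
{
    p[0] = (unsigned char)(c & 0xFF);
    p[1] = (unsigned char)(c >> 8);
}

/* Convert n code units from the wire into a native array. */
static void ucs2_pull(uint16_t *dst, const unsigned char *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = ucs2_from_wire(src + 2 * i);
}

/* Convert n native code units back to wire order for a reply. */
static void ucs2_push(unsigned char *dst, const uint16_t *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        ucs2_to_wire(src[i], dst + 2 * i);
}
```

Once the string is in native order, all internal comparisons can work on plain uint16_t values, and only the pull/push boundary needs to know about the wire layout.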

> Personally, I'd like to see everything done in UTF-8 to minimize
> the memory usage impact, but given that RPC and SMB use UCS2
> almost exclusively it makes more sense to use UCS2 internally
> than UTF-8.

My main concern isn't memory usage. We already waste oodles of RAM
on strings because of the pstring/fstring stuff (that needs fixing as
well, but I'll leave that for a separate discussion).

My concern with using utf8 long term is that the wildcard code and
strupper/strlower/strcasecmp are very hard (and probably very slow) in
utf8. The lack of a uniform character size makes that sort of
manipulation extremely tedious.
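As a small illustration of the variable character size, here is a sketch of what even basic utf8 handling has to do. The function names are hypothetical, and the character count assumes well-formed input.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the utf8 sequence that
 * starts with this lead byte, or 0 for a continuation/invalid byte. */
static size_t utf8_char_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: 7-bit ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Count characters (not bytes) in a well-formed utf8 string by
 * counting every byte that is not a 10xxxxxx continuation byte. */
static size_t utf8_count_chars(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

Note that finding the Nth character, or stepping back one character, needs a scan like this, whereas with a fixed 2-byte character it is plain array indexing.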

The only sane way I can see to do strupper/strlower/strcasecmp and the
wildcard matching is directly in ucs2. Unfortunately a lot of SMB
calls rely on one of these functions, so we will be constantly
converting utf8<->ucs2. That is acceptable as an intermediate step of
a long term plan, but I would hate to see us doing that forever. It
would just be too slow.
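For comparison, here is roughly what these operations look like on native-order ucs2. This is only a sketch under loud assumptions: the names are hypothetical, the case mapping covers just ASCII and Latin-1 (a real version would presumably use a full lookup table), and the matcher handles only '*' and '?', not the full set of SMB wildcard rules.

```c
#include <stdint.h>

/* Illustrative case mapping for one code unit: ASCII a-z and the
 * Latin-1 lower case letters 0xE0-0xFE (excluding 0xF7, the
 * division sign). Everything else is returned unchanged. */
static uint16_t ucs2_upper_unit(uint16_t c)
{
    if (c >= 'a' && c <= 'z') return (uint16_t)(c - 32);
    if (c >= 0x00E0 && c <= 0x00FE && c != 0x00F7)
        return (uint16_t)(c - 32);
    return c;
}

/* Upper-case a zero-terminated ucs2 string in place. The length
 * never changes, because every character is one code unit. */
static void ucs2_strupper(uint16_t *s)
{
    for (; *s; s++)
        *s = ucs2_upper_unit(*s);
}

/* Case-insensitive compare; returns <0, 0 or >0 like strcasecmp. */
static int ucs2_strcasecmp(const uint16_t *a, const uint16_t *b)
{
    while (*a && ucs2_upper_unit(*a) == ucs2_upper_unit(*b)) {
        a++;
        b++;
    }
    return (int)ucs2_upper_unit(*a) - (int)ucs2_upper_unit(*b);
}

/* Case-insensitive match of str against pat, where '?' matches
 * exactly one character and '*' matches any run of characters. */
static int ucs2_wild_match(const uint16_t *pat, const uint16_t *str)
{
    for (; *pat; pat++, str++) {
        if (*pat == '*') {
            /* Try the rest of the pattern against every tail. */
            do {
                if (ucs2_wild_match(pat + 1, str)) return 1;
            } while (*str++);
            return 0;
        }
        if (*str == 0) return 0;
        if (*pat != '?' &&
            ucs2_upper_unit(*pat) != ucs2_upper_unit(*str))
            return 0;
    }
    return *str == 0;
}
```

The point of the sketch is that '?' consumes exactly one array element and the case mapping is a per-element lookup, so there is no decoding step in any of the inner loops.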

Or maybe someone knows a clever plan for doing
strupper/strlower/strcasecmp directly in utf8 quickly? Have any other
projects managed that? If there is a way then that would change things
dramatically. 

Obviously you can do it fast for the special case of 7 bit languages
like English, and we'd put in that optimisation so the impact isn't so
large on most users while the conversion is happening, but if there is
a way to make it fast for multi-byte languages that would be great.
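The 7 bit special case could look something like the sketch below. The function name is hypothetical; it upper-cases in place only when every byte is 7-bit, and otherwise leaves the string untouched and tells the caller to take the slow path (for example, convert to ucs2 and back).

```c
/* Hypothetical fast path: returns 1 and upper-cases s in place if
 * the whole string is 7-bit ASCII; returns 0 and leaves s alone if
 * any byte has the high bit set, so the caller can fall back to a
 * full multi-byte aware routine. */
static int strupper_ascii_fast(char *s)
{
    char *p;

    /* First pass: bail out on any non-ASCII byte. */
    for (p = s; *p; p++)
        if ((unsigned char)*p & 0x80)
            return 0;

    /* Second pass: plain ASCII case mapping. */
    for (p = s; *p; p++)
        if (*p >= 'a' && *p <= 'z')
            *p = (char)(*p - 'a' + 'A');

    return 1;
}
```

The check-then-convert split keeps the string unmodified on failure, at the cost of a second pass over short strings.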

> Another area that needs attention is localization - right now
> none of the SAMBA client code is localized.

Indeed, that does need doing, but I'd like to keep it as a completely
separate issue. One of the problems we have had with addressing string
handling in the past is that we have tried to solve all the problems
at once, and that was too hard. I see string handling in Samba as
having 3 problems:

1) incorrect behaviour in SMB for multi-byte languages (e.g. right now
   our wildcard code is broken for multi-byte)

2) dynamic string allocation rather than pstring/fstring

3) localisation of messages

I'm concentrating on (1) at the moment. I think we leave localisation
until we sort out the other two (as the way we localise will probably
depend on how the others turn out).

Cheers, Tridge



