utf8 vs ucs2

Michael Sweet mike at easysw.com
Mon May 21 23:40:24 GMT 2001

Andrew Tridgell wrote:
> ...
> Another interesting area is string handling in the RPC code. That
> will require a set of functions similar to srvstr_*() but not quite
> the same. In particular, the RPC code needs to cope with ucs2 in
> either big or little endian format (it is negotiated as part of RPC,
> but is not negotiated in the main SMB code). So we will probably
> create a set of rpcstr_() functions. Luckily the RPC code already
> has similar functions in all the right places (it has always needed
> them as it is always ucs2).

Since byte-swapping is pretty fast, and (IIRC) you are copying the
RPC structures by hand anyway, it might pay to convert the UCS2
chars to the native byte order when they are read in, and then just
convert back (as needed) when a message is sent out.  That way
you just have a single set of UCS2 string functions to manage, and
the most common (?) setup (a PC connecting to a Linux/Solaris server)
will have 0 performance impact whenever the client and server byte
orders already match.
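
To make the idea concrete, here is a rough sketch of what I mean. The
function names are made up for illustration and are not existing Samba
functions, and I am assuming the negotiated RPC byte order is available
as a simple flag:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not existing Samba functions. */

/* Returns 1 if this host stores the low byte first (little-endian). */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Swap the two bytes of one UCS2 code unit. */
static uint16_t ucs2_swap(uint16_t c)
{
    return (uint16_t)((c << 8) | (c >> 8));
}

/*
 * Convert len UCS2 code units in place between wire order and host order.
 * wire_is_little_endian is assumed to come from the RPC negotiation.
 * The same call works in both directions, because a byte swap is its
 * own inverse.
 */
static void ucs2_wire_convert(uint16_t *s, size_t len,
                              int wire_is_little_endian)
{
    size_t i;

    /* Byte orders already match: nothing to do (the zero-cost case). */
    if (host_is_little_endian() == (wire_is_little_endian != 0))
        return;

    for (i = 0; i < len; i++)
        s[i] = ucs2_swap(s[i]);
}
```

You would call ucs2_wire_convert() once when unmarshalling a string and
once more just before marshalling the reply; everything in between can
then assume host order.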

Personally, I'd like to see everything done in UTF-8 to minimize
the memory usage impact, but given that RPC and SMB use UCS2
almost exclusively it makes more sense to use UCS2 internally
than UTF-8.


Another area that needs attention is localization - right now
none of the SAMBA client code is localized.  This will have to
change (obviously) if we really want to make SAMBA accessible
from all languages...  Here I think we want to use UTF-8 or
any of the 8-bit ISO character sets, so that the client programs
don't have to replace printf and co. with uprintf (or whatever
the UCS-2 equivalent would be) which would then have to discover
if stdin and stdout are 8-bit or 16-bit...
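
As a rough sketch of how a client program could keep using plain
printf, something like the following would turn a host-order UCS2
string into UTF-8 just before output. The function name is hypothetical
(not an existing SAMBA function), and it assumes the input has already
been converted to host byte order:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert len host-order UCS2 code units into a
 * NUL-terminated UTF-8 string in out (capacity outsize bytes).
 * Returns the number of bytes written, not counting the NUL, or
 * (size_t)-1 if out is too small.  Surrogates get no special handling,
 * since UCS2 has no notion of them.
 */
static size_t ucs2_to_utf8(const uint16_t *in, size_t len,
                           char *out, size_t outsize)
{
    size_t i, n = 0;

    if (outsize == 0)
        return (size_t)-1;

    for (i = 0; i < len; i++) {
        uint16_t c = in[i];
        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : 3;

        /* Always leave room for the terminating NUL. */
        if (n + need >= outsize)
            return (size_t)-1;

        if (need == 1) {
            out[n++] = (char)c;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[n++] = (char)(0xE0 | (c >> 12));
            out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }

    out[n] = '\0';
    return n;
}
```

The result can go straight to printf("%s") on a UTF-8 terminal; for the
8-bit ISO character sets a table lookup would be needed instead.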

Time to come up with a language that contains two characters -
"foo" and "bar"... :)

Michael Sweet, Easy Software Products                  mike at easysw.com
Printing Software for UNIX                       http://www.easysw.com

More information about the samba-technical mailing list