utf8 vs ucs2

Andrew Tridgell tridge at samba.org
Tue May 22 13:27:34 GMT 2001


Mike,

> Right, I'm just thinking that once you have "decoded" the RPC message
> you probably want everything in native word order so you can use the
> "standard" UCS2 string functions.  

yes, I think that probably is best, but some people have argued before
for supporting both byte orders in Samba's unicode strings. I didn't
find the arguments really convincing, but I wasn't all that interested
in unicode at the time so I didn't listen much.

Jeremy, do you remember the arguments for having internal strings in
the byte order that the client negotiates? What did we decide about
that? I remember we talked about it at the last cifs con, but I don't
remember the details.

> It isn't *too* bad, but there is definitely more CPU overhead than
> processing 16-bit chars.  What you end up doing is building the 16-bit
> (actually up to 28-bits) Unicode value from each string, and then
> doing the comparison or case shift.

so you end up essentially doing the utf8->ucs2 conversion, which is
what I was proposing.
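
A minimal sketch of that conversion, to show what "building the
16-bit value" amounts to. The function names here are made up for
illustration and are not existing Samba code. It assumes we only care
about characters that fit in 16 bits, so 4-byte and longer sequences
are rejected, along with overlong forms and surrogates.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode one UTF-8 sequence (at most len bytes) into a 16-bit value.
 * Returns the number of bytes consumed, or 0 if the input is malformed,
 * truncated, or does not fit in 16 bits. Hypothetical helper.
 */
static size_t utf8_to_ucs2_char(const unsigned char *s, size_t len,
                                uint16_t *out)
{
    uint16_t v;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                      /* 1 byte: plain ASCII */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {            /* 2 bytes: 110xxxxx 10xxxxxx */
        if (len < 2 || (s[1] & 0xC0) != 0x80)
            return 0;
        v = (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        if (v < 0x80)                       /* overlong encoding */
            return 0;
        *out = v;
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {            /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return 0;
        v = (uint16_t)(((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) |
                       (s[2] & 0x3F));
        if (v < 0x800 || (v >= 0xD800 && v <= 0xDFFF))
            return 0;                       /* overlong, or a surrogate */
        *out = v;
        return 3;
    }
    return 0;                               /* stray continuation byte or > 16 bits */
}

/*
 * Convert a NUL-terminated UTF-8 string into 16-bit values in native
 * byte order. Returns the number of characters written (not counting
 * the terminator), or (size_t)-1 on bad input or a too-small buffer.
 */
static size_t utf8_to_ucs2(const char *src, uint16_t *dst, size_t dst_len)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t remaining = strlen(src);
    size_t n = 0;

    if (dst_len == 0)
        return (size_t)-1;
    while (remaining > 0) {
        uint16_t c;
        size_t used = utf8_to_ucs2_char(s, remaining, &c);

        if (used == 0 || n + 1 >= dst_len)  /* keep room for the terminator */
            return (size_t)-1;
        dst[n++] = c;
        s += used;
        remaining -= used;
    }
    dst[n] = 0;
    return n;
}
```

Once the string is in this form, the comparison or case shift is one
table lookup per 16-bit value.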

What I wondered is if there is a fast way to do it directly in utf8. I
suspect there isn't, but I have only a little experience with utf8.

> Mapping non-ASCII characters to
> upper/lowercase can be a pain (you certainly don't want a 64k LUT),
> but if you just want to handle case for ASCII characters then the
> standard functions (including strcasecmp) will work without change
> with the UTF-8 strings.

we already have a big (sparse) table that tells us the upper/lower
mappings for ucs2. We'll just keep that table unless someone has a
better way. We really do need to get this right for all ucs2 chars,
not just 7 bit ascii. Using an SMB server that doesn't get the case
sensitivity issues right is really not very nice.

> I'll play around a bit; the nice thing is that once we have the
> upper/lower mapping for Unicode, we can use it for UTF-8 or for
> UCS2...

See map_table_lower() and map_table_upper() in lib/util_unistr.c 
and include/unicode_map_table{1,2}.h.

It uses a simple trick to make the table a bit sparse which saves some
memory. The trick is also necessary for some compilers as the full
table overflows their internal limits. By splitting it and not having
the big zero region between characters 9450 and 64256 we save enough
space for it to compile on all reasonable compilers.
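
Roughly, the split lookup works like the sketch below. This is not the
real code from lib/util_unistr.c: the names are invented, the tables
are filled with only a few sample entries instead of the generated
data, and the exact handling of the two boundary values is an
assumption.

```c
#include <stdint.h>

/* Boundaries of the empty region, taken from the numbers above. */
#define TABLE1_END   9450u     /* table 1 covers 0 .. 9449 */
#define TABLE2_START 64256u    /* table 2 covers 64256 .. 65535 */

/* Hypothetical stand-ins for the two generated tables; 0 = no mapping. */
static uint16_t upper_table1[TABLE1_END];
static uint16_t upper_table2[65536u - TABLE2_START];

/* Fill in a few sample mappings so the sketch can be exercised. */
static void sketch_init_tables(void)
{
    uint16_t c;

    for (c = 'a'; c <= 'z'; c++)
        upper_table1[c] = (uint16_t)(c - 0x20);        /* ASCII a-z */
    upper_table1[0x00E9] = 0x00C9;                     /* e-acute */
    upper_table2[0xFF41u - TABLE2_START] = 0xFF21;     /* fullwidth a */
}

/* Upper-case one 16-bit character using the split table. */
static uint16_t sketch_toupper_w(uint16_t c)
{
    uint16_t m;

    if (c < TABLE1_END)
        m = upper_table1[c];
    else if (c >= TABLE2_START)
        m = upper_table2[c - TABLE2_START];
    else
        return c;          /* inside the gap: nothing is stored here */

    return m ? m : c;      /* 0 means the character has no mapping */
}
```

The two arrays together hold 9450 + 1280 entries instead of 65536,
which is where the saving comes from.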

I really wish iconv() knew about upper/lower case of characters!




