utf8 vs ucs2

Michael Sweet mike at easysw.com
Tue May 22 13:58:36 GMT 2001


Andrew Tridgell wrote:
> ...
> so you end up essentially doing the utf8->ucs2 conversion, which is
> what I was proposing.
> 
> What I wondered is if there is a fast way to do it direct in utf8. I
> suspect there isn't, but I have only a little experience with utf8.

You can at least "fast path" the ASCII characters, and only do the
UTF-8 -> UCS2 conversion for characters whose first byte has bit 7 set.
I suspect this would eliminate nearly all (say 99%) of the overhead of
using UTF-8 when names are mostly ASCII. Given that most UNIXes require
UTF-8 for Unicode filenames, you might still want code that compares a
UTF-8 string directly to a UCS2 string, for example; otherwise you do
a lot of string conversions for nothing.

I'll come up with some examples and post them to the list...
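
As a first rough sketch of the fast path (not Samba code; the names
utf8_to_ucs2_char() and utf8_ucs2_equal() are made up for illustration,
and the UCS2 string is assumed to be zero-terminated uint16_t values in
host byte order), a direct UTF-8 vs. UCS2 comparison might look like:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one 2- or 3-byte UTF-8 sequence (the part
 * of UTF-8 that fits in UCS2) starting at s.  Stores the character in
 * *out and returns the number of bytes consumed, or 0 if the sequence
 * is malformed, overlong, or longer than 3 bytes.
 */
static size_t utf8_to_ucs2_char(const unsigned char *s, uint16_t *out)
{
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        uint16_t c = (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        if (c < 0x80)
            return 0;               /* overlong encoding */
        *out = c;
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        uint16_t c = (uint16_t)(((s[0] & 0x0F) << 12) |
                                ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
        if (c < 0x800)
            return 0;               /* overlong encoding */
        *out = c;
        return 3;
    }
    return 0;
}

/*
 * Returns 1 if the zero-terminated UTF-8 string u8 and the
 * zero-terminated UCS2 string u16 hold the same characters, else 0.
 * Malformed UTF-8 is treated as "not equal".
 */
int utf8_ucs2_equal(const char *u8, const uint16_t *u16)
{
    const unsigned char *s = (const unsigned char *)u8;

    for (;;) {
        uint16_t c;
        size_t n;

        if (*s < 0x80) {            /* fast path: bit 7 clear, plain ASCII */
            if (*s != *u16)
                return 0;
            if (*s == 0)
                return 1;           /* both strings ended together */
            s++;
            u16++;
            continue;
        }

        n = utf8_to_ucs2_char(s, &c);   /* slow path: bit 7 set */
        if (n == 0 || c != *u16)
            return 0;
        s += n;
        u16++;
    }
}
```

For an all-ASCII name the loop never leaves the first branch, so the
cost is one compare per byte and no conversion buffer is needed.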

> ...
> See map_table_lower() and map_table_upper() in lib/util_unistr.c
> and include/unicode_map_table{1,2}.h.
> 
> It uses a simple trick to make the table a bit sparse which saves some
> memory. The trick is also necessary for some compilers as the full
> table overflows their internal limits. By splitting it and not having
> the big zero region between characters 9450 and 64256 we save enough
> space for it to compile on all reasonable compilers.

We may be able to come up with a faster/smaller solution; I wrote a
quick program that pairs the upper/lowercase Unicode characters and
there *is* a pattern to the madness...
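
To illustrate the kind of pattern meant here (this is an assumed
example, not the output of the program mentioned above): many
upper/lowercase pairs sit at a fixed offset from each other, or
alternate upper/lower within a block, so a short range table can stand
in for part of a full lookup table. The struct, the table, and
ucs2_tolower_sketch() below are hypothetical, and the table is
deliberately incomplete:

```c
#include <stddef.h>
#include <stdint.h>

/* One run of uppercase characters that share the same mapping rule. */
struct case_range {
    uint16_t first;   /* first uppercase code point in the run */
    uint16_t last;    /* last uppercase code point in the run (inclusive) */
    int16_t  delta;   /* add this to get the lowercase code point */
    uint8_t  stride;  /* 1 = every code point, 2 = every other one */
};

/* Illustrative subset only; a real table would need many more runs. */
static const struct case_range upper_ranges[] = {
    { 0x0041, 0x005A, 32, 1 },  /* ASCII A-Z */
    { 0x00C0, 0x00D6, 32, 1 },  /* Latin-1, up to O with diaeresis */
    { 0x00D8, 0x00DE, 32, 1 },  /* Latin-1, skipping U+00D7 (times sign) */
    { 0x0100, 0x012E,  1, 2 },  /* Latin Extended-A: even = upper, odd = lower */
    { 0x0391, 0x03A1, 32, 1 },  /* Greek Alpha-Rho */
    { 0x03A3, 0x03A9, 32, 1 },  /* Greek Sigma-Omega */
    { 0x0400, 0x040F, 80, 1 },  /* Cyrillic U+0400-U+040F */
    { 0x0410, 0x042F, 32, 1 },  /* Cyrillic A-Ya */
};

/* Lowercase c if it falls in one of the runs above, else return it as-is. */
uint16_t ucs2_tolower_sketch(uint16_t c)
{
    size_t i;

    for (i = 0; i < sizeof(upper_ranges) / sizeof(upper_ranges[0]); i++) {
        const struct case_range *r = &upper_ranges[i];

        if (c >= r->first && c <= r->last &&
            (c - r->first) % r->stride == 0)
            return (uint16_t)(c + r->delta);
    }
    return c;
}
```

A linear scan is fine for a handful of runs; with a complete table a
binary search on "first" would be the obvious next step.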

-- 
______________________________________________________________________
Michael Sweet, Easy Software Products                  mike at easysw.com
Printing Software for UNIX                       http://www.easysw.com



