i18n question.

Benjamin Riefenstahl Benjamin.Riefenstahl at epost.de
Sun Mar 7 12:27:44 GMT 2004


Hi,


Kenichi Okuyama <okuyamak at dd.iij4u.or.jp> writes:
> For example, Cyrillic. I've learned from other Japanese people that
> the current 3.x Cyrillic case-insensitive code converts from UTF-8 to
> UCS2, then handles whatever is necessary, and converts back to UTF-8.
> This is happening for Greek and other languages too.  You call this
> thing 'good performance'? I don't.  Do we have a better way? No, as
> long as we use UTF-8.  What is a better solution? Use UTF-16!

Sure there are better ways.  UTF-8 encodes every character up to
U+FFFF in at most 3 bytes (Greek and Cyrillic need only 2).  You can
detect from the first byte whether you are potentially dealing with
Cyrillic, Greek or another script that has case mappings, and then do
a limited 16-bit mapping using the remaining one or two bytes (even
without conversion to UTF-16).  So you do not need to convert all text
to UTF-16, and you can even play some tricks to limit the tables to
the ranges where you actually have to do anything.
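
A minimal sketch of that idea in C, assuming we only care about
lowercasing ASCII, the basic Greek capitals and the basic Cyrillic
capitals.  The function names are hypothetical and this is not the
Samba code; it only illustrates dispatching on the lead byte and
mapping a 2-byte sequence in place.  It works in place because the
lowercase forms in these ranges are also 2 bytes long.

```c
#include <stddef.h>

/* Hypothetical helper: map a code point from the basic Greek or
 * Cyrillic capital ranges to lowercase; return anything else as is. */
static unsigned lower_2byte(unsigned cp)
{
    if (cp >= 0x0391 && cp <= 0x03AB && cp != 0x03A2)
        return cp + 0x20;               /* Greek capitals */
    if (cp >= 0x0410 && cp <= 0x042F)
        return cp + 0x20;               /* Cyrillic U+0410..U+042F */
    if (cp >= 0x0400 && cp <= 0x040F)
        return cp + 0x50;               /* Cyrillic U+0400..U+040F */
    return cp;
}

/* Lowercase a UTF-8 buffer in place without converting it to UTF-16.
 * Only the lead bytes 0xCE (Greek capitals) and 0xD0 (Cyrillic
 * capitals) trigger any decoding; all other bytes are skipped. */
void utf8_lower_inplace(unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b0 = s[i];

        if (b0 < 0x80) {
            /* ASCII: single byte, trivial mapping. */
            if (b0 >= 'A' && b0 <= 'Z')
                s[i] = (unsigned char)(b0 + 0x20);
            i += 1;
        } else if ((b0 == 0xCE || b0 == 0xD0) && i + 1 < len &&
                   (s[i + 1] & 0xC0) == 0x80) {
            /* Decode just this 2-byte sequence, map it, re-encode. */
            unsigned cp = ((unsigned)(b0 & 0x1F) << 6) | (s[i + 1] & 0x3F);

            cp = lower_2byte(cp);
            s[i]     = (unsigned char)(0xC0 | (cp >> 6));
            s[i + 1] = (unsigned char)(0x80 | (cp & 0x3F));
            i += 2;
        } else {
            /* Any other byte: nothing to map here, leave untouched. */
            i += 1;
        }
    }
}
```

A real implementation would cover more ranges and the irregular
mappings, which is where the tables mentioned above come in.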

Compare that to handling the same thing in UCS2 (UTF-16 or UCS4 seem
not really interesting to me at the moment).  You either map the whole
range from U+0000 to U+FFFF, which costs 128KB of memory (65536
entries of 2 bytes each), or you use a similar algorithm that segments
the range and only maps the critical sections with limited tables.
I'd usually prefer the second approach, but that's probably only
slightly more efficient than doing it in UTF-8 with the method
outlined above.
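
To make the two UCS2 options concrete, here is a rough sketch in C.
All names (full_lower, case_range, ucs2_lower, init_full_lower) are
made up for illustration, and the range list is a small subset chosen
for the example, not a complete case table.

```c
#include <stddef.h>
#include <stdint.h>

/* Option 1: one entry per UCS2 value, 65536 * 2 bytes = 128KB. */
static uint16_t full_lower[65536];

/* Option 2: only describe the segments that actually change. */
struct case_range {
    uint16_t first;   /* first capital in the segment */
    uint16_t last;    /* last capital in the segment */
    int16_t  delta;   /* offset from capital to lowercase */
};

static const struct case_range lower_ranges[] = {
    { 0x0041, 0x005A, 0x20 },   /* ASCII A..Z */
    { 0x0391, 0x03A1, 0x20 },   /* Greek capitals, first run */
    { 0x03A3, 0x03AB, 0x20 },   /* Greek capitals, second run */
    { 0x0400, 0x040F, 0x50 },   /* Cyrillic U+0400..U+040F */
    { 0x0410, 0x042F, 0x20 },   /* Cyrillic U+0410..U+042F */
};

/* Segmented lookup: a handful of comparisons, a few dozen bytes. */
uint16_t ucs2_lower(uint16_t c)
{
    size_t n = sizeof lower_ranges / sizeof lower_ranges[0];

    for (size_t i = 0; i < n; i++) {
        if (c >= lower_ranges[i].first && c <= lower_ranges[i].last)
            return (uint16_t)(c + lower_ranges[i].delta);
    }
    return c;
}

/* Fill the full table once; afterwards a lookup is full_lower[c]. */
void init_full_lower(void)
{
    for (uint32_t c = 0; c < 65536; c++)
        full_lower[c] = ucs2_lower((uint16_t)c);
}
```

The full table buys a single array access per character at the cost
of memory; the segmented version trades that for a few comparisons.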

You do get some moderately complicated tables and non-intuitive code.
So you only want to do this kind of thing if you have either
determined that the simpler methods cause a real-life speed problem,
or if you are writing a library or system module (which would include
Samba for the sake of this topic).


benny


