i18n question.

Mon Mar 8 05:02:14 GMT 2004

Shiro,

 > Under the current implementation, multibyte string manipulation is
 > done with respect to UCS2. Whenever string standardisations, comparisons
 > and substitutions are necessary, the function first assumes the
 > string is in ASCII, and when it encounters a non-ASCII code in due
 > course it throws away the work done so far, converts the entire string
 > into UCS2, performs the string operations in UCS2, and converts back
 > to the unix charset. Now I haven't done any performance testing yet,
 > but it is certainly a slow operation.

In an earlier email I proposed allowing function-pointer hooks to
replace these core functions. We will have to do some work to get the
interfaces to these functions right, but that should not be too
hard. Then we would have a default set of functions that use the
current fast/slow method, with an alternative set of functions that
uses some (as yet undecided) method. We could either use a module or a
normal smb.conf parameter to choose the method.
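
As a rough sketch of what such a hook table might look like (the
struct, the function names and the "default"/"exact" method names are
all made up for illustration, none of this is existing Samba code, and
the "exact" set is only a placeholder for a real alternative method):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook table; these names are illustrative, not Samba's API. */
struct str_ops {
    const char *name;                               /* value used to choose the set */
    int  (*casecmp)(const char *a, const char *b);  /* case-insensitive compare */
    void (*upper)(char *s);                         /* in-place upper-casing */
};

/* Upper-case a single 7-bit ASCII letter; every other value is unchanged. */
static int ascii_up(int c)
{
    return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

/* "default" set: an ASCII-only stand-in for the current fast/slow method. */
static int default_casecmp(const char *a, const char *b)
{
    const unsigned char *x = (const unsigned char *)a;
    const unsigned char *y = (const unsigned char *)b;

    while (*x && ascii_up(*x) == ascii_up(*y)) {
        x++;
        y++;
    }
    return ascii_up(*x) - ascii_up(*y);
}

static void default_upper(char *s)
{
    for (; *s; s++)
        *s = (char)ascii_up((unsigned char)*s);
}

/* "exact" set: placeholder alternative that does no case folding at all. */
static int exact_casecmp(const char *a, const char *b)
{
    return strcmp(a, b);
}

static void exact_upper(char *s)
{
    (void)s; /* deliberately a no-op in this placeholder */
}

static const struct str_ops str_ops_table[] = {
    { "default", default_casecmp, default_upper },
    { "exact",   exact_casecmp,   exact_upper   },
};

/* Pick a set by name, as a config parameter might; unknown names fall
 * back to the default set. */
const struct str_ops *select_str_ops(const char *method)
{
    size_t i;

    assert(method != NULL);
    for (i = 0; i < sizeof str_ops_table / sizeof str_ops_table[0]; i++) {
        if (strcmp(str_ops_table[i].name, method) == 0)
            return &str_ops_table[i];
    }
    return &str_ops_table[0];
}
```

Callers would then go through the selected table, for example
select_str_ops("default")->casecmp(a, b), rather than calling a fixed
function, so a charset-specific set can be swapped in without touching
the callers.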

We would rely on people such as yourself, Kenichi and Monyo who are
actively using Samba in multi-byte environments to supply us with
these alternative methods. I can't see any reason why you couldn't
make these very fast. They could even be charset specific if need be.

If you end up creating these methods in such a way that they are as
fast (or nearly as fast) as the fast path methods we currently use for
7-bit characters, then we could even make your methods the
default. There is a nice challenge for you :)
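
For reference, the fast/slow method under discussion can be sketched
roughly as follows. This is only an illustration under some loud
assumptions: the unix charset is taken to be UTF-8, the hand-written
BMP-only converters stand in for the real conversion routines, the
case table covers only ASCII and the Latin-1 letters, and all function
names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decode UTF-8 (1 to 3 byte sequences, BMP only) into UCS-2 units.
 * Returns the number of units written, or -1 on unsupported input. */
static int utf8_to_ucs2(const char *src, uint16_t *dst, size_t max)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    while (*p) {
        uint16_t c;
        if (n >= max)
            return -1;
        if (p[0] < 0x80) {
            c = p[0];
            p += 1;
        } else if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            c = (uint16_t)(((p[0] & 0x1F) << 6) | (p[1] & 0x3F));
            p += 2;
        } else if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80 &&
                   (p[2] & 0xC0) == 0x80) {
            c = (uint16_t)(((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) |
                           (p[2] & 0x3F));
            p += 3;
        } else {
            return -1;
        }
        dst[n++] = c;
    }
    return (int)n;
}

/* Encode UCS-2 units back to UTF-8 and NUL-terminate the result. */
static void ucs2_to_utf8(const uint16_t *src, int n, char *dst)
{
    size_t o = 0;
    int i;

    for (i = 0; i < n; i++) {
        uint16_t c = src[i];
        if (c < 0x80) {
            dst[o++] = (char)c;
        } else if (c < 0x800) {
            dst[o++] = (char)(0xC0 | (c >> 6));
            dst[o++] = (char)(0x80 | (c & 0x3F));
        } else {
            dst[o++] = (char)(0xE0 | (c >> 12));
            dst[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            dst[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    dst[o] = '\0';
}

/* Tiny case table: ASCII a-z and Latin-1 U+00E0..U+00FE except U+00F7.
 * These mappings keep the UTF-8 length unchanged. */
static uint16_t ucs2_upper(uint16_t c)
{
    if (c >= 'a' && c <= 'z')
        return (uint16_t)(c - 0x20);
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return (uint16_t)(c - 0x20);
    return c;
}

/* Upper-case s in place. Returns 0 if the fast path sufficed, 1 if the
 * slow path was taken, -1 on error (s may then be partly upper-cased). */
int str_upper(char *s)
{
    char *p;
    uint16_t *w;
    size_t len;
    int n, i;

    assert(s != NULL);

    /* Fast path: assume 7-bit ASCII until a high byte proves otherwise. */
    for (p = s; *p; p++) {
        if ((unsigned char)*p >= 0x80)
            break;
        if (*p >= 'a' && *p <= 'z')
            *p = (char)(*p - 0x20);
    }
    if (*p == '\0')
        return 0;

    /* Slow path: start over and push the whole string through UCS-2. */
    len = strlen(s);
    w = malloc((len + 1) * sizeof *w);
    if (w == NULL)
        return -1;
    n = utf8_to_ucs2(s, w, len);
    if (n < 0) {
        free(w);
        return -1;
    }
    for (i = 0; i < n; i++)
        w[i] = ucs2_upper(w[i]);
    ucs2_to_utf8(w, n, s);
    free(w);
    return 1;
}
```

The cost being complained about is visible in the slow path: one high
byte anywhere in the string means an allocation plus two full
conversions, on top of the scan the fast path had already done.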

 > This is one of the biggest reasons we want to fix the internal codeset
 > to UCS2, as it is capable of manipulating strings in a consistent way,
 > regardless of whether the character set is MB or not.

While we are using portable ANSI-C this just isn't practical, quite
apart from the fact that UCS-2 is dead, and we would have to use
UTF-16 instead unless we wanted to alienate some other language
groups.

 > At the end of the day, we want as few conversions as possible. If
 > your compromise is based on the argument that the current two-step
 > string manipulation, the fast-path and (very) slow-path method,
 > stays as is, then that is adding one extra conversion (+ overhead
 > of calling a VFS module) and I don't believe that is solving the root
 > of this problem.

There is more than one "root" of this problem. Some complaints
are about performance, while others are about functionality. I believe
both problems can be solved while retaining the current 
"unix charset" == "internal charset" method.

You are right that my proposed charset translation module adds an
extra conversion on every call, but please remember that the
alternative proposal of making "internal charset" != "unix charset"
would add this extra conversion for _every_ user of Samba, whereas my
proposal would only add it for people who configure multiple different
charsets on the one server, and even then only on shares that don't
use the default charset. I think it is pretty obvious which is
preferable, both in terms of complexity and performance.
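
To make the cost argument concrete, here is a small sketch of the
per-share check. The struct, the function names and the call counter
are hypothetical, and the "conversion" is just a counted byte copy
standing in for a real charset converter; charset names are compared
with a plain strcmp for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-share setting; charset == NULL means "use the
 * server-wide unix charset". */
struct share {
    const char *name;
    const char *charset;
};

/* Counts how often the (stand-in) converter actually runs. */
unsigned long convert_calls;

/* A share only needs translation if it names a charset that differs
 * from the server default. */
static int share_needs_convert(const struct share *s, const char *unix_charset)
{
    return s->charset != NULL && strcmp(s->charset, unix_charset) != 0;
}

/* Copy `in` to `out`, going through the converter only when required.
 * Returns the length written, or (size_t)-1 if `out` is too small. */
size_t share_path_to_internal(const struct share *s, const char *unix_charset,
                              const char *in, char *out, size_t outlen)
{
    size_t len;

    assert(s != NULL && unix_charset != NULL && in != NULL && out != NULL);
    len = strlen(in);
    if (len + 1 > outlen)
        return (size_t)-1;

    if (share_needs_convert(s, unix_charset)) {
        /* Stand-in for a real charset conversion: only the call is
         * counted, the bytes are copied unchanged. */
        convert_calls++;
    }
    memcpy(out, in, len + 1);
    return len;
}
```

With a default-charset share the counter never moves, which is the
point: the extra step is paid only on shares configured with a
different charset.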

Cheers, Tridge