i18n question.

tridge at samba.org
Fri Mar 19 10:31:54 GMT 2004


Monyo,

 > Since, as Kenichi says, usage of such characters basically depends on
 > people, and I do not have statistical information, I think the
 > usage is relatively rare: we have lots of filenames containing
 > ASCII only, and a Japanese filename often consists of
 > KANJI, HIRAGANA and KATAKANA characters only. Using fullwidth Cyrillic
 > or Greek is quite rare. Using the fullwidth alphabet or Roman numerals
 > is also rare.

ok, so it sounds like this optimisation will be worthwhile for a
fairly large group of people.

The next step is to get an accurate and fast caseless_index() function
for the important character sets (at least UTF8 and Big5 will be
important here). Then we should extend "struct charset_functions" to
have a method called caseless_index() and implement a default
caseless_index() function that uses iconv() and a Unicode charset
(either UTF8 or UCS2, depending on whether we can make the UTF8
caseless_index() function fast).
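
A rough sketch of what a direct UTF8 caseless_index() might look like
follows. The semantics are an assumption, since they are not spelled
out above: here the function returns the byte offset of the first
character that may have case, or the string length if there is none.
The name utf8_caseless_index(), the helper may_have_case() and the
short list of caseless ranges are illustrative only. Anything outside
those ranges is treated conservatively as possibly cased.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a code point "may have case" unless it falls in one of
 * the few ranges listed here as caseless. This is deliberately
 * conservative and is not a complete Unicode case table. */
static int may_have_case(uint32_t cp)
{
	if (cp < 0x80) {
		return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
	}
	if (cp >= 0x3040 && cp <= 0x30FF) return 0; /* hiragana, katakana */
	if (cp >= 0x4E00 && cp <= 0x9FFF) return 0; /* CJK unified ideographs */
	return 1;
}

/* Hypothetical semantics: byte offset of the first character that may
 * have case, or the length of the string if there is none. Malformed
 * UTF-8 stops the scan at the offending lead byte. */
size_t utf8_caseless_index(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;
	size_t i = 0;

	while (p[i] != 0) {
		uint32_t cp;
		size_t len, k;

		if (p[i] < 0x80)                { cp = p[i];        len = 1; }
		else if ((p[i] & 0xE0) == 0xC0) { cp = p[i] & 0x1F; len = 2; }
		else if ((p[i] & 0xF0) == 0xE0) { cp = p[i] & 0x0F; len = 3; }
		else if ((p[i] & 0xF8) == 0xF0) { cp = p[i] & 0x07; len = 4; }
		else return i; /* invalid lead byte */

		for (k = 1; k < len; k++) {
			/* A NUL or any non-continuation byte ends the scan here. */
			if ((p[i + k] & 0xC0) != 0x80) return i;
			cp = (cp << 6) | (p[i + k] & 0x3F);
		}
		if (may_have_case(cp)) return i;
		i += len;
	}
	return i;
}
```

With these assumptions, a name made only of digits, kanji and kana
gives an index equal to its length, so two such names could be compared
byte for byte without any case folding.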

One thing we need to test is whether any of the UTF-16 characters that
cannot be represented in UCS-2 have case. We will need a new
smbtorture test program that works with UTF-16 to test this.
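
As one data point for such a test: the Unicode Deseret block lies
outside the range UCS-2 can represent and does define upper and lower
case letters (for example U+10400 and its lowercase form U+10428).
How a server actually treats these is what the test would have to
establish. The sketch below only shows how a test could build the
UTF-16 surrogate pairs for such a character; the function name
utf16_encode_supplementary() is made up for illustration.

```c
#include <stdint.h>

/* Encode a code point above U+FFFF as a UTF-16 surrogate pair.
 * Returns 0 on success, -1 if cp is outside U+10000..U+10FFFF. */
int utf16_encode_supplementary(uint32_t cp, uint16_t out[2])
{
	if (cp < 0x10000 || cp > 0x10FFFF) return -1;
	cp -= 0x10000;
	out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
	out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
	return 0;
}
```

U+10400 encodes as 0xD801 0xDC00 and U+10428 as 0xD801 0xDC28, so a
test could create a file under one name and try to open it under the
other.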

Monyo, can you or someone else in the Japanese Samba group write a
caseless_index() function for Big5? If you can also write a
case-insensitive strcmp() function directly in Big5 that would be
great as well. I think we should put a 
	compare_string(const char *, const char *);
function in "struct charset_functions" as well. Then we will extend
the lib/iconv.c API to implement string comparison.
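
To make the idea concrete, here is a minimal sketch of a per-charset
method table with both hooks. It is not the existing
"struct charset_functions" layout: the struct name
charset_methods_sketch, the strcmp()-style int return value of
compare_string(), and the charset_compare() wrapper are all
assumptions. Plain ASCII stands in for a real charset module, and the
fallback is only a placeholder for the iconv()-based default.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical method table; names and layout are assumptions. */
struct charset_methods_sketch {
	const char *name;
	size_t (*caseless_index)(const char *s);
	int (*compare_string)(const char *s1, const char *s2);
};

static int ascii_lower(int c)
{
	return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Case-insensitive comparison done directly in the charset (ASCII
 * here), with a strcmp()-like result. */
static int ascii_compare_string(const char *s1, const char *s2)
{
	const unsigned char *a = (const unsigned char *)s1;
	const unsigned char *b = (const unsigned char *)s2;

	while (*a != 0 && ascii_lower(*a) == ascii_lower(*b)) {
		a++;
		b++;
	}
	return ascii_lower(*a) - ascii_lower(*b);
}

/* caseless_index is left NULL in this sketch. */
const struct charset_methods_sketch ascii_methods = {
	"ASCII", NULL, ascii_compare_string
};

/* Generic entry point: use the charset's own comparison if it has
 * one. The strcmp() fallback is case-sensitive and only marks the
 * spot where a convert-then-compare default would go. */
int charset_compare(const struct charset_methods_sketch *cs,
		    const char *s1, const char *s2)
{
	if (cs != NULL && cs->compare_string != NULL) {
		return cs->compare_string(s1, s2);
	}
	return strcmp(s1, s2);
}
```

A Big5 or UTF8 module would supply its own table in the same way, and
callers would only ever go through the generic entry point.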

We should probably create a separate directory source/lib/iconv/ and
start putting the charset specific functions in there. For example we
could have source/lib/iconv/big5.c and source/lib/iconv/utf8.c plus
whatever other direct character set modules are important. Obviously
we will still support all character sets that normal iconv() supports
via conversion, but I think it makes sense to make non-converting
functions for the most widely used (and complex) charsets like Big5.

No rush on this of course, Samba4 is still a fair way off!

Cheers, Tridge


More information about the samba-technical mailing list