tridge at samba.org
Fri Mar 19 10:31:54 GMT 2004
> As Kenichi says, usage of such characters basically depends on the
> people, and I do not have statistical information. I think the
> usage is relatively rare, because we have lots of filenames containing
> ASCII only, and when we use Japanese filenames, they often consist of
> KANJI, HIRAGANA and KATAKANA characters only. Using fullwidth Cyrillic
> or Greek is quite rare. Using the fullwidth alphabet or Roman numerals
> is also rare.
ok, so it sounds like this optimisation will be worthwhile for a
fairly large group of people.
The next step is to get an accurate and fast caseless_index() function
for the important character sets (at least UTF8 and Big5 will be
important here). Then we should extend "struct charset_functions" to
have a method called caseless_index() and implement a default
caseless_index() function that uses iconv() and a unicode charset
(either UTF8 or UCS2, depending on whether we can make the UTF8
caseless_index() function fast).
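
A minimal sketch of what such a default could look like. The post does
not give a signature for caseless_index(), so this assumes it maps a
string to a 32-bit hash that is equal for names differing only in case.
The struct layout, default_caseless_index(), fold_unit() and the fixed
512-byte buffer are all hypothetical. The fold shown handles ASCII only;
a real version would need a full Unicode case table.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: only the struct name and the two method
 * names come from the post; the signatures are assumptions. */
struct charset_functions {
    const char *name;
    uint32_t (*caseless_index)(const char *s);
    int (*compare_string)(const char *s1, const char *s2);
};

/* Placeholder case fold: ASCII only. */
static uint16_t fold_unit(uint16_t c)
{
    return (c >= 'a' && c <= 'z') ? (uint16_t)(c - 'a' + 'A') : c;
}

/* Default path: convert to UTF-16LE with iconv(), fold each 16-bit
 * unit, and hash with FNV-1a. Returns 0 on success, -1 if the
 * conversion fails or the result does not fit the local buffer. */
int default_caseless_index(const char *charset, const char *s,
                           uint32_t *index)
{
    char buf[512];
    char *in = (char *)s;
    char *out = buf;
    size_t inleft, outleft = sizeof(buf), n, i, rc;
    uint32_t h = 2166136261u;
    iconv_t cd;

    assert(charset != NULL && s != NULL && index != NULL);
    inleft = strlen(s);

    cd = iconv_open("UTF-16LE", charset);
    if (cd == (iconv_t)-1)
        return -1;
    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    n = sizeof(buf) - outleft;
    for (i = 0; i + 1 < n; i += 2) {
        uint16_t c = (uint16_t)((unsigned char)buf[i] |
                                ((unsigned char)buf[i + 1] << 8));
        c = fold_unit(c);
        h = (h ^ (uint32_t)(c & 0xff)) * 16777619u;
        h = (h ^ (uint32_t)(c >> 8)) * 16777619u;
    }
    *index = h;
    return 0;
}
```

A charset with its own fast caseless_index() would fill in the method
pointer; everything else would go through the conversion path above.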
One thing we need to test is whether any of the UTF-16 characters that
cannot be represented in UCS-2 have case. We will need a new
smbtorture test program that works with UTF-16 to test this.
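
To build input for such a test, code points above U+FFFF have to be
written as surrogate pairs. The helper below is a sketch of that
encoding step only; its name is hypothetical and it is not part of
smbtorture. One candidate pair to try is the Deseret block, which
Unicode defines as cased (for example U+10400 upper, U+10428 lower).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a code point above U+FFFF as a UTF-16 surrogate pair.
 * Returns 2 (units written) on success, 0 if cp is outside
 * U+10000..U+10FFFF and therefore needs no surrogate pair. */
int utf16_encode_supplementary(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

A test could create a file under the upper-case name, open it by the
lower-case name, and see whether the server treats the two as equal.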
Monyo, can you or someone else in the Japanese Samba group write a
caseless_index() function for Big5? If you can also write a
case-insensitive strcmp() function directly in Big5 that would be
great as well. I think we should put a

    compare_string(const char *, const char *);

function in "struct charset_functions" as well. Then we will extend
the lib/iconv.c API to implement string comparison.
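
A rough sketch of a direct Big5 comparison, to show why a plain
byte-wise strcasecmp() is not enough: Big5 trail bytes can fall in the
ASCII letter range, so they must not be case folded. The function name,
the int return type and the lead-byte range 0x81..0xFE are assumptions
(Big5 variants differ), and only ASCII letters are folded here; the
fullwidth letters encoded in Big5 are compared exactly.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lead-byte range for Big5 double-byte characters. */
static int is_big5_lead(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

static unsigned char ascii_upper(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
}

/* Case-insensitive compare working directly on Big5 bytes.
 * Returns <0, 0 or >0 like strcmp(). */
int big5_compare_string(const char *s1, const char *s2)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    assert(s1 != NULL && s2 != NULL);
    while (*a && *b) {
        if (is_big5_lead(*a) || is_big5_lead(*b)) {
            /* Double-byte character: both bytes must match exactly. */
            if (*a != *b)
                return (int)*a - (int)*b;
            if (a[1] != b[1])
                return (int)a[1] - (int)b[1];
            if (a[1] == 0)
                return 0; /* both strings end in a truncated character */
            a += 2;
            b += 2;
        } else {
            /* Single-byte character: fold ASCII case. */
            unsigned char ca = ascii_upper(*a);
            unsigned char cb = ascii_upper(*b);
            if (ca != cb)
                return (int)ca - (int)cb;
            a++;
            b++;
        }
    }
    return (int)*a - (int)*b;
}
```

With this, "\xA4\x61" and "\xA4\x41" compare as different characters,
where a naive byte-wise caseless compare would call them equal.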
We should probably create a separate directory source/lib/iconv/ and
start putting the charset specific functions in there. For example we
could have source/lib/iconv/big5.c and source/lib/iconv/utf8.c plus
whatever other direct character set modules are important. Obviously
we will still support all character sets that normal iconv() supports
via conversion, but I think it makes sense to make non-converting
functions for the most widely used (and complex) charsets like Big5.
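
One way the split between direct modules and the conversion path could
be wired up, as a sketch only. The table, the lookup function and the
stub comparison functions are hypothetical stand-ins for whatever
big5.c and utf8.c would export; a NULL result means "no direct module,
fall back to iconv() conversion".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical, reduced version of the struct for this example. */
struct charset_functions {
    const char *name;
    int (*compare_string)(const char *, const char *);
};

/* Stand-ins for the real per-charset functions. */
static int big5_compare_stub(const char *a, const char *b)
{
    return strcmp(a, b);
}

static int utf8_compare_stub(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Charsets that get non-converting implementations. */
static const struct charset_functions direct_modules[] = {
    { "BIG5", big5_compare_stub },
    { "UTF8", utf8_compare_stub },
};

/* Return the direct module for a charset name, or NULL if the
 * charset should be handled by converting through iconv(). */
const struct charset_functions *find_direct_charset(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(direct_modules) / sizeof(direct_modules[0]); i++) {
        if (strcasecmp(direct_modules[i].name, name) == 0)
            return &direct_modules[i];
    }
    return NULL;
}
```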
No rush on this of course, Samba4 is still a fair way off!