i18n question.

Wed Mar 17 16:21:18 GMT 2004

trige at samba.org wrote:

|The key is to realise that it is quite common for pathname components
|(and sometimes whole path names) in Asian languages like Japanese and
|Chinese to be completely caseless. By "caseless" I mean that every
|character in the name has no uppercase or lowercase pair. 

If by "Japanese" you mean the natural language, this is true.

But in practice some of the characters used in Japanese text do have
case. For example, U+FF21 (Fullwidth Latin Capital Letter A) and
U+FF41 (Fullwidth Latin Small Letter A) are an uppercase/lowercase pair.

The Japanese versions of the Windows NT series treat those 2 characters
as the same (but Windows 9x does not ... :-( ).
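
To make the pairing concrete, here is a minimal sketch of a fold that
maps both ASCII and fullwidth Latin capitals to their small forms. The
name fold_case is hypothetical, and this is not meant to reproduce the
actual Windows NT case table; it only relies on the fact that the
fullwidth capitals U+FF21..U+FF3A sit 0x20 below the fullwidth small
letters U+FF41..U+FF5A.

```c
#include <stdint.h>

/* Hypothetical helper: fold one Unicode code point to "lowercase".
 * Only ASCII A-Z and fullwidth Latin A-Z are handled; every other
 * code point is returned unchanged. */
uint32_t fold_case(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;          /* ASCII capital -> small */
    if (cp >= 0xFF21 && cp <= 0xFF3A)
        return cp + 0x20;          /* fullwidth capital -> fullwidth small */
    return cp;
}
```

With this fold, U+FF21 and U+FF41 compare equal, while Hiragana,
Katakana and Kanji pass through untouched.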

|So, I think what we need to do is write a function like this:
|
| int caseless_index(const char *);

This idea can still be useful, but for Japanese we cannot simply
assume that non-ASCII characters are caseless.
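
A minimal sketch of such a function, under several assumptions that
are mine and not part of the proposal: the name is UTF-8, the return
value is the byte offset of the first character that may have case
(the string length if there is none), and a character counts as
caseless only if it is an ASCII non-letter, Hiragana, Katakana or a
CJK Unified Ideograph. Everything else, including the fullwidth
letters, is conservatively treated as possibly having case.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence (1 to 3 bytes) at s. Returns its length in
 * bytes, or 0 for malformed, truncated or 4-byte sequences. Overlong
 * forms are not rejected; this is only a sketch. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        if ((s[1] & 0xC0) != 0x80) return 0;
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    return 0;
}

/* Assumed whitelist of code points known to have no case pair. */
static int is_known_caseless(uint32_t cp)
{
    if (cp < 0x80)
        return !((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'));
    return (cp >= 0x3040 && cp <= 0x30FF)    /* Hiragana, Katakana */
        || (cp >= 0x4E00 && cp <= 0x9FFF);   /* CJK Unified Ideographs */
}

/* Byte offset of the first character that may have case. */
int caseless_index(const char *name)
{
    const unsigned char *s = (const unsigned char *)name;
    size_t i = 0;

    while (s[i] != '\0') {
        uint32_t cp;
        size_t n = utf8_decode(s + i, &cp);
        if (n == 0 || !is_known_caseless(cp))
            break;                 /* unknown or cased: stop here */
        i += n;
    }
    return (int)i;
}
```

A whitelist errs on the safe side: an unlisted character only costs a
slower comparison, while a blacklist that forgets a cased character
(fullwidth Latin, or the Greek and Cyrillic letters) would give wrong
matches.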

|While I have been told that it is common for filenames in Chinese and
|Japanese to be purely caseless, 
                ~~~~~~ <-- mostly for Japanese
|they often still have the old DOS 3
|letter extensions (like .doc, .xls, .txt etc).

Yes. While we often use Japanese in the basename of a filename,
extensions are mostly written in ASCII.

|There are a number of ways we can handle these:
|
|1) there are only 8 possible case combinations for a 3 letter
|   extension. We could call stat() on all 8, and avoid the directory
|   scan. This will be a win for large directories and a loss for small
|   directories. We might need a heuristic to decide which method to
|   use.
|
|2) The extension contains minimal information. I think it would be
|   reasonable for many applications to force the case on the 3 letter
|   extension to lowercase, and then assume that only filenames with
|   lowercase 3 letter extensions exist. That makes it a single stat().
|
|3) we could do what we do now, which is to do a full directory scan,
|   but we could have an accelerator caseless comparison function that
|   compares the leading part of the string which is caseless (using
|   memcmp()) and only check the case-sensitive part if the leading
|   part matches. 
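
Option 3 could be sketched roughly as below. The function name
names_match and the prefix_len parameter (the length in bytes of the
caseless leading part of the wanted name) are hypothetical, and
strcasecmp() only folds ASCII in the usual locales, so a real version
would need a case-insensitive comparison that also knows about
non-ASCII cased characters.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Returns 1 if candidate matches wanted, 0 otherwise.
 * The first prefix_len bytes of wanted are assumed to be caseless, so
 * they are compared byte for byte; only the tail is compared
 * case-insensitively. */
int names_match(const char *wanted, size_t prefix_len, const char *candidate)
{
    /* Fast path: reject most directory entries with a cheap memcmp(). */
    if (strlen(candidate) < prefix_len)
        return 0;
    if (memcmp(wanted, candidate, prefix_len) != 0)
        return 0;

    /* Slow path: only the case-sensitive remainder, e.g. ".txt". */
    return strcasecmp(wanted + prefix_len, candidate + prefix_len) == 0;
}
```

During a directory scan this would be called once per entry, with
prefix_len computed once for the name being looked up.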

As you said, many applications force the 3 letter extension to
lowercase. And most of the rest would force it to uppercase, I
think.

So using 1)

|1) there are only 8 possible case combinations for a 3 letter
|   extension. 
|We could call stat() on all 8, and avoid the directory

, first trying all 3 letters in lowercase and second all in
uppercase, most extensions would be matched by those 2 cases.
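
A rough sketch of that probe order: all lowercase first, all uppercase
second, then the 6 mixed forms. The names ext_variant, probe_order and
find_by_extension are hypothetical; stem is assumed to be the path up
to and including the dot, and the extension is assumed to be exactly 3
ASCII characters.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Write into out the variant of ext selected by mask: bit i set means
 * character i is uppercased, bit i clear means it is lowercased. */
void ext_variant(const char *ext, unsigned mask, char *out)
{
    size_t i;

    for (i = 0; ext[i] != '\0'; i++) {
        unsigned char c = (unsigned char)ext[i];
        out[i] = (char)(((mask >> i) & 1u) ? toupper(c) : tolower(c));
    }
    out[i] = '\0';
}

/* Mask 0 is all lowercase, mask 7 is all uppercase; try those first. */
static const unsigned probe_order[8] = { 0, 7, 1, 2, 3, 4, 5, 6 };

/* Returns 1 and leaves the matching path in found if some case variant
 * of stem+ext exists, 0 otherwise. At most 8 stat() calls are made. */
int find_by_extension(const char *stem, const char *ext,
                      char *found, size_t found_size)
{
    char variant[4];
    struct stat st;
    size_t k;

    if (strlen(ext) != 3)
        return 0;
    for (k = 0; k < 8; k++) {
        int n;

        ext_variant(ext, probe_order[k], variant);
        n = snprintf(found, found_size, "%s%s", stem, variant);
        if (n < 0 || (size_t)n >= found_size)
            return 0;              /* path does not fit in the buffer */
        if (stat(found, &st) == 0)
            return 1;
    }
    return 0;
}
```

If the guess about applications holds, the common case costs 1 or 2
stat() calls, and the remaining 6 are only a fallback.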

|I think the above schemes will allow Samba to be _very_ fast for
|Japanese and Chinese character sets.

While some characters do have case, as I said, your idea is worth
considering even for Japanese.

|While I am here, I would like help from someone to convert a NBENCH
|load file from English characters to Japanese or Chinese. That will
|give us a benchmark to use for speed comparisons.

Yasuma, how about this?

P.S.

I think this is good, too.

-----
TAKAHASHI, Motonobu (monyo)                    monyo at home.monyo.com
                                               http://www.monyo.com/