i18n question.

Thu Mar 18 00:02:30 GMT 2004

Monyo,

 > But actually some of Japanese characters(scripts) are case
 > sensitive. For example U+FF21 (Fullwidth Latin Capital Letter A) and
 > U+FF41 (Fullwidth Latin Small Letter A).

Interesting. Is this rare? If you have 1000 Japanese filenames on a
filesystem, how many of them would contain case-sensitive characters
like this?

 > This idea can be still useful, but for Japanese we cannot simply
 > assume that non-ASCII chars are case insensitive.

That's fine; it just means that the caseless_index() function needs to
be a bit more complex. I suspect it will still be a big win.
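
To make "a bit more complex" concrete, here is a rough sketch of the
kind of per-character check involved. It is not the actual
caseless_index() code, whose exact semantics are not spelled out in
this message. The names has_case() and first_cased_index() are
hypothetical, the filename is assumed to be already decoded into
Unicode code points, and only ASCII and the fullwidth Latin letters
mentioned above are treated as cased.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if code point cp has distinct upper and
 * lower case forms. Only two ranges are covered in this sketch; a real
 * table would also need Latin-1, Greek, Cyrillic and so on. */
static int has_case(uint32_t cp)
{
    if ((cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A))
        return 1;                          /* ASCII A-Z, a-z */
    if ((cp >= 0xFF21 && cp <= 0xFF3A) ||  /* fullwidth A-Z */
        (cp >= 0xFF41 && cp <= 0xFF5A))    /* fullwidth a-z */
        return 1;
    return 0;
}

/* Hypothetical helper: index of the first cased code point in
 * name[0..len), or -1 if every code point is caseless. */
static ptrdiff_t first_cased_index(const uint32_t *name, size_t len)
{
    assert(name != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (has_case(name[i]))
            return (ptrdiff_t)i;
    }
    return -1;
}
```

A name made only of hiragana gives -1 here, while a name containing
U+FF21 gives the position of that character.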

 > |1) there are only 8 possible case combinations for a 3 letter
 > |   extension. 
 > |We could call stat() on all 8, and avoid the directory
 > 
 > , first assuming all the 3 letters are lowercase and second are
 > uppercase, most of extensions would be matched in those 2 cases.

That's not how it works. If the filename does exist then 99% of the
time we will find it on the first stat() call, either through a guess
or via the "stat cache" code.
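
For reference, the eight-combination idea quoted above could look
roughly like the sketch below. This is only an illustration, not code
from the tree: ext_case_variants() and stat_any_variant() are
hypothetical names, the extension is assumed to be exactly three ASCII
letters, and the part of the path before the dot is assumed to be in
its correct on-disk case already.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>

/* Fill out[0..7] with every upper/lower combination of a 3-letter
 * ASCII extension. Bit i of mask selects upper case for letter i. */
static void ext_case_variants(const char ext[3], char out[8][4])
{
    assert(ext != NULL);
    for (unsigned mask = 0; mask < 8; mask++) {
        for (int i = 0; i < 3; i++) {
            unsigned char c = (unsigned char)ext[i];
            out[mask][i] = (mask & (1u << i)) ? (char)toupper(c)
                                              : (char)tolower(c);
        }
        out[mask][3] = '\0';
    }
}

/* Return 1 if base.<variant> exists for any case variant of ext,
 * 0 if all eight stat() calls fail, -1 if the path does not fit. */
static int stat_any_variant(const char *base, const char ext[3])
{
    char variants[8][4];
    char path[1024];
    struct stat st;

    ext_case_variants(ext, variants);
    for (int i = 0; i < 8; i++) {
        int n = snprintf(path, sizeof(path), "%s.%s", base, variants[i]);
        if (n < 0 || (size_t)n >= sizeof(path))
            return -1;
        if (stat(path, &st) == 0)
            return 1;
    }
    return 0;
}
```

When all eight calls fail, no case variant of the extension exists for
that base name, which is the part that matters for the missing-file
case.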

The interesting case is where the file doesn't exist, and that is the
case that I am trying to improve with this scheme. About half the time
when a Windows client tries to open a file the filename does not
exist. The problem is proving with absolute certainty that it doesn't
exist. For English filenames that means scanning the directory.

I hope that with this scheme we can avoid the scan even for files that
do not exist, as long as the filename uses caseless characters. I am
hoping that will be common enough in Japanese and Chinese to be
worthwhile.

 > |While I am here, I would like help from someone to convert a NBENCH
 > |load file from English characters to Japanese or Chinese. That will
 > |give us a benchmark to use for speed comparisons.
 > 
 > Yasuma, How about this?

See my separate reply about this.

Cheers, Tridge