i18n question.

Tue Mar 16 02:16:53 GMT 2004

Monyo,

I'm currently in Hong Kong (supposedly on holiday!), and I've met up
with some of the Hong Kong Linux user group to discuss charset
support. While talking with them I realised that with some fairly
small changes we can make Japanese and Chinese Samba actually much
_faster_ than Samba in English.

The key is to realise that it is quite common for pathname components
(and sometimes whole path names) in Asian languages like Japanese and
Chinese to be completely caseless. By "caseless" I mean that every
character in the name has no uppercase or lowercase pair. 

This means that for those path components the case-insensitive search
that Samba does all the time to convert from case-sensitive POSIX
filesystem semantics to case-insensitive Windows semantics is
pointless. If every character in a component of a pathname is caseless
then the only thing we need to do is a single stat(); we do not need
to do a directory scan. Fixing this would also greatly reduce the number of
times we need to call our case insensitive string comparison routine,
which in turn would greatly reduce the number of times that the string
would need to be converted between charsets.

So, I think what we need to do is write a function like this:

 int caseless_index(const char *);

It would return the index (in bytes) of the first character that is
not caseless in the string. If all characters are caseless then it
would return -1. So for example:

   caseless_index("abc") == 0
   caseless_index("12d3") == 2
   caseless_index("1234") == -1
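To make the idea concrete, here is a rough sketch of caseless_index()
for one particular case. It assumes the string is UTF-8 (my assumption,
a real version would be per-charset) and it is deliberately
conservative: only ASCII non-letters, kana and the main CJK ideograph
block are treated as caseless, and anything else (including malformed
or 4-byte sequences) is treated as possibly having case. The helper
names utf8_decode() and is_known_caseless() are made up for this
sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence (1 to 3 bytes) at s. Stores the code point
 * in *cp and returns its length in bytes, or 0 if the sequence is
 * malformed or longer than 3 bytes. Overlong forms are not rejected. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) |
              (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    return 0;
}

/* Conservative test: returns 1 only for code points this sketch knows
 * to be caseless, 0 for everything else. */
static int is_known_caseless(uint32_t cp)
{
    if (cp < 0x80)
        return !((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'));
    return (cp >= 0x3040 && cp <= 0x30FF) ||  /* Hiragana and Katakana */
           (cp >= 0x4E00 && cp <= 0x9FFF);    /* CJK Unified Ideographs */
}

/* Byte index of the first character not known to be caseless,
 * or -1 if every character in the string is known to be caseless. */
int caseless_index(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    while (p[i] != '\0') {
        uint32_t cp;
        size_t n = utf8_decode(p + i, &cp);
        if (n == 0 || !is_known_caseless(cp))
            return (int)i;
        i += n;
    }
    return -1;
}
```

With this sketch a name made of two kanji followed by ".doc" gives 7:
six bytes of ideographs, then the dot, then the 'd'.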

Now comes the really interesting part ...

While I have been told that it is common for filenames in Chinese and
Japanese to be purely caseless, they often still have the old DOS 3
letter extensions (like .doc, .xls, .txt etc). There are a number of
ways we can handle these:

1) there are only 8 possible case combinations for a 3 letter
   extension. We could call stat() on all 8, and avoid the directory
   scan. This will be a win for large directories and a loss for small
   directories. We might need a heuristic to decide which method to
   use.

2) The extension contains minimal information. I think it would be
   reasonable for many applications to force the case on the 3 letter
   extension to lowercase, and then assume that only filenames with
   lowercase 3 letter extensions exist. That makes it a single stat().

3) we could do what we do now, which is to do a full directory scan,
   but we could have an accelerator caseless comparison function that
   compares the leading part of the string which is caseless (using
   memcmp()) and only checks the remaining part, which may contain
   cased characters, if the leading part matches.
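Option (1) could look roughly like the following. This is only a
sketch: the names apply_case_mask() and stat_ext_variants() are
invented here, it assumes the extension is plain ASCII, and it leaves
out the heuristic for choosing between this and a directory scan.

```c
#include <ctype.h>
#include <string.h>
#include <sys/stat.h>

/* Rewrite ext[0..len) as case variant number `mask`:
 * bit i set means uppercase byte i, bit i clear means lowercase it.
 * Bytes that are not ASCII letters are left unchanged by
 * toupper()/tolower() in the "C" locale. */
static void apply_case_mask(char *ext, size_t len, unsigned mask)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)ext[i];
        ext[i] = (mask & (1u << i)) ? (char)toupper(c) : (char)tolower(c);
    }
}

/* Try stat() on every case variant of a short extension.
 * `path` must be writable; ext_off is the byte offset of the extension
 * (the part after the dot). Returns 0 with the matching name left in
 * `path`, or -1 if no variant exists or the extension is longer than
 * 3 bytes (in which case the caller should fall back to a scan). */
int stat_ext_variants(char *path, size_t ext_off, struct stat *st)
{
    size_t len = strlen(path + ext_off);

    if (len > 3)
        return -1;
    for (unsigned mask = 0; mask < (1u << len); mask++) {
        apply_case_mask(path + ext_off, len, mask);
        if (stat(path, st) == 0)
            return 0;
    }
    return -1;
}
```

For a 3 letter extension the loop makes at most 8 stat() calls,
starting with the all-lowercase form, which is probably the most
common one.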

Note that the accelerator in (3) applies to _all_ caseless string
comparisons in Samba, not just pathname comparisons. This is really the
equivalent accelerator to the 7 bit accelerator we have now, except
that it works well with mostly-caseless languages (like most Asian
languages I believe).
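A minimal sketch of the accelerator in (3) follows. To keep it
self-contained it uses a stand-in caseless_index() that only knows
about ASCII (bytes >= 0x80 are conservatively treated as possibly
cased), and it uses POSIX strcasecmp() in place of Samba's real
charset-aware comparison. Both are assumptions for illustration, as
is the name strequal_accel(). A real charset module would let the
caseless prefix run across the Japanese or Chinese characters.

```c
#include <string.h>
#include <strings.h>  /* strcasecmp() (POSIX) */

/* Stand-in: byte index of the first byte that might have case,
 * or -1 if the whole string is ASCII with no letters. */
static int caseless_index(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 0x80 || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
            return (int)i;
    }
    return -1;
}

/* Length in bytes of the leading caseless part of s. */
static size_t caseless_prefix_len(const char *s)
{
    int idx = caseless_index(s);
    return idx < 0 ? strlen(s) : (size_t)idx;
}

/* Returns 1 if a and b are equal ignoring case, 0 otherwise.
 * The caseless prefixes are compared with memcmp(); the slower
 * case-insensitive comparison only runs on the tails, and only
 * when the prefixes already match. */
int strequal_accel(const char *a, const char *b)
{
    size_t na = caseless_prefix_len(a);
    size_t nb = caseless_prefix_len(b);

    /* Different prefix lengths mean a caseless character lines up
     * against a cased character or the end of the other string. */
    if (na != nb || memcmp(a, b, na) != 0)
        return 0;
    return strcasecmp(a + na, b + nb) == 0;
}
```

In a directory scan most candidates fail at the memcmp(), so the
expensive path is only taken for names that are already near matches.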

To implement this I suggest we extend the "struct charset_functions"
structure in the Samba built-in iconv implementation (see lib/iconv.c)
to have optional function pointers for caseless_index() and perhaps
other accelerator related functions. 

If a charset module does not provide these functions then we will fall
back to emulating them via conversion to UTF-16 and the existing case
table (which can easily be made to work with UTF-16, even though it
was meant for UCS-2, thanks to the fact that non-UCS-2 characters tend
to be caseless).
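The optional-pointer-with-fallback arrangement might be sketched like
this. The struct below is a cut-down illustration and does not
reproduce the real "struct charset_functions" fields; the names
charset_functions_sketch, charset_caseless_index() and
toy_caseless_index() are all hypothetical. The fallback here is an
ASCII-only stand-in for the UTF-16 plus case table emulation.

```c
#include <stddef.h>

/* Illustrative only: a charset module description with an optional
 * accelerator hook. A module that has no fast path leaves it NULL. */
struct charset_functions_sketch {
    const char *name;
    int (*caseless_index)(const char *s);  /* optional, may be NULL */
};

/* Stand-in for the generic fallback. Conservative: only ASCII
 * non-letters are treated as caseless. */
static int fallback_caseless_index(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        int caseless = c < 0x80 &&
            !((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));
        if (!caseless)
            return (int)i;
    }
    return -1;
}

/* Toy module hook used only to show the dispatch: it claims that
 * every string is entirely caseless. */
int toy_caseless_index(const char *s)
{
    (void)s;
    return -1;
}

/* Use the module's accelerator if it has one, else the fallback. */
int charset_caseless_index(const struct charset_functions_sketch *cs,
                           const char *s)
{
    if (cs != NULL && cs->caseless_index != NULL)
        return cs->caseless_index(s);
    return fallback_caseless_index(s);
}
```

Callers never need to know whether a module supplied its own
function, so modules can gain accelerators one at a time.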

I think the above schemes will allow Samba to be _very_ fast for
Japanese and Chinese character sets.

While I am here, I would like help from someone to convert an NBENCH
load file from English characters to Japanese or Chinese. That will
give us a benchmark to use for speed comparisons.

Cheers, Tridge