i18n question.

Sat Mar 6 05:10:38 GMT 2004

Kenichi,

 >    Having internal character set in UTF-8 and treating such case
 >    insensitiveness is hard. I do agree that as long as character
 >    fits within ASCII, UTF-8 is as easy as UCS2. But once outside
 >    ASCII, UTF-8 is nightmare, for they may ask for length change
 >    when we changed character from one to other.
 >    # We had similar nightmare in EUC and JIS. So we know this is
 >    # hard. And worst of all is, there was no silver bullet.

UTF-8 is less of a nightmare than any other solution. 

 >    Hence, I'd like to suggest that 'INTERNAL character code' should
 >    be something like UCS2, fixed length per character. Windows have
 >    selected UCS2 as character set which is easiest for them to
 >    manipulate. That means, as long as we use UCS2 for internal code,
 >    we will not face big problem.

Nope, UCS2 is dead, even in Windows land.

Microsoft have re-defined all of their UCS2 interfaces as being
UTF-16. That means they are now variable length: a character outside
the Basic Multilingual Plane takes two 16-bit units. Samba hasn't
caught up with this change yet (we still treat it as UCS2) but we will
need to cope with this soon. Any moves towards any fixed length
charset in Samba are pointless after this.
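
To make the variable-length point concrete, here is a minimal sketch of
UTF-16 encoding for a single code point. The function name
utf16_encode is made up for illustration and is not a Samba function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Samba.
 * Encode one Unicode code point as UTF-16 code units.
 * Returns the number of 16-bit units written (1 or 2), or 0 if the
 * value is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
	if (cp >= 0xD800 && cp <= 0xDFFF) {
		return 0;	/* surrogate values are not characters */
	}
	if (cp > 0x10FFFF) {
		return 0;	/* beyond the Unicode range */
	}
	if (cp < 0x10000) {
		out[0] = (uint16_t)cp;	/* BMP: one unit, same as UCS2 */
		return 1;
	}
	/* Outside the BMP: split the 20 remaining bits into a pair */
	cp -= 0x10000;
	out[0] = (uint16_t)(0xD800 | (cp >> 10));	/* high surrogate */
	out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));	/* low surrogate */
	return 2;
}
```

Anything below U+10000 still takes one unit, which is why code that
treats the data as UCS2 appears to work until a character outside the
BMP shows up.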

I am quite sure that the current design in Samba3 and Samba4 is the
correct design for us. There are bugs certainly, and there are minor
improvements we can make, but the basic design is sound. The design is
this:

 * internally store all strings in "unix charset". Default this to
   UTF-8. Convert to/from UTF-8 at wire boundary.

 * code assumes "sane" properties for "unix charset". This
   includes:
     + "ascii compatible" so C string constants work
     + null terminated

I also think the accelerators are OK. They assume that strings that
_only_ contain 7 bit bytes can be compared in a case-insensitive
manner using a 7 bit table. If this doesn't work for some charset then
we can add loadable hooks that allow these functions to be replaced.
(Are there really any charsets we care about that don't obey this
rule? Read the rule very carefully before answering.)
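
As a rough sketch of what such an accelerator looks like (the names
is_7bit and strequal_7bit are invented for this example, and the
arithmetic in ascii_upper stands in for the 7 bit table):

```c
#include <stddef.h>

/* Hypothetical helpers for illustration, not the Samba functions. */

/* Return 1 if every byte in s has the top bit clear. */
int is_7bit(const char *s)
{
	for (; *s != '\0'; s++) {
		if ((unsigned char)*s & 0x80) {
			return 0;
		}
	}
	return 1;
}

/* Upper-case a single 7 bit byte; stands in for a 128 entry table. */
unsigned char ascii_upper(unsigned char c)
{
	return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
}

/* Fast path for case-insensitive equality.
 * If both strings are pure 7 bit, sets *handled to 1 and returns 1
 * when they match, 0 when they differ.
 * Otherwise sets *handled to 0 and returns 0; the caller must then
 * fall back to the slow, charset-aware comparison. */
int strequal_7bit(const char *a, const char *b, int *handled)
{
	if (!is_7bit(a) || !is_7bit(b)) {
		*handled = 0;
		return 0;
	}
	*handled = 1;
	while (*a != '\0' && *b != '\0') {
		if (ascii_upper((unsigned char)*a) !=
		    ascii_upper((unsigned char)*b)) {
			return 0;
		}
		a++;
		b++;
	}
	return *a == *b;	/* equal only if both ended together */
}
```

The important part is that the fast path refuses to answer as soon as
either string contains a byte with the top bit set, so it never makes
a claim about anything outside 7 bit.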

On the other hand, code that searches for '/' or '\' in a string and
assumes that any matching byte is the start of a character is more of
a problem. The solution is definitely not to switch away from
"char *". Instead we need to define a clean function that does
directory parsing correctly and use this function everywhere it's
needed.
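
To show the kind of bug I mean, here is a sketch that uses Shift-JIS
as the example charset, where the second byte of a two byte character
can be 0x5C, the same value as '\'. The names sjis_char_len and
find_separator are hypothetical, and a real version would ask the
configured "unix charset" for the character length rather than
hard-coding one charset.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of the Shift-JIS character that
 * starts at s. Lead bytes are 0x81-0x9F and 0xE0-0xFC; everything
 * else is a single byte. A lead byte at the very end of the string
 * is treated as one byte so we never step past the terminator. */
size_t sjis_char_len(const unsigned char *s)
{
	unsigned char c = s[0];
	int lead = (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);

	return (lead && s[1] != '\0') ? 2 : 1;
}

/* Find the first real separator in path, stepping one character at
 * a time so that a trail byte equal to sep is never matched.
 * Returns a pointer into path, or NULL if there is no separator. */
const char *find_separator(const char *path, char sep)
{
	const unsigned char *p = (const unsigned char *)path;

	while (*p != '\0') {
		if (*p == (unsigned char)sep) {
			return (const char *)p;
		}
		p += sjis_char_len(p);
	}
	return NULL;
}
```

With the bytes 0x95 0x5C (one Shift-JIS character) followed by "\dir",
a plain strchr() for '\' stops at offset 1, in the middle of the
character, while find_separator() returns offset 2. In UTF-8 this
particular problem does not arise, because bytes below 0x80 never
appear inside a multibyte sequence, but the code still has to be right
for the other charsets people configure.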

I'd also like to point out why we will not separate the internal
charset from the filesystem charset. For each SMB request we make
more than one system call (ranging from 2 or 3 up to hundreds in some
cases). This "multiplier effect" is what makes converting to the
filesystem charset at the wire boundaries so advantageous. 

Cheers, Tridge