i18n question.

Sat Mar 6 04:19:19 GMT 2004

Dear Andrew,

>>>>> "Andrew" == Andrew Bartlett <abartlet at samba.org> writes:
>> Pointout: UTF16 is not UCS2. What we really need is not UTF16->UTF8,
>> but is UCS2->UTF8 ( and vice versa, ofcourse ).
Andrew> MS has redefined their use of USC2 to be UTF16.  I'm not sure how well
Andrew> they have done that, but their current interface doco claims to be using
Andrew> UTF16.

I didn't know that...
I wonder whether MS really understands what that means...

>> We do call stat() many times, but we call stat() against "same
>> string" many times.
Andrew> I think this is inventing a solution for a non-problem.  Why make every
Andrew> single system call need to consult a character conversion cache, when we
Andrew> can instead convert to UTF8 at the wire interfaces?

Because UTF-8 cannot easily handle case-insensitive search of Russian
or Greek characters.

Because the character encodings used for Unix I/O are not only UTF-8.

Those two problems together make the conversion cache a requirement.
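
To make the first point concrete, here is a minimal Python sketch (not
Samba code). It assumes a simple one-to-one case mapping for the basic
Cyrillic range; the helper names ascii_fold_bytes and fold_code_units
are hypothetical and exist only for this illustration.

```python
# Cyrillic capital ER (U+0420) and small er (U+0440).
upper, lower = "\u0420", "\u0440"

# In UTF-8 the two letters differ in both bytes, including the lead byte.
utf8_upper = upper.encode("utf-8")   # b'\xd0\xa0'
utf8_lower = lower.encode("utf-8")   # b'\xd1\x80'


def ascii_fold_bytes(data: bytes) -> bytes:
    """Fold only ASCII A-Z, the way a byte-oriented comparison would."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in data)


def fold_code_units(units):
    """Fold fixed-width 16-bit units: one unit per character, so folding
    is a per-unit lookup (here a fixed offset for A-Z and U+0410..U+042F)."""
    out = []
    for u in units:
        if 0x41 <= u <= 0x5A or 0x0410 <= u <= 0x042F:
            u += 0x20
        out.append(u)
    return out


ucs2_upper = [ord(c) for c in upper]
ucs2_lower = [ord(c) for c in lower]

# Byte-wise folding leaves the UTF-8 bytes unchanged, so they do not match.
print(ascii_fold_bytes(utf8_upper) == utf8_lower)
# Per-unit folding on 16-bit units matches the pair.
print(fold_code_units(ucs2_upper) == ucs2_lower)
```

A UTF-8 implementation can of course decode first and then fold; the
sketch only shows that the folding cannot be done byte by byte.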

>> Q2) I don't see what you mean by "skip UCS2 because this isn't
>> java".
Andrew> Java's 'system call' interface is natively unicode, which means you
Andrew> don't need to think about this nearly as much.

Ah, I see.

But that is only the viewpoint of an ASCII user.

For other languages, UCS2 and UTF16 are as difficult as UTF-8, because
the interfaces require us to do character conversion anyway.

>> Once any string is converted to UCS2, we can treat them just
>> like ascii, except that we do need to care for 16bit length.
Andrew> Except that all the basic system functions, from getpwnam() to printf()
Andrew> etc are not expecting UCS2 inputs.

They are not expecting UTF-8 either. They are expecting ASCII only.

Think. There are many OSes and libraries in Japan that support EUC for
Japanese handling.  What will happen if you pour UTF-8 strings into
those functions?
# As I mentioned in another mail, EUC has 2- and 3-byte characters
# of its own. UTF-8 has such cases too.
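
As a rough illustration of that question, the following Python sketch
encodes the same Japanese text both ways and then reads the UTF-8 bytes
as if they were EUC-JP. It uses Python's built-in "euc_jp" codec as a
stand-in for an EUC-based library; a real legacy C library may fail
differently.

```python
text = "日本語"

utf8_bytes = text.encode("utf-8")     # 3 bytes per character for this text
eucjp_bytes = text.encode("euc_jp")   # 2 bytes per character for this text

# A consumer that assumes EUC-JP misreads UTF-8 input. errors="replace"
# substitutes a marker instead of raising on invalid sequences.
misread = utf8_bytes.decode("euc_jp", errors="replace")

print(len(utf8_bytes), len(eucjp_bytes))
print(misread == text)
```

The byte sequences differ in both length and content, so the text only
survives if it is converted at the boundary.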

What you're saying is simply that you can use those functions with
UTF-8 as long as you use only characters that fit within ASCII.

Any language that uses characters outside ASCII will find that neither
getpwnam() nor printf() works properly for it. And that's not just
Japanese but French, German, etc. too.
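
One way such a failure can show up, even for French: in C the precision
in a "%.Ns" conversion counts bytes, not characters. The Python sketch
below only emulates that truncation with a slice; truncate_like_printf
is a hypothetical helper, not a real library call.

```python
def truncate_like_printf(data: bytes, precision: int) -> bytes:
    """Emulate C's "%.Ns": keep at most `precision` bytes."""
    return data[:precision]


word = "été"                          # French, 3 characters
utf8 = word.encode("utf-8")           # 5 bytes: c3 a9 74 c3 a9
latin1 = word.encode("latin-1")       # 3 bytes: e9 74 e9

cut_utf8 = truncate_like_printf(utf8, 1)      # half of a 2-byte character
cut_latin1 = truncate_like_printf(latin1, 1)  # one whole character

try:
    cut_utf8.decode("utf-8")
    utf8_still_valid = True
except UnicodeDecodeError:
    utf8_still_valid = False

print(utf8_still_valid)
print(cut_latin1.decode("latin-1"))
```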

Well, those libraries may work on Linux. But not on older
(LEGACY) OSes.

Hence, you shouldn't use printf(), except for debugging purposes, if
you want Samba to work on such legacy OSes as well as new ones.

>> Q3) Wasn't UCS2 part of 'C' string from ANSI-C?
>> Or are you saying " 'C' string " in meaning of old K&R ?
Andrew> A 'C' string is a string that is null terminated, with no intermediate
Andrew> NULL bytes, such as you can pass to printf().  USC2 is not compatible
Andrew> with that, as every second character may often be NULL, terminating the
Andrew> string.

... Not in the case of Japanese (^o^) In Japanese, both bytes are non-zero.
# No, I'm just joking about the line above.
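
For reference, the NUL-byte behaviour can be sketched in Python. This
assumes little-endian 16-bit units ("utf-16-le") as the byte layout;
c_strlen is a hypothetical helper that mimics where a byte-oriented C
function would stop reading.

```python
def c_strlen(buf: bytes) -> int:
    """Length as a byte-oriented C function sees it: up to the first NUL."""
    idx = buf.find(b"\x00")
    return len(buf) if idx == -1 else idx


ascii_name = "smb".encode("utf-16-le")     # b's\x00m\x00b\x00'
japanese = "日本".encode("utf-16-le")       # b'\xe5\x65\x2c\x67', no NUL byte
wide_space = "\u3000".encode("utf-16-le")  # b'\x00\x30', starts with NUL

print(c_strlen(ascii_name))   # stops after the first character
print(c_strlen(japanese))     # sees all four bytes
print(c_strlen(wide_space))   # stops immediately
```

The ideographic space (U+3000) shows why the remark about Japanese is
only a joke: some characters used in Japanese text do contain a zero
byte.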

And yes, I see you're thinking about the K&R string.

But no, I don't see the reason you HAVE TO. As I mentioned, we can't
use legacy libraries anyway. We need string conversion in order to
do that. In other words, as long as we are MANIPULATING strings, we
can't rely on existing libraries anyway.

Then, why stick to the 'C' string? There are more disadvantages than
advantages.

best regards,
---- 
Kenichi Okuyama.