Help Please

Kenichi Okuyama okuyamak at dd.iij4u.or.jp
Tue May 22 00:49:11 GMT 2001


Dear Jeremy,

>>>>> "JA" == Jeremy Allison <jeremy at valinux.com> writes:
JA> The I18N points he brings up are interesting, but not relevant
JA> as long as we must interoperate with Microsoft clients (ie.
JA> forever :-). They use ucs2 on the wire, and nothing fancy
JA> we do with ucs4 internally would change that fact, so it's
JA> not useful to consider 4-byte wchar_t's in Samba.

I'm not talking about using a 32-bit internal code such as UCS4.


Did you check what MacOS-X gives you as UTF-8? For example,
character 0x30AC (KATAKANA LETTER GA) can also be written as
0x30AB 0x3099 (KATAKANA LETTER KA + the combining double dots at
the top right), and MacOS-X returns the latter form by default.
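
To make the two spellings concrete, here is a minimal sketch using Python's standard unicodedata module. It is only an illustration, not Samba code; the variable and function names are made up for this example.

```python
import unicodedata

# The two spellings of KATAKANA LETTER GA discussed above.
precomposed = "\u30ac"        # single word: 0x30AC
decomposed = "\u30ab\u3099"   # two words: KA (0x30AB) + combining voiced sound mark (0x3099)

def code_points(text):
    """Return the code points of text as hex strings (hypothetical helper)."""
    return [f"0x{ord(ch):04X}" for ch in text]

# NFD splits the single word into two; NFC joins the two words into one.
nfd_of_ga = unicodedata.normalize("NFD", precomposed)
nfc_of_ka_mark = unicodedata.normalize("NFC", decomposed)

print(code_points(precomposed), code_points(nfd_of_ga))
print(code_points(decomposed), code_points(nfc_of_ka_mark))
print(precomposed == decomposed)  # a plain comparison sees two different strings
```

A plain string comparison treats the two forms as different, which is exactly why they need explicit handling.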

... I seem to have lost the list locally (so I can't give you the
exact code numbers), but there are two ways of writing the character
SMALL E WITH DOUBLE DOTS ON TOP, too. We need to treat the two forms
as equivalent.
# Many Latin-1 characters can be split into two words like this,
# even if we simply forget about Japanese.
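
As a rough check of the Latin-1 remark, the sketch below lists the characters in the range 0x00A0-0x00FF that have a two-word (canonical) decomposition. It assumes the character meant above is LATIN SMALL LETTER E WITH DIAERESIS (0x00EB); canonical_pairs is a name invented for this example.

```python
import unicodedata

def canonical_pairs(first=0x00A0, last=0x00FF):
    """Map each character in [first, last] that has a canonical
    decomposition to its decomposed (NFD) spelling."""
    pairs = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # Entries starting with "<" are compatibility mappings, not canonical ones.
        if decomp and not decomp.startswith("<"):
            pairs[ch] = unicodedata.normalize("NFD", ch)
    return pairs

latin1_pairs = canonical_pairs()

for ch, nfd in latin1_pairs.items():
    spelled = " ".join(f"0x{ord(c):04X}" for c in nfd)
    print(f"0x{ord(ch):04X} {unicodedata.name(ch)} -> {spelled}")
```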

Using UTF-8 will not help in these cases, because UTF-8 only defines
how to convert 0x30AC to one specific byte stream, and 0x30AB 0x3099
to a different one. So, if you want to scan for character 0x30AB, you
have to skip the 0x30AB 0x3099 cases.
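
A small sketch of both points, again in Python and only as an illustration: the two spellings give different UTF-8 byte streams, and a scan for a bare 0x30AB has to look one word ahead. find_standalone is a hypothetical helper, and using the combining class to spot a following mark is one possible way to do it.

```python
import unicodedata

GA_PRECOMPOSED = "\u30ac"
GA_DECOMPOSED = "\u30ab\u3099"

# UTF-8 encodes each code point independently, so the byte streams differ.
utf8_precomposed = GA_PRECOMPOSED.encode("utf-8")
utf8_decomposed = GA_DECOMPOSED.encode("utf-8")
print(utf8_precomposed.hex(" "), "|", utf8_decomposed.hex(" "))

def find_standalone(text, target):
    """Return indexes where target occurs and is NOT followed by a
    combining mark, i.e. where it really stands for itself."""
    hits = []
    for i, ch in enumerate(text):
        if ch != target:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt and unicodedata.combining(nxt):
            continue  # target is the base of a two-word character; skip it
        hits.append(i)
    return hits

# KA, then GA written as KA + mark, then KA again.
sample = "\u30ab" + GA_DECOMPOSED + "\u30ab"
print(find_standalone(sample, "\u30ab"))
```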


Windows 98 and NT seem to be using Unicode version 2, and because of
that they only ever give you the 0x30AC form. But I don't know about
2000, and I think XP will start catching up to Unicode version 3 as
its internal code. So I think we should take more care with them than
simply converting.


Why bother? Because we should not turn 0x30AB 0x3099 into 0x30AC as
the internal code. "Using 0x30AC" means we assume that every character
has a single-word description in the Unicode world, which we can't.

Currently every character described in two words also has a
single-word description, because the Unicode people added new
descriptions but did not add new characters. This will not last for
long. Japanese, Chinese and Korean are planning to have more
characters (we have less than 1/100 of all the characters on
computers yet). We're running out of code space in Unicode, and one
way to solve this problem is to use the space created by two-word
descriptions.


All of the above means we need a "HUGE" character as the internal
code, converting 0x30AC to 0x30AB 0x3099 at the time of
Wire->internal and unix->internal, and doing the necessary conversion
from 0x30AB 0x3099 on internal-><any>. We need a "HUGE" character
because Unicode is still growing and does not seem to be stable, and
we don't want that to affect Samba's internal character code
treatment.
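
One way to picture the "HUGE" character is sketched below. This is an assumption-laden illustration and not a proposal for the actual Samba data structure: each internal character is modelled as a tuple holding a base code point plus any combining marks, and to_internal / from_internal are invented names.

```python
import unicodedata

def to_internal(text):
    """Wire->internal or unix->internal: decompose, then group each base
    code point with the combining marks that follow it."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += (ord(ch),)   # attach the mark to the previous base
        else:
            clusters.append((ord(ch),))  # start a new internal character
    return clusters

def from_internal(clusters, form="NFC"):
    """internal-><any>: flatten, then convert to the form the other side wants."""
    text = "".join(chr(cp) for cluster in clusters for cp in cluster)
    return unicodedata.normalize(form, text)

# Both spellings of GA become the same single internal character.
from_windows = to_internal("\u30ac")
from_macosx = to_internal("\u30ab\u3099")
print(from_windows == from_macosx, from_windows)

# On the way out, a ucs2 client gets 0x30AC again.
wire_bytes = from_internal(from_windows, "NFC").encode("utf-16-le")
print(wire_bytes.hex(" "))
```

With this layout, comparing or scanning names inside the server never has to care which spelling the client used.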


By the way, using UCS4 does not help with the example above. You
simply get 0x000030AC and 0x000030AB 0x00003099.
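
The same thing can be seen with a quick sketch, using Python's utf-32-be codec to stand in for UCS4 (words is a helper name made up here):

```python
precomposed = "\u30ac"
decomposed = "\u30ab\u3099"

def words(data):
    """Split a big-endian UCS4 byte string into 32-bit words, as hex strings."""
    return [f"0x{int.from_bytes(data[i:i + 4], 'big'):08X}"
            for i in range(0, len(data), 4)]

ucs4_precomposed = precomposed.encode("utf-32-be")
ucs4_decomposed = decomposed.encode("utf-32-be")

# Wider words, but still one word versus two.
print(words(ucs4_precomposed))
print(words(ucs4_decomposed))
```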
---- 
Kenichi Okuyama at Tokyo Research Lab. IBM-Japan, Co.
               @Samba Users Group in Japan



