i18n question.
Kenichi Okuyama
okuyamak at dd.iij4u.or.jp
Sat Mar 6 03:50:16 GMT 2004
Dear Michael, and all,
>>>>> "MBA" == Michael B Allen <mba2000 at ioplex.com> writes:
MBA> Just out of curiosity, is the Japanese crowd really not satisfied with
MBA> UTF-8? Is it too slow?
1) There are too many 'already running' systems that use other
encodings ( or should I say, systems that were running before
anyone started talking about Unicode ). We are not newcomers
to the unix world.
This includes moving from old systems to new ones.
Since the file system ( and the tar format ) does not care
about character sets, backups do no character conversion. And
many system administrators do not wish to attempt a conversion
( I must say that Unicode does not REALLY cover the other
Japanese character encodings. The missing characters are rarely
used, but most admins do not wish to bet on their luck ).
In other words, even if handling of the unix file system
character set is slow because it is not UTF-8, we will accept
that disadvantage as long as it really works.
# May Moore's law be with us :-).
I strongly agree that if we were creating a totally new system
that did not need to connect to any existing unix system, we
would suggest using UTF-8. Unfortunately, such cases are rare.
2) Well, it is true that UTF-8 is slow.
When you think about the conversion between '\\' and '/', you
will see that handling that particular case in UTF-8 versus
UCS2 does not make a big difference. But that is not the only
case we have to handle.
For example, case-insensitive search and matching of Russian
characters ( I don't know the correct name for those
characters, so I'll simply call them Russian here. Please
forgive me and let me know the right name ).
Did you know that CP932 has Russian characters in its
character set?
We found that in some versions of Windows, Russian characters
have to be treated case-insensitively. We had patches for this
in 2.2.*, but they were lost when we moved to 3.0.
Keeping the internal character set in UTF-8 while handling such
case insensitivity is hard. I do agree that as long as a
character fits within ASCII, UTF-8 is as easy as UCS2. But
outside ASCII, UTF-8 is a nightmare, because changing one
character to another may change its byte length.
# We had similar nightmares with EUC and JIS, so we know this
# is hard. And worst of all, there was no silver bullet.
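That length change can be shown concretely. A small illustration
( the names are mine, for illustration only ), using LATIN SMALL
LETTER LONG S (U+017F), whose Unicode uppercase mapping is plain
ASCII 'S' (U+0053): in UCS2 the conversion is an in-place store,
while in UTF-8 the character shrinks from two bytes to one, so
the tail of the string has to move.

```c
#include <stddef.h>

/* UCS2: both characters are one 16-bit unit, so upcasing is an
 * in-place store -- no memmove, no reallocation. */
static void ucs2_upcase_long_s(unsigned short *s, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (s[i] == 0x017F)     /* LATIN SMALL LETTER LONG S */
            s[i] = 0x0053;      /* 'S': same width            */
}

/* UTF-8 byte images of the same two characters: the string
 * gets SHORTER when the case changes. */
static const char utf8_long_s[] = "\xC5\xBF";  /* U+017F: 2 bytes */
static const char utf8_cap_s[]  = "S";         /* U+0053: 1 byte  */
```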
Hence, I'd like to suggest that the 'INTERNAL character code'
should be something like UCS2: fixed length per character.
Windows has selected UCS2 as the character set that is easiest
for it to manipulate. That means that as long as we use UCS2
as the internal code, we will not face big problems.
There are many ways we can speed up the conversion between the
internal code and unix IO, like:
- The 'string object' knows both the internal form and the
unix IO byte-array image of the same STRING. The conversion
may be lazy, but once converted, the result is kept as a
cache. Since we call stat() and many other system calls on
the same string many times, the cache should work well.
- Conversion between UCS2 and UTF-8 is very quick, so if you
are using UTF-8, the cost should be very low. This also means
that if you are using plain ASCII, the conversion cost is low.
If you are using CP932 or other encodings... well, we are
already paying a lot. So, as long as UCS2->CP932 is done
directly ( not UCS2->UTF8->CP932 or such ), the cost does not
change that much, and we can afford it.
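A minimal sketch of both ideas above ( all names are mine, for
illustration only ): a string object that lazily caches the unix
IO byte image of its internal UCS2 form, and a UCS2 -> UTF-8
converter used as the conversion routine. The converter handles
only BMP code points -- which is all plain UCS2 can carry anyway
-- and shows that the conversion is just a few shifts and masks
per character, a single byte store for pure ASCII.

```c
#include <stdlib.h>
#include <stddef.h>

/* UCS2 -> UTF-8 for BMP code points: cheap, branch-per-character. */
static size_t ucs2_to_utf8(const unsigned short *in, size_t len, char *out)
{
    size_t o = 0, i;
    for (i = 0; i < len; i++) {
        unsigned short c = in[i];
        if (c < 0x80) {
            out[o++] = (char)c;                          /* 1 byte  */
        } else if (c < 0x800) {
            out[o++] = (char)(0xC0 | (c >> 6));          /* 2 bytes */
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (char)(0xE0 | (c >> 12));         /* 3 bytes */
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}

struct lazy_string {
    const unsigned short *ucs2;  /* internal, authoritative form */
    size_t ucs2_len;             /* length in 16-bit units       */
    char *io_bytes;              /* NULL until first requested   */
};

/* First call converts; later calls hit the cache, so repeated
 * stat()-style lookups on the same name pay for conversion once. */
const char *lazy_string_io(struct lazy_string *ls)
{
    if (ls->io_bytes == NULL) {
        ls->io_bytes = malloc(3 * ls->ucs2_len + 1);  /* worst case */
        ucs2_to_utf8(ls->ucs2, ls->ucs2_len, ls->io_bytes);
    }
    return ls->io_bytes;
}
```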
3) What we really do not want is a complex character handling
system.
You see, in Japan we have EUC, JIS, CP932 and UTF-8 all in use
as unix IO encodings.
CP932 is a mixture of 1-byte ASCII and 2-byte SJIS characters
whose second byte can fall in the ASCII range.
EUC is 1-byte ASCII plus 2- or 3-byte multibyte characters
using only 0x80-0xFF.
JIS uses only 0x00-0x7F, with MODE-CHANGE escape sequences in
between.
You know about UTF-8.
It's a nightmare.
So: we do not want to do '/' <-> '\\' conversion and the like
separately for each code set. We do not want to face the 'unix
IO code' during internal string handling. We want to convert
each string only once, and after conversion, we do not want to
manipulate the IO bytes again.
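A concrete example of why a per-byte '\\' -> '/' sweep must know
the encoding: in CP932 the character U+8868 ( 'hyou', meaning
table ) is encoded as 0x95 0x5C, and its second byte is the same
byte as ASCII '\\'. A naive sweep, sketched below, silently
corrupts the name ( the function name is mine ):

```c
/* Correct for ASCII and for UCS2/UTF-8 separators, but WRONG for
 * CP932, where 0x5C can be the trail byte of a 2-byte character. */
static void naive_slash_sweep(char *s)
{
    for (; *s; s++)
        if (*s == '\\')
            *s = '/';
}
```

Run it over the single CP932 character 0x95 0x5C and you get
0x95 0x2F: the character is destroyed. This is why the conversion
has to live in one encoding-aware place, not be scattered through
the string-handling code.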
As long as the internal code is UCS2, the manipulation code is
easy to debug ( at least easier than facing 4 different
encodings ). And as long as conversion happens at only one
point, it is easy to debug too.
Users will accept a slow system, as long as conversion takes
only 1 or 2 msec per entry. But they will not accept a system
that cannot handle what Microsoft has handled.
So our first-priority wish is a system that is easy to debug,
not a fast system.
Any questions or opinions are welcome.
best regards,
----
Kenichi Okuyama
More information about the samba-technical
mailing list