i18n question.

Kenichi Okuyama okuyamak at dd.iij4u.or.jp
Sat Mar 6 03:50:16 GMT 2004


Dear Michael, and all,

>>>>> "MBA" == Michael B Allen <mba2000 at ioplex.com> writes:
MBA> Just out of curiosity, is the Japanese crowd really not satisfied with
MBA> UTF-8? Is it too slow?

1) There are too many 'already running' systems that use other
   encodings (or should I say, systems that were running before
   anyone started talking about Unicode). We are not newcomers
   to the unix world.

   This includes moving from an old system to a new one.
   Since the file system (and the tar format) does not care about
   character sets, backups perform no character conversion. And
   many system administrators do not wish to attempt a conversion
   (I must say that Unicode does not REALLY cover every other
   Japanese character encoding. The uncovered characters are
   rarely used, but most admins do not wish to bet on their luck).

   In other words, even if handling of the unix file system
   character set is slow because it is not UTF-8, we will accept
   that disadvantage if it really works.
   # May Moore's law be with us :-).

   I strongly agree that if we were creating a totally new system
   that did not need to connect to any existing unix system, we
   would suggest using UTF-8. Unfortunately, such cases are rare.


2) Well, it is true that UTF-8 is slow.

   When you think about the conversion between '\\' and '/', you
   will see that handling such a case in UTF-8 and in UCS2 makes
   no real difference (a sketch follows below). But that is not
   the only case we have to handle.
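
   To illustrate (a minimal sketch, not Samba code): in UTF-8 every
   byte of a multi-byte sequence has its high bit set, so '\\'
   (0x5C) can never appear inside a character, and the separator
   swap is the same trivial scan in both encodings.

      #include <stdint.h>

      /* '\\' -> '/' over a UTF-8 buffer.  A plain byte scan is
       * safe because 0x5C never occurs inside a multi-byte
       * sequence. */
      static void utf8_fix_sep(char *s)
      {
              for (; *s; s++)
                      if (*s == '\\')
                              *s = '/';
      }

      /* The UCS2 version is the same loop over 16-bit units. */
      static void ucs2_fix_sep(uint16_t *s)
      {
              for (; *s; s++)
                      if (*s == (uint16_t)'\\')
                              *s = (uint16_t)'/';
      }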


   For example, case-insensitive search and lookup of Russian
   characters (Cyrillic is, I believe, the proper name, but I'll
   simply call them Russian here; please correct me if I'm wrong).

   Did you know that CP932 has Russian characters in its character
   set?

   We found that in some versions of Windows, Russian characters
   have to be treated case-insensitively. We had patches for this
   in 2.2.*, but they were lost when we moved to 3.0.

   Keeping the internal character set in UTF-8 and handling such
   case insensitivity is hard. I do agree that as long as a
   character fits within ASCII, UTF-8 is as easy as UCS2. But
   outside ASCII, UTF-8 is a nightmare, because the byte length
   may change when we replace one character with another (see the
   sketch below).
   # We had a similar nightmare with EUC and JIS, so we know this
   # is hard. And worst of all, there was no silver bullet.
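
   A minimal example of the length problem (illustrative, not
   Samba code): the Unicode simple case mapping takes U+212A
   KELVIN SIGN, three bytes in UTF-8, down to 'k', one byte, so
   the string cannot be folded in place.

      #include <assert.h>
      #include <string.h>

      int main(void)
      {
              /* U+212A KELVIN SIGN in UTF-8 (3 bytes) and its
               * Unicode lowercase mapping 'k' (1 byte). */
              const char upper[] = "\xE2\x84\xAA";
              const char lower[] = "k";

              /* The fold shrinks the UTF-8 string, so it cannot
               * be rewritten in place.  In UCS2 both forms are
               * exactly one 16-bit unit. */
              assert(strlen(upper) == 3);
              assert(strlen(lower) == 1);
              return 0;
      }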


   Hence, I'd like to suggest that the 'INTERNAL character code'
   should be something like UCS2: fixed length per character.
   Windows has selected UCS2 as the character set that is easiest
   for it to manipulate. That means that as long as we use UCS2
   for the internal code, we will not face big problems.
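
   As a sketch of what fixed width buys us (the table toupper_w[]
   is an assumption standing in for a 64K-entry uppercase table;
   Samba keeps something similar, but this is not the real API):
   case comparison becomes a per-unit loop that never changes any
   string length.

      #include <stdint.h>

      /* Assumed 64K-entry uppercase table, one slot per UCS2
       * code point; illustrative only. */
      extern const uint16_t toupper_w[0x10000];

      /* Case-insensitive comparison of two UCS2 strings.  Every
       * character is exactly one 16-bit unit, so there is no
       * length bookkeeping, and folding could even be done in
       * place. */
      static int strcasecmp_w(const uint16_t *a, const uint16_t *b)
      {
              while (*a && *b) {
                      uint16_t ca = toupper_w[*a++];
                      uint16_t cb = toupper_w[*b++];
                      if (ca != cb)
                              return ca < cb ? -1 : 1;
              }
              if (*a)
                      return 1;
              if (*b)
                      return -1;
              return 0;
      }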

   There are many ways we can speed up the conversion between the
   internal code and unix IO, for example:
      - A 'string object' knows both the internal form and the
        unix IO byte-array image of the same STRING. The
        conversion may be lazy, but once converted, each form is
        kept as a cache (see the sketch after this list).

        Since we call stat() and many other system calls on the
        same string many times, the cache should pay off.

      - Conversion between UCS2 and UTF-8 is very quick, so if you
        are using UTF-8 the cost should be very low. This also
        means that if you are using ASCII, the conversion cost is
        low.

        If you are using CP932 or other encodings... well, we are
        already paying a lot. So, as long as UCS2->CP932 is done
        directly (not UCS2->UTF8->CP932 or the like), the cost
        does not change much, and we can afford it.
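
   A minimal sketch of such a string object (all names here are
   illustrative, not existing Samba API; the converters stand in
   for iconv()-based helpers):

      #include <stdint.h>
      #include <stddef.h>

      /* Assumed converters; a real implementation would sit on
       * top of iconv() or Samba's own conversion layer. */
      extern uint16_t *convert_unix_to_ucs2(const char *s);
      extern char *convert_ucs2_to_unix(const uint16_t *s);

      /* One logical STRING with both images cached after first
       * use. */
      struct lazy_string {
              uint16_t *ucs2;    /* internal form, NULL until needed */
              char *unix_img;    /* unix IO image, NULL until needed */
      };

      static const uint16_t *get_ucs2(struct lazy_string *s)
      {
              if (s->ucs2 == NULL)
                      s->ucs2 = convert_unix_to_ucs2(s->unix_img);
              return s->ucs2;
      }

      static const char *get_unix(struct lazy_string *s)
      {
              if (s->unix_img == NULL)
                      s->unix_img = convert_ucs2_to_unix(s->ucs2);
              return s->unix_img;
      }

   The second stat() on the same name then reuses the cached byte
   image instead of converting again.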


3) What we really do not want is a complex character handling
   system.

   You see, in Japan we have EUC, JIS, CP932 and UTF-8 as unix IO
   encodings.

   CP932 is a mixture of 1-byte ASCII and 2-byte SJIS characters,
   where the 2nd byte may fall in the ASCII range.

   EUC is 1-byte ASCII plus 2- or 3-byte multibyte characters that
   use only 0x80-0xFF.

   JIS only uses 0x00-0x7F, with MODE-CHANGE escape sequences in
   between.

   You know about UTF-8.

   It's a nightmare.
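
   To show just one reason why (a minimal demonstration, not Samba
   code): in CP932 the character U+8868 is encoded as the byte
   pair 0x95 0x5C, and 0x5C is the ASCII code of '\\', so any
   per-byte separator handling corrupts file names.

      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
              /* U+8868 in CP932 is the byte pair 0x95 0x5C; the
               * second byte happens to be ASCII '\\'. */
              const char cp932_char[] = "\x95\x5C";

              /* A naive byte scan "finds" a path separator in the
               * middle of the character. */
              if (strchr(cp932_char, '\\') != NULL)
                      printf("bogus separator inside a character\n");
              return 0;
      }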


   So. We do not want to do '/' <-> '\\' conversion and the like
   for each code set. We do not want to face the 'unix IO code'
   during internal string handling. We want to convert each string
   only once, and after conversion, we do not want to manipulate
   the unix IO form again.

   As long as the internal code is UCS2, the manipulation code is
   easy to debug (at least easier than facing 4 different
   encodings). As long as the conversions happen at only one
   point, they are easy to debug too (see the sketch below).
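
   A sketch of that single conversion point (all names are
   illustrative): a name is converted to UCS2 exactly once on the
   way in, all internal logic works on the UCS2 form, and it is
   converted back exactly once on the way out.

      #include <stdint.h>

      /* Assumed boundary converters, one per direction. */
      extern uint16_t *to_internal(const char *unix_name);
      extern char *to_unix(const uint16_t *name);

      /* Illustrative internal operation: sees only UCS2. */
      extern void normalize_name_w(uint16_t *name);

      static char *canonicalize(const char *unix_name)
      {
              uint16_t *name = to_internal(unix_name); /* convert once, in  */
              normalize_name_w(name);                  /* manipulate UCS2   */
              return to_unix(name);                    /* convert once, out */
      }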


   Users will accept a slow system, as long as conversions take
   only 1 or 2 msec per entry. But they will not accept a system
   that cannot handle what Microsoft has handled.

   So our 1st-priority wish is a system that is easy to debug. Not
   a fast one.


Any questions or opinions are welcome.

best regards,
---- 
Kenichi Okuyama


