i18n question.

Michael B Allen mba2000 at ioplex.com
Sat Mar 6 05:48:08 GMT 2004


Kenichi Okuyama said:
> MBA> Just out of curiosity, is the Japanese crowd really not satisfied
> MBA> with UTF-8? Is it too slow?
>
> 1) There are too many 'already running' systems that use other
>    encodings ( or should I say, systems that have been running since
>    before anyone started talking about Unicode ). We're not newcomers
>    to the unix world.
>
>    This includes moving from an old system to a new system.
>    Since the file system ( and the tar format ) does not care about
>    character sets, no character conversion is done when we make
>    backups. And many system administrators do not wish to try a
>    conversion.

There are many good tools for converting entire filesystems from one
encoding to another. Moving forward, this will be the correct solution.

>    ( I must say that Unicode does not REALLY cover the other
>    Japanese character encodings. Such characters are rarely used, but
>    most admins do not wish to bet on their luck ).

Why would it be a matter of "luck"? If an illegal character sequence is
encountered, the conversion utility should report an error. I don't
believe this is as serious as you claim, either. If you cannot represent
your filenames with Unicode, you have a much bigger problem.
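
For example, a conversion pass over filenames can be built on iconv(3),
which is what tools like convmv do wholesale. A minimal sketch, assuming
"EUC-JP" as the source encoding (the encoding name, function name, and
buffer handling here are illustrative, not any particular tool's code):

  #include <iconv.h>
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>

  /* Convert one EUC-JP filename to UTF-8; returns 0 on success.
   * An invalid input sequence makes iconv() fail with EILSEQ, so
   * bad names are reported rather than silently mangled. */
  int
  convert_name(const char *src, char *dst, size_t dstlen)
  {
      iconv_t cd = iconv_open("UTF-8", "EUC-JP");
      char *in = (char *)src, *out = dst;
      size_t inleft = strlen(src), outleft = dstlen - 1;

      if (cd == (iconv_t)-1)
          return -1;                  /* conversion pair unsupported */
      if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
          if (errno == EILSEQ)
              fprintf(stderr, "illegal sequence in: %s\n", src);
          iconv_close(cd);
          return -1;                  /* report, don't guess */
      }
      *out = '\0';
      iconv_close(cd);
      return 0;
  }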

> 2) Well, it is true that UTF-8 is slow.
<snip>
>    For example, case insensitive search/return of Russian
>    characters ( I don't know the correct name of those characters ).

Cyrillic.

>    Did you know that CP932 has Russian characters in its character set?
>
>    We found that in some versions of Windows, Russian characters have
>    to be treated case-insensitively. We had patches for this in
>    2.2.*, but they were lost when we moved to 3.0.

What versions of Windows would that happen to be? Can you provide
specifics on this?

>    Having the internal character set in UTF-8 and handling such case
>    insensitivity is hard. I do agree that as long as a character fits
>    within ASCII, UTF-8 is as easy as UCS2. But outside ASCII, UTF-8
>    is a nightmare, because characters may change byte length when we
>    convert them from one case to the other.
>    # We had a similar nightmare with EUC and JIS. So we know this is
>    # hard. And worst of all, there was no silver bullet.

Yes, character-by-character manipulation with UTF-8 is not easy. But if
the string routines are adequately normalized, it should not be necessary
to write these functions frequently. For example:

  #include <string.h>   /* memset */
  #include <ctype.h>    /* toupper */
  #include <wchar.h>    /* mbrtowc, mbstate_t */
  #include <wctype.h>   /* towupper */

  /* Case insensitive comparison of two UTF-8 strings bounded by
   * [str1, str1lim) and [str2, str2lim).
   */
  int
  utf8casecmp(const unsigned char *str1,
        const unsigned char *str1lim,
        const unsigned char *str2,
        const unsigned char *str2lim)
  {
    int n1, n2;
    wchar_t ucs1, ucs2;
    int ch1, ch2;
    mbstate_t ps1, ps2;

    memset(&ps1, 0, sizeof(ps1));
    memset(&ps2, 0, sizeof(ps2));
    while (str1 < str1lim && str2 < str2lim) {
        if ((*str1 & 0x80) && (*str2 & 0x80)) { /* both multibyte */
            /* decode one full character from each string */
            if ((n1 = (int)mbrtowc(&ucs1, (const char *)str1,
                        str1lim - str1, &ps1)) < 0 ||
                    (n2 = (int)mbrtowc(&ucs2, (const char *)str2,
                        str2lim - str2, &ps2)) < 0) {
                return -1;            /* invalid or incomplete sequence */
            }
            if (ucs1 != ucs2 &&
                    (ucs1 = towupper(ucs1)) != (ucs2 = towupper(ucs2))) {
                return ucs1 < ucs2 ? -1 : 1;
            }
            str1 += n1;
            str2 += n2;
        } else {                      /* neither or one multibyte */
            ch1 = *str1;
            ch2 = *str2;

            if (ch1 != ch2 &&
                    (ch1 = toupper(ch1)) != (ch2 = toupper(ch2))) {
                return ch1 < ch2 ? -1 : 1;
            } else if (ch1 == '\0') {
                return 0;             /* embedded NUL terminates both */
            }
            str1++;
            str2++;
        }
    }
    /* one string is a prefix of the other; the shorter one sorts first */
    return (str1 < str1lim) - (str2 < str2lim);
  }
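
One caveat: mbrtowc() decodes according to the current locale, so the
caller must be running under a UTF-8 locale for the routine above to
actually see UTF-8. A hypothetical caller (the locale name is an
assumption; it varies from system to system):

  #include <locale.h>
  #include <stdio.h>

  int utf8casecmp(const unsigned char *, const unsigned char *,
        const unsigned char *, const unsigned char *);

  int
  main(void)
  {
      /* "en_US.UTF-8" is only an example name */
      if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
          return 1;
      {
          const unsigned char s1[] = "\xc3\xa4pfel"; /* a-umlaut, "pfel" */
          const unsigned char s2[] = "\xc3\x84PFEL"; /* A-umlaut, "PFEL" */

          /* limits point one past the last byte, excluding the NUL */
          printf("%d\n", utf8casecmp(s1, s1 + sizeof(s1) - 1,
                  s2, s2 + sizeof(s2) - 1)); /* prints 0 */
      }
      return 0;
  }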


>    Hence, I'd like to suggest that the 'INTERNAL character code' should
>    be something like UCS2, fixed length per character. Windows has
>    selected UCS2 as the character set that is easiest for it to
>    manipulate. That means that as long as we use UCS2 for the internal
>    code, we will not face big problems.

I don't agree. The optimal internal encoding is the filesystem encoding,
and the most favorable filesystem encoding is UTF-8.

>       - Conversion between UCS2 and UTF-8 is very quick.

It is noticeably slower than no conversion at all.
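
Still, the per-character work is tiny, which is why the conversion is
cheap even if it is not free. A minimal sketch of the UCS2-to-UTF-8 step
for BMP characters (illustration only, not Samba's conversion code;
surrogates and error checking are omitted):

  /* Encode one UCS2 (BMP) code unit as UTF-8; returns bytes written. */
  static int
  ucs2_to_utf8(unsigned int c, unsigned char *out)
  {
      if (c < 0x80) {                /* 1 byte: ASCII passes through */
          out[0] = c;
          return 1;
      } else if (c < 0x800) {        /* 2 bytes */
          out[0] = 0xC0 | (c >> 6);
          out[1] = 0x80 | (c & 0x3F);
          return 2;
      }
      out[0] = 0xE0 | (c >> 12);     /* 3 bytes */
      out[1] = 0x80 | ((c >> 6) & 0x3F);
      out[2] = 0x80 | (c & 0x3F);
      return 3;
  }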

>         So, if you are using UTF-8, the cost should be very low. This
>         means if you are using ASCII, the conversion cost is also low.
>
>         If you are using CP932 or other character sets.... well,
>         we are already paying much. So, as long as UCS2->CP932

Converting the beast to UTF-8 is a much, much better solution.

> 3) What we really do not wish to have is a complex character handling
>    system.
>
>    You see, we have EUC, JIS, CP932 and UTF-8 as unix IO encodings in
>    Japan.
>
>    CP932 is a mixture of 1-byte ASCII and 2-byte SJIS characters, with
>    the 2nd byte overlapping the ASCII area.
>
>    EUC is 1-byte ASCII plus 2- or 3-byte MB characters using only
>    0x80-0xFF.
>
>    JIS only uses 0x00-0x7F, and has MODE-CHANGE escape sequences
>    in between.

I agree. This is pretty bad. You should really be converting to UTF-8
wherever possible.
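
To make the mixture concrete, here is the same character, HIRAGANA
LETTER A (U+3042), in each of the four encodings (byte values taken from
the published tables):

  const unsigned char a_utf8[]  = { 0xE3, 0x81, 0x82 };
  const unsigned char a_eucjp[] = { 0xA4, 0xA2 };
  const unsigned char a_cp932[] = { 0x82, 0xA0 };
  const unsigned char a_jis[]   = { 0x1B, 0x24, 0x42, /* ESC $ B: to JIS X 0208 */
                                    0x24, 0x22,
                                    0x1B, 0x28, 0x42 }; /* ESC ( B: to ASCII */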

>    You know about UTF-8.
>
>    It's nightmare.

UTF-8 is not a nightmare. It's not as easy as fixed-width encodings, but
I go out of my way to ensure that most, if not all, of my C stuff works
with UTF-8 as well as most other multibyte encodings, and I don't have
much trouble with it. And I'm an American!

>    So. We do not want to do '/' <-> '\\' conversion or such for each
>    code set. We do not want to face the 'unix IO code' during internal
>    string handling. We only want to convert them once, and after
>    conversion, we don't want to manipulate them.

I don't see how that can be avoided when filesystem I/O on non-Windows
systems (the systems that Samba runs on) supports nothing but multibyte
filenames.
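
It is worth spelling out why the '/' <-> '\\' case is so painful: in
CP932 the second byte of a double-byte character can itself be 0x5C
('\\'), so a naive byte-wise scan corrupts filenames. A sketch of the
lead-byte-aware walk each legacy code set forces on you (the lead-byte
ranges are from the published CP932 tables; the function is illustrative):

  /* Replace '\\' with '/' in a CP932 string without touching
   * double-byte characters whose second byte happens to be 0x5C.
   * CP932 lead bytes are 0x81-0x9F and 0xE0-0xFC. */
  static void
  cp932_fix_separators(unsigned char *s)
  {
      while (*s) {
          if ((*s >= 0x81 && *s <= 0x9F) || (*s >= 0xE0 && *s <= 0xFC)) {
              if (s[1] == '\0')
                  break;            /* truncated double-byte character */
              s += 2;               /* skip pair; 2nd byte may be 0x5C */
          } else {
              if (*s == '\\')
                  *s = '/';
              s++;
          }
      }
  }

In UTF-8 this problem disappears: no byte of a multibyte sequence falls
below 0x80, so a byte-wise scan for '/' or '\\' is always safe.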

I don't know what the situation is like in Japan, but I would think
conversion to UTF-8 would be the highest priority. If you complained that
UTF-8 is too slow, that would be a valid argument. But coding for UTF-8
is no more or less difficult regardless of what language it is being used
to represent.

Mike




