i18n question.

Michael B Allen mba2000 at ioplex.com
Sun Mar 7 07:50:58 GMT 2004


Kenichi Okuyama said:
> Dear Michael,
>
>>>>>> "Michael" == Michael B Allen <mba2000 at ioplex.com> writes:
>>> Unicode do not fully contain what we had in CP932 nor EUC nor
>>> JIS. There is 'machine dependent' characters which causes trouble.
> Michael> So you're claiming you cannot map these to Unicode? If so, then you cannot
> Michael> use Windows fileservers with Unicode? Do you run entirely in a SHIFT-JIS,
> Michael> EUC-JP, or CP932 locale?
>
> SHIFT-JIS is CP932. So you don't have to worry about SHIFT-JIS.
>
> Uh... You are asking me to explain over 20years of history of characters
> in one mail...

No, I just asked whether Japanese can be fully represented with the scripts provided by the
Unicode standard. I would also very much like to know whether Japanese Windows users run a
version of Windows different from the Western one. Do you run in a specific locale such as
Shift-JIS, or are filenames encoded in Unicode?

> You should read 'CJKV' for correct and presice story...

I have done a little homework just now. I skimmed over the following descriptions:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/japan.inf-032092.sjs
  http://www.modlangs.gatech.edu/Programs/Japanese/s-jis.html

After reading this information I believe I misunderstood your description of "troublesome
characters". I thought you claimed that these characters could not be represented using
Unicode. This is not true. They can. I suspect the real problem is that you cannot make the
"round trip" without the possibility of losing information. Is this true? If so, how will
using fixed-width characters internally within Samba help you? Assuming Samba used a larger
fixed-width encoding internally, what charset would you use?

>>> There are backups, there are old datas, which may have chance of
>>> such trouble-some characters.
> Michael> So add conversion to your restore procedures.
>
> How can you prove that will not cause error?

I am still highly confused by this statement. If you tried to convert a filesystem with
pathnames encoded in Shift-JIS to Unicode, but Unicode was not sufficient to represent the
"troublesome characters", would the conversion tool not fail with EILSEQ?

Or perhaps the problem is that after the conversion, clients would not interpret these
characters in the same way they did when Shift-JIS was used? For example, because the user
used an input method that created a different code point?
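
To make that concrete, here is a minimal sketch (not a real conversion tool, and the sample
byte sequence is arbitrary) of how a converter could surface both failure modes with iconv(3):
an unmappable sequence makes iconv() fail with errno set to EILSEQ, and a mapping that
succeeds but does not survive the trip back to CP932 is caught by comparing the result:

  #include <errno.h>
  #include <iconv.h>
  #include <stdio.h>
  #include <string.h>

  /* Convert inlen bytes of in from charset "from" to charset "to".
   * Returns the number of bytes written, or -1 on failure. */
  static int
  convert(const char *to, const char *from, const char *in, size_t inlen,
          char *out, size_t outsize)
  {
      iconv_t cd = iconv_open(to, from);
      char *inp = (char *)in, *outp = out;
      size_t outleft = outsize;

      if (cd == (iconv_t)-1)
          return -1;
      if (iconv(cd, &inp, &inlen, &outp, &outleft) == (size_t)-1) {
          iconv_close(cd);
          return -1;                   /* EILSEQ, EINVAL, or E2BIG */
      }
      iconv_close(cd);
      return (int)(outsize - outleft);
  }

  int
  main(void)
  {
      const char name[] = "\x88\xa0"; /* some CP932 double-byte sequence */
      char utf8[64], back[64];
      int n, m;

      if ((n = convert("UTF-8", "CP932", name, sizeof(name) - 1,
                       utf8, sizeof(utf8))) == -1) {
          perror("CP932 -> UTF-8");    /* unmappable bytes: EILSEQ */
          return 1;
      }
      m = convert("CP932", "UTF-8", utf8, n, back, sizeof(back));
      if (m != (int)(sizeof(name) - 1) || memcmp(back, name, m) != 0) {
          fprintf(stderr, "round trip changed the name\n");
          return 1;
      }
      printf("round trip ok\n");
      return 0;
  }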

> Michael> I would like to know more about these "trouble-some characters". Can you
> Michael> provide a link in English that describes in detail a case where converting
> Michael> to Unicode fails due to inadequate charset or character encoding support?
> Michael> Please provide a byte sequence in CP932 that does not map to Unicode.
>
> Read CJKV... That's almost only information source I can give you in
> English ( yes English is the biggest barrier you're giving me ).

Your English is much better than my French. :)

> You see, most of the people was like you, thinking that character
> conversion is easy. On other hand, most of those could not
> understand why same character code with same charset stands for
> different character, was not even interested in this kind of problem
> ( or should I say they were closing eye against the problem ).

That doesn't make sense. Do you mean that there are multiple code points that represent the
same glyph?

>>> "stop using those troublesome characters"
>>> "you can't use that character anymore"
>>> "you will not be able to access to such filename"
> Michael> This sounds like more of a logistical problem rather than a software one.
>
> But it start from charset conversion problem.
> So it is also software problem.
>
> And no matter whether it is logistical or software problem,
> it does not matter as long as they do not move to UTF-8.

It's a logistical problem because it requires a significantly coordinated effort. To convert
your mixture of EUC-JP and different dialects of CP932 filenames, it would be necessary to
analyze all filenames on all fileservers, establish an organization-wide (or nationwide)
strategy, and develop the tools for converting filenames. All of that has to consider how
clients will interpret filenames with "troublesome characters", what input methods are used,
special third-party software, etc. So the conversion itself is pretty insignificant compared
to what is required for the end-to-end solution.

>>> I don't agree with you about "optimal encoding is the filesystem
>>> encoding", because I have seen many filesystem using CP932 which
>>> have problem over '\' treatments.
>>> # As you know '\' have special meaning for C, which time after time
>>> # is causing trouble, not only in Samba, but for other programs too.
> Michael> If you're iterating over each character in a multibyte sequence it will be
> Michael> necessary to convert each to a wide character. UTF-8 just happens to be
> Michael> designed such that you don't have to do this to search for ascii
> Michael> characters.
>
> Yes and No and Yes.
>
> Yes wide character(not in meaning of wchar) is easiest way. And
> that's reason why we push UTF-16.  UTF-16 works for wide character
> in most case. If not, use UCS4 or UCS8 or whatever.
> wchar do not help, because wchar is unclearly defined nor
> can not force what charset to use.

You repeatedly reference this '\' problem, but being able to scan for '\' without decoding is
just an optimization. I do not think Japanese users should be concerned with using a
fixed-width encoding just to take advantage of optimizations such as this.

To correctly iterate over each character in a string, each sequence of bytes must be
converted into a single integer of the target codeset. The wchar_t type just provides an
advantage because the wide character functions can be used to assist with string handling
(e.g. towupper).
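
For illustration, here is a minimal sketch of that kind of iteration using mbrtowc(3). It
assumes the process locale has been set to match the string's encoding (e.g. a Japanese
CP932 locale); because mbrtowc() consumes a whole multibyte character at a time, a 0x5c
trail byte inside a double-byte character is never mistaken for '\':

  #include <locale.h>
  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>

  /* Return the byte offset of the first real backslash, or -1. */
  static long
  find_backslash(const char *s)
  {
      mbstate_t st;
      const char *p = s;
      wchar_t wc;
      size_t n;

      memset(&st, 0, sizeof(st));
      while (*p) {
          n = mbrtowc(&wc, p, MB_CUR_MAX, &st);
          if (n == (size_t)-1 || n == (size_t)-2)
              return -1;               /* invalid or truncated sequence */
          if (wc == L'\\')
              return (long)(p - s);
          p += n;                      /* skip the whole character */
      }
      return -1;
  }

  int
  main(void)
  {
      setlocale(LC_CTYPE, "");         /* use the environment's codeset */
      printf("%ld\n", find_backslash("dir\\file"));
      return 0;
  }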

> For example, Cyrillics. I've learned from other Japanese people that
> current 3.x Cyrillic case-insensitive code converts from UTF-8 to
> UCS2, then handle whatever nessasary, and re-convert back to UTF-8.
> This is happening for Greek and other languages too.
> You call this think 'good performance'? I don't.
> Do we have better way? No as long as we use UTF-8.
> What is better solution? use UTF-16!
>
> The very fact that you think UTF-8 is in good performance, is simply
> because you live only within ASCII world. Most of the people are not.
> Even within USA, many people are not.

No. I never said UTF-8 provided good performance. If you look at my first post, you will see
that I guessed the reason Japanese users were unhappy with UTF-8 was the performance impact.

Yes. The performance impact of UTF-8 for Japanese is going to be significant in some cases,
because to properly iterate over each character in a string it is REQUIRED to convert each
multibyte sequence to a single integer code point.

In practice, however, I think UTF-8 could be made to work fairly well for Japanese. An example
is the utf8casecmp I posted earlier. If you think about the sequence of events that occur,
there is ultimately very little conversion going on. It's all in the technique and in how much
the common string routines are reused/normalized.
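
For reference, a minimal sketch of the general idea (illustrative, not necessarily the exact
routine I posted): decode one code point from each string at a time and compare the
case-folded values. This assumes a UTF-8 locale in which wchar_t holds Unicode code points:

  #include <string.h>
  #include <wchar.h>
  #include <wctype.h>

  int
  utf8casecmp(const char *a, const char *b)
  {
      mbstate_t sa, sb;
      wchar_t ca, cb;
      size_t na, nb;

      memset(&sa, 0, sizeof(sa));
      memset(&sb, 0, sizeof(sb));
      for (;;) {
          if (*a == '\0' || *b == '\0')
              return (unsigned char)*a - (unsigned char)*b;
          na = mbrtowc(&ca, a, MB_CUR_MAX, &sa);
          nb = mbrtowc(&cb, b, MB_CUR_MAX, &sb);
          if (na == (size_t)-1 || na == (size_t)-2 ||
              nb == (size_t)-1 || nb == (size_t)-2)
              return strcmp(a, b);     /* fall back on bad input */
          ca = towlower(ca);
          cb = towlower(cb);
          if (ca != cb)
              return (int)ca - (int)cb;
          a += na;
          b += nb;
      }
  }

Note that for ASCII-only names every decode consumes exactly one byte, so the common case
stays cheap.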

>>> I do agree with you that one of the most favorable filesystem
>>> charset is UTF-8, but FS charset is not something we can have
>>> control over.
> Michael> UTF-8 is favorable because it is a Unicode encoding that can be used
> Michael> directly with the filesystem api. My understanding was that Japanese could
> Michael> be adequately represented using Unicode. I would very much like to see
> Michael> specific examples where this is not true. Please provide a link so that I
> Michael> can educate myself. I sincerely want to make my C projects as accessible
> Michael> as I can.
>
>
> You believe that 'Unicode can be used' as long as interface is
> 'UTF-8'. Well ofcourse that's true. But 'Unicode is not one and only
> way to describe Japanese' or should I say 'Unicode only contains
> very small subset of Japanese'. So very fact that UTF-8 is
> supporting some Japanese, do not means they support ENOUGH.

The Unicode website claims to encode the Kanji (Han), Katakana, and Hiragana scripts. That has
to be like 12,000 characters. That is a "very small subset of Japanese"? Again, assuming you
used UCS2 internally, what charset would you use?

> Michael> If you must continue to use EUC-JP, SHIFT-JIS and CP932 then I recommend
> Michael> creating a string abstraction that parameterizes all of the string
> Michael> operations that need to consider the charset or encoding. You could then
> Michael> link against a different set of routines to get different character
> Michael> behavior without changing any of the string routines.
>
> I don't see the reason why thing have to be so complex.

It's not "complex" it is *indirection*. Think of it like the VFS without the ops structure.

> I mean, I know what you're saying because that's how we used to
> solve the problem in 2.2.x. But look at 3.0 code. The solution code
> we've added are all gone?!

Dunno. I don't actually know the Samba code too well.

> Think carefully. Why are they gone?!? Did Tridge spite on us?
> Ofcourse not!  It is because such structure adds unrealistic
> complexity to Samba.

Then I would argue the string abstraction was not implemented well.

> In 2.2.x, we were only focusing on multi-lingual, like English and
> Japanese. It was terrible, but we could handle it.
>
> Now in 3.x and above, we have to look for I18N, we can't say "Use
> xxx for case of yyy".  We shouldn't. We need one single style which
> solves everything. At least everything that Windows can handle. And

I believe it is reasonable to create a string abstraction such that a string becomes just a
chunk of memory. Direct pointer manipulation/arithmetic would have to be replaced with
functions and macros. That could be done in a pretty clean way, I think. Then you could use
just about any encoding you can think of.
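
As a sketch of what I mean (all names hypothetical), the ops table could look like the
following; the stand-in here is a trivial single-byte implementation, and a real build would
link in encoding-aware tables for UTF-8, CP932, and so on:

  #include <ctype.h>
  #include <stddef.h>
  #include <stdio.h>

  struct str_ops {
      size_t (*charlen)(const char *s);  /* bytes in the next character */
      int    (*casecmp)(const char *a, const char *b);
  };

  static size_t sb_charlen(const char *s) { return *s ? 1 : 0; }

  static int
  sb_casecmp(const char *a, const char *b)
  {
      while (*a && tolower((unsigned char)*a) == tolower((unsigned char)*b)) {
          a++;
          b++;
      }
      return tolower((unsigned char)*a) - tolower((unsigned char)*b);
  }

  static const struct str_ops singlebyte_ops = { sb_charlen, sb_casecmp };

  /* Selected at link or startup time; callers never know the encoding. */
  static const struct str_ops *ops = &singlebyte_ops;

  int
  main(void)
  {
      printf("%d\n", ops->casecmp("README", "readme") == 0);
      return 0;
  }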

> isn't that the reason why you've selected UTF-8 as FS charset too?

Not really. I think it's used because that's what the filesystem API can take. Otherwise you
have situations such as the ones tridge explained previously.

>>> P.S. I believe Mike and I are looking at same Heaven's gate.
>>> Only, Mike believes gate is right near by.
>>> I believe what we have between gate and us is river.
>>> So Mike says to run, I says to stop and build bridge.
> Michael> You need to get a nationalized initiative to ferry your people over to the
> Michael> side of the river that everyone else is on. I understand you're not happy
> Michael> with the situation but Unicode is the future for the vast majority of us.
>
> Wrong. Ferry works fine only when peoples are few.
>
> You are wrong. We have more people on this side of river.
> You are forgetting Chinese, Korian, Vietnums, Cyrillics, Greeks, and
> other languages which have character outside ASCII.

We can and do use Unicode. Most of my C stuff supports the locale-dependent encoding
(including UTF-8) and wchar_t with a recompile. I think I just read that China standardized
on Unicode?
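
The recompile trick is nothing exotic; a sketch (names hypothetical, in the spirit of
Windows' TCHAR):

  #include <locale.h>
  #include <stdio.h>

  #ifdef USE_WIDE
  #include <wchar.h>
  typedef wchar_t xchar;
  #define XSTR(s)   L##s
  #define xprint(s) wprintf(L"%ls\n", (s))
  #else
  typedef char xchar;
  #define XSTR(s)   s
  #define xprint(s) printf("%s\n", (s))
  #endif

  int
  main(void)
  {
      const xchar *msg = XSTR("hello");

      setlocale(LC_ALL, "");           /* honor the user's encoding */
      xprint(msg);
      return 0;
  }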

> Building Bridge is one and only REALISTIC solution.
> Or else, you should come here once, and divide the river, like Moses.

Moses? I'm a Buddhist.

Mike-son


