i18n question.

Kenichi Okuyama okuyamak at dd.iij4u.or.jp
Sun Mar 7 03:39:29 GMT 2004


Dear Michael,

>>>>> "Michael" == Michael B Allen <mba2000 at ioplex.com> writes:
>> Unicode does not fully contain what we had in CP932, EUC, or
>> JIS. There are 'machine dependent' characters which cause trouble.
Michael> So you're claiming you cannot map these to Unicode? If so, then you cannot
Michael> use Windows file servers with Unicode? Do you run entirely in a SHIFT-JIS,
Michael> EUC-JP, or CP932 locale?

SHIFT-JIS is CP932. So you don't have to worry about SHIFT-JIS.

Uh... You are asking me to explain over 20 years of character history
in one mail... You should read 'CJKV' for the correct and precise story...


>> There are backups, and there is old data, which may contain
>> such troublesome characters.
Michael> So add conversion to your restore procedures.

How can you prove that will not cause errors?


Ah, by the way, they are not MY restore procedures. Mine are all
done, not because I took a risk at the restore step, but because I
forbade the characters that cause problems from the very beginning,
when I first set up my server.

Hence, for the people who are having problems now, I have no power
to force conversion on them. I need PROOF that the conversion causes
no problems in order to get others to follow.
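
For example, a round-trip test is the minimum I would ask for. A
rough sketch with iconv (error handling trimmed; the function name is
mine, not from any real code):

/* Check whether a CP932 byte sequence survives the round trip
 * CP932 -> UTF-8 -> CP932.  CP932 contains duplicate areas (NEC
 * row 13 vs. the IBM extensions) that map to the SAME Unicode
 * character, so the bytes that come back are not always the bytes
 * that went in. */
#include <iconv.h>
#include <string.h>

static int survives_roundtrip(const char *in, size_t inlen)
{
    char utf8[1024], back[1024];
    char *src = (char *)in, *dst = utf8;
    size_t slen = inlen, dlen = sizeof(utf8);
    iconv_t a = iconv_open("UTF-8", "CP932");
    iconv_t b = iconv_open("CP932", "UTF-8");

    if (a == (iconv_t)-1 || b == (iconv_t)-1)
        return -1;                        /* converter not available */

    if (iconv(a, &src, &slen, &dst, &dlen) == (size_t)-1)
        return 0;                         /* does not even map to Unicode */

    slen = (size_t)(dst - utf8);          /* length of the UTF-8 form */
    src = utf8; dst = back; dlen = sizeof(back);
    if (iconv(b, &src, &slen, &dst, &dlen) == (size_t)-1)
        return 0;                         /* cannot map back */

    iconv_close(a);
    iconv_close(b);
    return (size_t)(dst - back) == inlen && memcmp(back, in, inlen) == 0;
}

/* Try, say, "\xfa\x4a" (Roman numeral one, IBM extension area): with
 * many converters it comes back as the NEC duplicate 0x87 0x54. */

Until something like this returns 1 for EVERY file name on EVERY old
backup, "just add conversion" is not an answer.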


Michael> I would like to know more about these "trouble-some characters". Can you
Michael> provide a link in English that describes in detail a case where converting
Michael> to Unicode fails due to inadequate charset or character encoding support?
Michael> Please provide a byte sequence in CP932 that does not map to Unicode.

Read CJKV... That's almost the only information source I can give you
in English (yes, English is the biggest barrier you're giving me).

You see, most people were like you, thinking that character
conversion is easy. On the other hand, most of them could not
understand why the same character code in the same charset can stand
for different characters, and were not even interested in this kind
of problem (or should I say they were closing their eyes to the
problem).

And those who were closing their eyes defined Unicode.


>> "stop using those troublesome characters"
>> "you can't use that character anymore"
>> "you will not be able to access to such filename"
Michael> This sounds like more of a logistical problem rather than a software one.

But it starts from a charset conversion problem.
So it is also a software problem.

And whether it is a logistical problem or a software problem does not
matter, as long as they do not move to UTF-8.


>> I don't agree with you about "the optimal encoding is the filesystem
>> encoding", because I have seen many filesystems using CP932 which
>> have problems with the treatment of '\'.
>> # As you know, '\' has special meaning in C, which time after time
>> # causes trouble, not only in Samba but in other programs too.
Michael> If you're iterating over each character in a multibyte sequence it will be
Michael> necessary to convert each to a wide character. UTF-8 just happens to be
Michael> designed such that you don't have to do this to search for ASCII
Michael> characters.

Yes and No and Yes.

Yes, a wide character (not in the sense of wchar_t) is the easiest
way. And that's the reason why we push UTF-16. UTF-16 works as a wide
character in most cases. If not, use UCS-4 or whatever.
wchar_t does not help, because wchar_t is not clearly defined, nor
can it force which charset to use.

No, there are several other ways to solve it. But they are very complex.
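
For example, even "find the path separator" is complex in CP932,
because 0x5C ('\') appears as the second byte of double-byte
characters. A sketch (the lead-byte ranges are from the CP932
definition):

/* Find '\' in a CP932 string.  You cannot use strchr(), because
 * 0x5C appears as the SECOND byte of double-byte characters
 * (e.g. HYOU, U+8868, is 0x95 0x5C in CP932), so you must walk
 * the string and skip trail bytes by hand. */
#include <stddef.h>

static const char *cp932_find_backslash(const unsigned char *s)
{
    while (*s) {
        unsigned char c = *s;
        if ((c >= 0x81 && c <= 0x9f) || (c >= 0xe0 && c <= 0xfc)) {
            if (s[1] == '\0')
                break;            /* truncated double-byte character */
            s += 2;               /* lead byte: skip the trail byte too */
        } else if (c == 0x5c) {
            return (const char *)s;
        } else {
            s++;
        }
    }
    return NULL;
}

/* With UTF-8 this whole function collapses to strchr(s, '\\'),
 * because bytes below 0x80 never occur inside a multibyte character. */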


Hence, yes. You are right, and that's why we want to have that WIDE
CHARACTER as the 'internal charset'. UTF-8 does not do well enough as
an internal charset, because there are characters outside the ASCII
space that we need to deal with. I'm pushing UCS-2 or UCS-4 or
whatever.

For example, Cyrillic. I've learned from other Japanese people that
the current 3.x Cyrillic case-insensitive code converts from UTF-8 to
UCS-2, handles whatever is necessary, and re-converts back to UTF-8.
This is happening for Greek and other languages too.
Do you call this 'good performance'? I don't.
Do we have a better way? Not as long as we use UTF-8.
What is the better solution? Use UTF-16!
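
Roughly, this is what the current code must pay on every such
operation (illustration only; these are not the real Samba routines,
and towlower() stands in for proper Unicode case folding):

/* Case-folding a UTF-8 string by way of UCS-2: two conversions
 * wrapped around one trivial loop.  With UCS-2/UTF-16 as the
 * internal charset, only the middle loop would remain. */
#include <iconv.h>
#include <stdint.h>
#include <string.h>
#include <wctype.h>

static size_t utf8_fold_via_ucs2(const char *in, char *out, size_t outsz)
{
    uint16_t ucs[512];
    char *src = (char *)in, *dst = (char *)ucs;
    size_t slen = strlen(in), dlen = sizeof(ucs);
    size_t i, n;

    iconv_t u = iconv_open("UCS-2LE", "UTF-8");       /* conversion #1 */
    if (u == (iconv_t)-1 ||
        iconv(u, &src, &slen, &dst, &dlen) == (size_t)-1)
        return 0;
    iconv_close(u);

    n = (sizeof(ucs) - dlen) / 2;
    for (i = 0; i < n; i++)                           /* the real work */
        ucs[i] = (uint16_t)towlower(ucs[i]);

    src = (char *)ucs; slen = n * 2;
    dst = out;         dlen = outsz;
    iconv_t d = iconv_open("UTF-8", "UCS-2LE");       /* conversion #2 */
    if (d == (iconv_t)-1 ||
        iconv(d, &src, &slen, &dst, &dlen) == (size_t)-1)
        return 0;
    iconv_close(d);
    return outsz - dlen;                  /* bytes of folded UTF-8 */
}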

The very fact that you think UTF-8 performs well is simply because
you live only within the ASCII world. Most people do not. Even within
the USA, many people do not.


>> I do agree with you that one of the most favorable filesystem
>> charsets is UTF-8, but the FS charset is not something we have
>> control over.
Michael> UTF-8 is favorable because it is a Unicode encoding that can be used
Michael> directly with the filesystem api. My understanding was that Japanese could
Michael> be adequately represented using Unicode. I would very much like to see
Michael> specific examples where this is not true. Please provide a link so that I
Michael> can educate myself. I sincerely want to make my C projects as accessible
Michael> as I can.


You believe that 'Unicode can be used' as long as the interface is
'UTF-8'. Well, of course that's true. But Unicode is not the one and
only way to describe Japanese, or should I say, Unicode contains only
a very small subset of Japanese. So the very fact that UTF-8 supports
some Japanese does not mean it supports ENOUGH.

On the other hand, if you do not know Japanese, you're not a good
customer at all for those who support a larger set of Japanese.
Hence, it is very difficult for me to find such a web page for you...


Michael> If you must continue to use EUC-JP, SHIFT-JIS and CP932 then I recommend
Michael> creating a string abstraction that parameterizes all of the string
Michael> operations that need to consider the charset or encoding. You could then
Michael> link against a different set of routines to get different character
Michael> behavior without changing any of the string routines.
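
(For concreteness, what Michael suggests would look roughly like
this; every name here is made up:)

/* Every string operation goes through a table of function pointers,
 * and the caller picks the table matching the charset at hand. */
struct charset_ops {
    const char *name;                               /* "CP932", "UTF-8", ... */
    size_t (*char_len)(const char *s);              /* bytes in next char */
    const char *(*find_char)(const char *s, int c); /* charset-aware strchr */
    int (*casecmp)(const char *a, const char *b);   /* caseless compare */
};

extern const struct charset_ops cp932_ops, eucjp_ops, utf8_ops;

/* every caller must carry the right ops pointer with every string */
static int path_equal(const struct charset_ops *ops,
                      const char *a, const char *b)
{
    return ops->casecmp(a, b) == 0;
}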

I don't see the reason why things have to be so complex.

I mean, I know what you're saying, because that's how we used to
solve the problem in 2.2.x. But look at the 3.0 code. The solution
code we added is all gone?!


Think carefully. Why is it gone?!? Did Tridge spite us?
Of course not! It is because such a structure adds unrealistic
complexity to Samba.

In 2.2.x, we were only focusing on being multi-lingual, like English
plus Japanese. It was terrible, but we could handle it.

Now in 3.x and above, we have to aim for i18n; we can't say "use xxx
in the case of yyy". We shouldn't. We need one single style which
solves everything, at least everything that Windows can handle. And
isn't that the reason why you selected UTF-8 as the FS charset too?

Hence, we need an internal charset independent of the unix charset.
Even if our customers wish for CP932 or EUC or EBCDIC as the unix
charset (don't ask me why, and don't blame me), the Samba internal
charset should not be affected by that selection.
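
In other words: convert once at the boundary, and everything inside
speaks one fixed wide charset. Something like this (hypothetical
names, sketch only):

/* All conversion happens at the edges, wrapping iconv from whatever
 * 'unix charset' the admin configured.  Internal code never sees
 * CP932, EUC, or EBCDIC bytes at all. */
#include <stddef.h>
#include <stdint.h>

typedef uint16_t smb_wide_t;    /* fixed 16-bit internal unit */

/* unix charset -> internal, at the point data enters Samba */
size_t pull_internal(smb_wide_t *dst, size_t dstn, const char *unix_src);
/* internal -> unix charset, at the point data leaves Samba */
size_t push_internal(char *unix_dst, size_t dstn, const smb_wide_t *src);

/* everything in between works on one known encoding */
int wide_name_equal(const smb_wide_t *a, const smb_wide_t *b);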


Michael> Start with just a stub implementation that supports what Samba has now.

A stub will not do a good enough job.

A stub works only when the Samba internal structures are designed
with it in mind. Unfortunately, current Samba is not. The string
structures do not even tell us which charset a specific string is
using.
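
What I mean is: today a string is a bare char pointer, so a CP932
string and a UTF-8 string have the same type, and nothing stops you
from mixing them. Even a minimal tag would help (sketch only; no such
type exists in Samba today):

#include <stddef.h>

enum charset_id { CH_UNIX, CH_DOS, CH_UTF8, CH_UCS2 };

struct tagged_string {
    enum charset_id cs;   /* which charset 'data' is encoded in */
    size_t len;           /* length in BYTES, not characters */
    char *data;
};

With that, a stub would at least know which table of routines to
dispatch to.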


>> P.S. I believe Mike and I are looking at the same Heaven's gate.
>> Only, Mike believes the gate is right nearby.
>> I believe what we have between the gate and us is a river.
>> So Mike says to run; I say to stop and build a bridge.
Michael> You need to get a nationalized initiative to ferry your people over to the
Michael> side of the river that everyone else is on. I understand you're not happy
Michael> with the situation but Unicode is the future for the vast majority of us.

Wrong. A ferry works fine only when the people are few.

You are wrong. We have more people on this side of the river.
You are forgetting Chinese, Korean, Vietnamese, Cyrillic, Greek, and
other languages which have characters outside ASCII.


Building the bridge is the one and only REALISTIC solution.
Or else, you should come here once and part the river, like Moses.


best regards,
---- 
Kenichi Okuyama


