i18n question.

Kenichi Okuyama okuyamak at dd.iij4u.or.jp
Sat Mar 6 10:01:23 GMT 2004


Dear Michael,

>>>>> "Michael" == Michael B Allen <mba2000 at ioplex.com> writes:
>> This includes moving from old system to new system.
>> Since file system ( and tar format ) does not care about
>> character set, when we make backup, they do no character
>> conversion. And many system administrator do not wish to try
>> conversion
Michael> There are many good tools for converting entire filesystems from one
Michael> encoding to another. Moving forward this will be the correct solution.

Not always.

Unicode does not fully contain what we had in CP932, EUC, or
JIS. There are 'machine dependent' characters which cause trouble.
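One way such trouble can surface is a byte sequence that does not survive a round trip through Unicode. The following is a minimal sketch, assuming Python's built-in "cp932" codec as a stand-in for whatever conversion tool is used; the function name is hypothetical and nothing here is Samba code. It scans the two-byte range and collects every sequence that comes back different, or not at all.

```python
# Illustration only: relies on Python's built-in "cp932" codec (an assumption,
# not the converter Samba or any particular migration tool uses).

def cp932_roundtrip_failures():
    """Return (original_bytes, decoded_text, re_encoded_bytes_or_None) for
    every two-byte CP932 sequence that changes on a round trip via Unicode."""
    failures = []
    lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
    for lead in lead_bytes:
        for trail in range(0x40, 0xFD):
            if trail == 0x7F:
                continue  # 0x7F is never a valid trail byte
            raw = bytes([lead, trail])
            try:
                text = raw.decode("cp932")
            except UnicodeDecodeError:
                continue  # unassigned in this codec, nothing to compare
            try:
                back = text.encode("cp932")
            except UnicodeEncodeError:
                back = None  # decodable but not re-encodable
            if back != raw:
                failures.append((raw, text, back))
    return failures

if __name__ == "__main__":
    failures = cp932_roundtrip_failures()
    print(len(failures), "two-byte sequences do not round-trip")
    for raw, text, back in failures[:5]:
        print(raw.hex(), repr(text), back.hex() if back else None)
```

A file name stored with any sequence this reports would get different bytes after a "convert to Unicode and back" pass, which is exactly the kind of surprise an admin cannot rule out without scanning every backup.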


>> ( I must say that Unicode do not REALLY fullfil
>> other Japanese character encoding. It is rarely used, but
>> most of admin do not wish to bet on their luck ).
Michael> Why would it be a matter of "luck"? If an illegal character sequence is
Michael> encountered the conversion utility should report an error. I don't believe
Michael> this as serious as you claim either. If you cannot represent your
Michael> filenames with Unicode you have a much bigger problem.

The currently running system is NOT the ONE AND ONLY thing we need to
face.

There are backups and old data, which may well contain
such troublesome characters.

Nobody really knows which characters are being used where. And
even if such troublesome characters were not in use when the IT
people planned the move from the old system to the new one, that does
not mean you will not face such characters on the day you really move.

Hence, even if you have thousands of conversion tools, you still need luck.

And most IT people do not wish to rely on luck; rather, they
stick to the old charsets.


I agree that they are simply escaping from risk, and will have to
pay for it someday. I definitely agree that moving to UTF-8 NOW is
far better.

BUT! When the day comes that moving to UTF-8 is THE ONLY CHOICE, it
is no longer the IT people's fault that they MOVE. It is not the IT
people's fault for telling marketing, sales, development and other
organizations:

"stop using those troublesome characters"
"you can't use that character anymore"
"you will no longer be able to access such filenames"

The reason comes from the outside world. NOW THEY HAVE ENOUGH EXCUSE TO
FORCE SUCH RULES.


In other words, until that day, they have to explain that moving to
UTF-8 NOW is MORE COST-EFFECTIVE, which is nearly impossible to do.



Michael> Cyrillic.

Thanks. I'll remember.


>> Did you know that CP932 has Russian characters in its character set?
>> 
>> We found that in some versions of Windows, Russian characters have to
>> be treated case-insensitively. We had patches for this in
>> 2.2.*, but they were lost when we moved to 3.0.

Michael> What versions of Windows would that happen to be? Can you provide
Michael> specifics on this?
>> Hence, I'd like to suggest that 'INTERNAL character code' should be
>> something like UCS2, fixed length per character. Windows have
>> selected UCS2 as character set which is easiest for them to
>> manipulate. That means, as long as we use UCS2 for internal code, we
>> will not face big problem.
Michael> I don't agree. The optimal encoding is the filesystem encoding and the
Michael> most favorable filesystem encoding is UTF-8.

Reason?


Well, what I mean by "Reason?" is this.

I don't agree with you that "the optimal encoding is the filesystem
encoding", because I have seen many filesystems using CP932 which
have problems with '\' handling.
# As you know, '\' has a special meaning in C, which time after time
# causes trouble, not only in Samba but in other programs too.
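The mechanics behind this are worth spelling out: CP932 is a double-byte encoding in which byte 0x5C, the ASCII backslash, can also appear as the second byte of a character. Code that scans raw bytes for '\' then splits such a character in half. A minimal sketch, assuming Python's built-in "cp932" codec; the helper name is hypothetical:

```python
# Illustration only: shows how byte 0x5C turns up inside CP932 characters
# that contain no backslash at all. Uses Python's "cp932" codec.
BACKSLASH = 0x5C

def has_backslash_trail_byte(text: str) -> bool:
    """True if the CP932 bytes of `text` contain 0x5C even though
    `text` itself has no backslash character."""
    raw = text.encode("cp932")
    return "\\" not in text and BACKSLASH in raw

if __name__ == "__main__":
    for ch in ("表", "ソ", "あ"):
        print(ch, ch.encode("cp932").hex(), has_backslash_trail_byte(ch))
```

Any byte-oriented routine that treats 0x5C as a path separator or escape character will misread names containing characters like the first two.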

I do agree with you that one of the most favorable filesystem
charsets is UTF-8, but the FS charset is not something we have
control over.

Hence, I believe:
1) we need the internal code to be independent of the filesystem charset.
   Even if we are not using UTF-8 as the filesystem charset, we do not
   want to face those charsets running around inside smbd/nmbd.


We do not really know what kinds of traps Microsoft has ready for us.
The current design seems to assume that there will be no
case-insensitive traps outside ASCII. But I doubt that, and disagree.
# Current Samba 3.x seems to have Cyrillic case-insensitive code at the
# charset conversion point, and that's the reason why we pass the tests. It
# is not that we only have traps in the ASCII range.

Hence, I believe:
2) we need the internal code to be EQUALLY HANDY for all characters.
   Not only ASCII characters should be easy to handle; ANY character
   should be easy to handle.
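To make the difference concrete, here is a small sketch comparing an ASCII-only case fold on UTF-8 bytes with a per-character fold. It is an illustration in Python, not Samba's code, and the function names and sample file names are made up for the example.

```python
# Illustration only: why case-insensitive matching that is "easy for ASCII"
# silently fails for Cyrillic names.

def ascii_only_equal(a: bytes, b: bytes) -> bool:
    """Compare byte strings after folding ASCII A-Z only.
    bytes.lower() leaves every non-ASCII byte untouched."""
    return a.lower() == b.lower()

def per_character_equal(a: str, b: str) -> bool:
    """Compare decoded strings with Unicode case folding applied to
    every character, ASCII or not."""
    return a.casefold() == b.casefold()

if __name__ == "__main__":
    upper, lower = "ФАЙЛ.TXT", "файл.txt"
    print(ascii_only_equal(upper.encode("utf-8"), lower.encode("utf-8")))
    print(per_character_equal(upper, lower))
```

The byte-level comparison handles ".TXT" versus ".txt" but treats the Cyrillic part as different, while the per-character comparison treats both names as equal.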



But you say no. I want to know WHY.
At least, you agree with me about 1, don't you?

# Or do you say the filesystem charset is best even in the case of JIS?


>> - Conversion between UCS2 and UTF-8 is very quick.
Michael> It is noticeably slower than no conversion at all.

Who promised you that the unix IO charset IS UTF-8?
SHE/HE should at least promise that to me, too... So that I can get
away from here (T-T).


Michael> I agree. This is pretty bad. You should really be converting to UTF-8
Michael> wherever possible.

Now, now, now.  Don't say that to ME. Do you think I make the decision
about what charset to use on the FS? ;-)
# There were so many times I wished I could, but I never could.

The very fact that you are PUSHING UTF-8 is no reason for UTF-8 to be
used on the filesystem. Hence, we cannot assume that the unix charset IS
UTF-8.


THAT'S the reason why we need an 'internal charset' independent of the
unix charset. Though I push UCS2 and you push UTF-8, I think you
understand that the 'internal charset' SHOULD BE
independent of the 'unix charset'. It should never change, no matter
what we may have on the wire or on the FS.
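The idea can be sketched as conversion at the boundary only. This is a rough Python illustration, not Samba's design: the function names are hypothetical, and UTF-16LE is used as a stand-in for UCS2 (the two are identical for characters in the Basic Multilingual Plane).

```python
# Illustration only: convert at the edges, keep one fixed internal form.
# "utf-16-le" stands in for UCS2 here (an assumption for this sketch).
INTERNAL = "utf-16-le"

def from_unix(raw: bytes, unix_charset: str) -> bytes:
    """Convert a name read from the filesystem into the internal form."""
    return raw.decode(unix_charset).encode(INTERNAL)

def to_unix(internal: bytes, unix_charset: str) -> bytes:
    """Convert an internal name back to the filesystem's charset."""
    return internal.decode(INTERNAL).encode(unix_charset)

if __name__ == "__main__":
    name = "表.txt"
    for charset in ("cp932", "euc_jp", "utf-8"):
        internal = from_unix(name.encode(charset), charset)
        # The internal bytes are the same whatever the FS charset was.
        print(charset, internal.hex())
```

Everything between the two boundary functions sees the same bytes regardless of which charset the administrator chose for the filesystem.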


Michael> I don't know what the situation is like in Japan but I would think
Michael> conversion to UTF-8 would be the highest priority. If you complained that
Michael> UTF-8 is too slow that is a valid argument. But coding for UTF-8 is not
Michael> more or less difficult regardless of what language it is being used to
Michael> represent.

Mike.

The biggest problem is 'Neither you nor I have the right to decide what
charset to use for the FS in Japan' :-)
# And I don't think you have that right over Korea or China, do you?

Hence, we have to face the FS charset chaos, no matter how much you may
hate it. We cannot assume that the FS uses UTF-8, nor that it is moving
toward UTF-8.

I'm saying this after struggling in this chaos for 15 years...
Believe me. It will not change for the next 5 years or so. At least, not
in the areas where Samba is applied.


best regards,
---- 
Kenichi Okuyama

P.S. I believe Mike and I are looking at the same Heaven's gate.
     Only, Mike believes the gate is right nearby.
     I believe what we have between the gate and us is a river.
     So Mike says to run, I say to stop and build a bridge.



More information about the samba-technical mailing list