i18n question.

Kenichi Okuyama okuyamak at dd.iij4u.or.jp
Sun Mar 7 08:29:31 GMT 2004


Dear Andrew,

First of all, I think I need to apologise for my bad English.  It
seems you have got the wrong idea from what I've been trying to
explain.


>>>>> "Andrew" == Andrew Bartlett <abartlet at samba.org> writes:
Richard> Are you saying that UTF-8 does not encode all the Japanese glyphs that 
Richard> Japanese people want to use?

Yes and No.

Yes, UTF-8 does not have all the glyphs we want to use.
No, that's not the point I'm making.


Richard> This is a genuine question. I do not know the answer and it seems to me
Richard> like that is what you are saying.

What I'm saying is,

1) We cannot move the "unix FS charset" to UTF-8 easily.
   Hence, what Andrew is proposing:
	"move the unix FS charset to UTF-8, and we will have no
	 problem with a mismatch between an internal charset of
	 UTF-8 and a unix charset of something else"
   is impossible.


Andrew> Does the failure to encode that Japanese glyph in UTF8 matter?

Of course!!!! How would you feel if your system gave you '?' in place
of every 'e' character? Andr?w?
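
To make that concrete, here is a minimal sketch of the loss. It is only
an illustration: ASCII stands in for any charset that lacks a character
we need, and the name "Andréw" is made up for the example.

```python
# Lossy conversion: a character missing from the target charset becomes '?'.
name = "Andr\u00e9w"  # "Andréw", a made-up name containing U+00E9

# ASCII is only a stand-in for "a charset without the glyph we need".
lossy = name.encode("ascii", errors="replace")
print(lossy)  # b'Andr?w'

# Converting back does not restore the original character.
restored = lossy.decode("ascii")
print(restored == name)  # False
```

Once the '?' has been written, nothing in the converted name says which
character used to be there.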


Andrew> Against a Samba server, assuming we find characters that are in our
Andrew> file-system that are not in UTF16, what should we do with them?  Should
Andrew> we mangle them (which would cause them to exist, but which would render
Andrew> them even less readable)?

Andrew, it is not a matter of what comes over the wire.
I thought you understood what we were talking about....


It's the FILE SYSTEM CHARSET that matters. It's the FILE SYSTEM
CHARSET, or UNIX CHARSET, and the INTERNAL CHARSET. I thought that's
what we've been talking about for these two days.
# or is it three days for you?

And that is because you keep on saying we have to use UTF-8 as the
FILE SYSTEM CHARSET.


Customers are not newcomers. Customers have lots of resources
they've already created. Some of them were even unix users, who had
a CP932 dialect on their system. They've moved to Windows 95, with
their CP932 dialect.

And they have to have those files on new Samba servers too. And due
to the dialects, even if we see a character code, we can't really
tell what the hell that character's glyph really is.

If that CP932 file was created on the MS-NEC dialect, it should
mean glyph X; if on the MS-IBM dialect, it should mean glyph Y.  Both
X and Y glyphs may be in Unicode, but we cannot tell which dialect
was used. WE CAN NOT! There is no record of which dialect was used.
Just like Samba's internal strings do not carry information about
which charset they hold.
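
A rough sketch of the same kind of ambiguity, under one loud
assumption: it does not reproduce the MS-NEC / MS-IBM case itself. It
uses two mapping tables that ship with Python ("shift_jis" and
"cp932") only to show that identical bytes can mean different Unicode
characters depending on which table is assumed.

```python
# One two-byte code, as it might be stored in a file name.
raw = b"\x81\x60"

# The bytes do not say which mapping table they were written under,
# so the reader has to guess. Two guesses, two different results.
as_jis = raw.decode("shift_jis")  # JIS-based mapping table
as_ms = raw.decode("cp932")       # Microsoft mapping table

print(f"U+{ord(as_jis):04X}")  # U+301C (WAVE DASH)
print(f"U+{ord(as_ms):04X}")   # U+FF5E (FULLWIDTH TILDE)
print(as_jis == as_ms)         # False
```

Both results are valid Unicode characters, yet nothing in the bytes
tells us which one the person who created the file meant.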

So, nobody can really say there will be no problem moving the FS
charset to UTF-8. And unless it is really safe, or unless they
themselves accept those incompatibilities, they will not change the
FS charset to UTF-8.


If we cannot move the customer's FS charset to UTF-8, we have to face
that the FS charset may not be (and unfortunately, usually is not)
UTF-8. And we have to start from here.

                      FS CHARSETS ARE NOT UTF-8.
	         WE CAN NOT FIX FS CHARSET TO UTF-8.
        SO SAMBA HAS TO HANDLE FS CHARSETS THAT ARE NOT UTF-8.


Now let's get back to what we should use as Samba internal.

" DO YOU STILL BELIEVE WE SHOULD USE unix charset AS internal code?
  OR DO YOU THINK FIXING UTF-8/UTF-16 AS internal code NO MATTER
  WHAT unix charset MAY BE, IS BETTER? "


I agree that Unicode is far better for the internal processing that
Samba needs to do. But if you say we have to use the unix charset as
internal code, YOU ARE FORCING US NOT TO USE UNICODE!! That's
what'll happen, and that's what I'm complaining about.

	!WE WANT SAMBA INTERNAL CODE TO BE FIXED TO UNICODE!
	        !NO MATTER WHAT UNIX CHARSET MAY BE!
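
A minimal sketch of the shape I mean, in Python rather than Samba's C.
All function names here are hypothetical and are not Samba APIs; the
only idea shown is "convert at the edges, keep Unicode inside".

```python
def fs_to_internal(raw_name: bytes, unix_charset: str) -> str:
    """Decode a name read from the file system into the internal Unicode form."""
    return raw_name.decode(unix_charset)


def internal_to_fs(name: str, unix_charset: str) -> bytes:
    """Encode an internal name for the file system.

    Strict error handling is used so that an unencodable character is
    reported instead of being silently turned into '?'.
    """
    return name.encode(unix_charset)


def internal_to_wire(name: str) -> bytes:
    """Encode an internal name as UTF-16 (little endian) for the wire."""
    return name.encode("utf-16-le")


# The same file name stored on two servers with different unix charsets.
on_cp932 = "日本語.txt".encode("cp932")
on_eucjp = "日本語.txt".encode("euc_jp")

a = fs_to_internal(on_cp932, "cp932")
b = fs_to_internal(on_eucjp, "euc_jp")

print(on_cp932 == on_eucjp)                    # False: the bytes on disk differ
print(a == b)                                  # True: internally they are one name
print(internal_to_fs(a, "cp932") == on_cp932)  # True: round trip to the FS
```

This does not remove the dialect problem: the decode step still has to
be told which table to use. It only keeps that question at the edge,
so the internal code does not change with the unix charset.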

I thought you understood this point...

regards,
---- 
Kenichi Okuyama

P.S. Using UTF-16 is better than UTF-8, but that's another story now.


