i18n question.

Sun Mar 7 09:04:07 GMT 2004

On Sun, 2004-03-07 at 19:29, Kenichi Okuyama wrote:
> Dear Andrew,
> 
> First of all, I think I need to apologise for my bad English.  It
> seems you have got the wrong idea from what I've been trying to
> explain.
> 
> 
> >>>>> "Andrew" == Andrew Bartlett <abartlet at samba.org> writes:
> Richard> Are you saying that UTF-8 does not encode all the Japanese glyphs that 
> Richard> Japanese people want to use?
> 
> Yes and No.
> 
> Yes, it does not have all the glyphs we want to use.
> No, that's not the point I'm making.
> 
> 
> Richard> This is a genuine question. I do not know the answer and it seems to me 
> Richard> like that is what you are saying. 
> 
> What I'm saying is,
> 
> 1) We can not move "unix FS charset" to UTF-8 easily.
>    Hence, what Andrew is proposing:
> 	"move unix FS charset to UTF-8 and we will have no problem
> 	 with mismatch of internal charset of UTF-8, and unix
> 	 charset of something else"
>    is impossible.

I made no such proposal.  I strongly oppose any idea that we should have
an 'internal charset' that is different to what we have on disk.  Read
all my mail before you reply.

I still consider that UTF8 appears to be the optimal encoding for
filenames that Samba must process, and strongly suggest that all systems
move to UTF8 as soon as administratively possible.  We should support as
much as we can, but UTF8 will always be our preferred charset.  I am yet
to be given a 'good reason' why, with UTF16 as our wire charset, UTF8
support in Samba is insufficient for Japanese environments.

Naturally, legacy systems will always impede such a migration, but they
too will need to be fixed at some point.  (Samba is not the only
application this bites).

> Andrew> Does the failure to encode that Japanese glyph in UTF8 matter?
> 
> Of course!!!! How would you feel if your system gave you '?' for
> every 'e' character? Andr?w?

Sure, but 'e' is a character windows also recognises.  If 'e' were not
recognised by the windows client, why would it matter that Samba could
use it, if it could never tell windows about it?
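
As a rough illustration of the '?' substitution being discussed (a sketch
only, not Samba code): a lossy conversion replaces any character the target
charset cannot represent. The helper names `lossy_encode` and
`is_representable` are made up for this example, Python's codecs stand in
for whatever conversion library a server really uses, and ASCII stands in
for a charset that lacks the character.

```python
def lossy_encode(name: str, charset: str) -> bytes:
    """Encode name into charset, substituting '?' for unrepresentable characters."""
    return name.encode(charset, errors="replace")


def is_representable(name: str, charset: str) -> bool:
    """Return True if every character of name can be encoded in charset."""
    try:
        name.encode(charset, errors="strict")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    sample = "Andr\u00e9"  # hypothetical name containing one non-ASCII letter
    # ASCII is only a stand-in for "a charset missing this character".
    print(lossy_encode(sample, "ascii"))
    print(is_representable(sample, "ascii"))
    print(is_representable(sample, "utf-16-le"))
```

A check like `is_representable` is one way to detect such names before they
are converted, rather than discovering the '?' afterwards.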

> Andrew> Against a Samba server, assuming we find characters that are in our
> Andrew> file-system that are not in UTF16, what should we do with them?  Should
> Andrew> we mangle them (which would cause them to exist, but which would render
> Andrew> them even less readable)?
> 
> Andrew, it is not a matter of what comes over the wire.
> I thought you understood what we were talking about....
> 
> 
> It's FILE SYSTEM CHARSET that matters. It's FILE SYSTEM CHARSET or
> UNIX CHARSET and INTERNAL CHARSET. I thought that's what we've been
> talking about for these two days.

There is only 'unix charset'.  Anything else is asking for insanity...

> # or is it three days for you?
> 
> And that is because You keep on saying we have to use UTF-8 as FILE
> SYSTEM CHARSET.

Again, I strongly suggest all systems move to UTF8.  

> Customers are not newcomers. Customers have lots of resources
> they've already created. Some of them were even unix users, who had
> a CP932 dialect on their system. They've moved to Windows 95, with
> their CP932 dialect.
> 
> And they have to have those files on new Samba servers too. And due
> to dialect, even if you see one character code, we can't really tell
> what the hell that character's glyph really is.
> 
> If that CP932 file was created under the MS-NEC dialect, it should
> mean glyph X; if under the MS-IBM dialect, it should mean glyph Y.  Both
> X and Y glyphs may be in Unicode, but we can not tell which dialect
> they are using. WE CAN NOT! There is no record of which dialect
> they are using. Just as Samba's internal strings do not carry
> information about which charset they are holding.
> 
> So, nobody can really say there will be no problem moving FS charset
> to UTF-8. And unless it is really safe, or unless they themselves
> accept those incompatibilities, they will not change FS charset to UTF-8.

The same charset conversion that will occur for filesystem translation
occurs for *every* samba operation.  If there is an issue, then it will
be seen every day.  As such, this issue cannot be ignored, just because
the fs is not yet UTF8.
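
A minimal sketch of that point, under stated assumptions: every name that
goes to a client is decoded from the unix charset and re-encoded as UTF-16,
so the choice of mapping table shows up on the wire whether or not the
filesystem is ever migrated. The functions `disk_to_wire` and `wire_to_disk`
are hypothetical, not Samba's API. Python's `shift_jis` and `cp932` codecs
are used here only as two stand-in "dialects" that disagree about one byte
pair; they are not the MS-NEC and MS-IBM dialects mentioned above.

```python
def disk_to_wire(raw_name: bytes, unix_charset: str) -> bytes:
    """Convert an on-disk name to UTF-16LE, as would happen for each reply."""
    return raw_name.decode(unix_charset).encode("utf-16-le")


def wire_to_disk(wire_name: bytes, unix_charset: str) -> bytes:
    """Convert a UTF-16LE name from the client back to the unix charset."""
    return wire_name.decode("utf-16-le").encode(unix_charset)


if __name__ == "__main__":
    raw = b"\x81\x60"  # one double-byte character as stored on disk

    via_sjis = disk_to_wire(raw, "shift_jis")
    via_cp932 = disk_to_wire(raw, "cp932")

    # The same bytes yield different wire characters depending on the table.
    print(via_sjis.hex(), via_cp932.hex(), via_sjis != via_cp932)

    # Within one consistent table the round trip is lossless.
    print(wire_to_disk(via_cp932, "cp932") == raw)
```

The sketch only shows that the ambiguity is a property of the conversion
table, not of where the converted name ends up being stored.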

> If we can not move customer's FS charset to UTF-8, we have to face
> that FS charset may not be (and unfortunately, usually is not) UTF-8.
> And we have to start from here.
> 
>                       FS CHARSETS ARE NOT UTF-8.
> 	         WE CAN NOT FIX FS CHARSET TO UTF-8.
>         SO SAMBA HAS TO HANDLE FS CHARSETS THAT ARE NOT UTF-8.

I have never objected to this.  I think we should support non-UTF8
filesystems, but I don't think we should rewrite all of samba to make it
'fast'.

> Now let's get back to what we should use as Samba internal.
> 
> " DO YOU STILL BELIEVE WE SHOULD USE unix charset AS internal code?
>   OR DO YOU THINK FIXING UTF-8/UTF-16 AS internal code NO MATTER
>   WHAT unix charset MAY BE, IS BETTER? "

NO.  

> I agree that Unicode is far better for the internal processing that
> Samba needs to do. But if you say we have to use unix charset as
> internal code, YOU ARE FORCING US NOT TO USE UNICODE!! That's
> what'll happen, and that's what I'm complaining about.
> 
> 	!WE WANT SAMBA INTERNAL CODE TO BE FIXED TO UNICODE!
> 	        !NO MATTER WHAT UNIX CHARSET MAY BE!

This leads to madness.  See tridge's post on the system call multiplier
effect.  There are easier ways to solve this.

Andrew Bartlett

-- 
Andrew Bartlett                                 abartlet at pcug.org.au
Manager, Authentication Subsystems, Samba Team  abartlet at samba.org
Student Network Administrator, Hawker College   abartlet at hawkerc.net
http://samba.org     http://build.samba.org     http://hawkerc.net