i18n question.

Sun Mar 7 21:50:08 GMT 2004

Kenichi,

I don't think you CC'd your reply to the mailing list (or anyone else
for that matter). From the content I think you clearly meant everyone
to see it, so I hope you don't mind me replying publicly.

 > Also, let me concentrate the list on why we want to have "internal
 > charset", and not to mention about WHAT CHARSET to use. So, I will
 > not talk about performance for now.

ok

 > 1) Reliability of code
 >    If we can have internal charset, and everything is being treated
 >    in single charset until very last moment, We only need to
 >    maintain program for that internal charset only.
 ...

This argument doesn't convince me. As long as we stick to the "rules"
about what we can assume of an internal charset, we will be
OK. Then we just need good test suites, including tests that choose
unusual charsets.

 > 2) Can manipulate two or more unix charset on single Samba
 > 
 >    Let's suppose we have two file system. Old one uses EUC, new one
 >    uses UTF-8. This can be easily done simply by mounting two
 >    storage. And assume that Samba can treat two unix charset at same
 >    time. We want people to move from EUC to UTF-8, but we do not
 >    want to be blamed due to file name inconsistency. Now we can say
 >    to our customer:
 ...

This is a good reason. What I'm not completely sure of is whether it
is a good enough reason to go through the pain of separating internal
and filesystem charsets. It's a lot of pain, so there needs to be a
lot of payoff to do it!

We could cope with this problem in other ways. For example, you could
have two server aliases for the same machine, and have different
configurations for those two aliases. Then you export the shares that
use one charset under one alias and the shares that use a different
charset under a different alias. That isn't as convenient, but it does
avoid the problem.

Let's look for a moment at what we would have to do to cope with two
different charsets on different shares.

 *) What would we do with smb.conf parameters? Are they in a different
  charset per share? What do we do for global smb.conf parameters?
  What about when global parameters are inherited by a share?

 *) What charset do we assume for C library calls like getpwnam()?

 *) What charset do we pass to external programs like our script
  hooks?

I'd like to propose a compromise instead. In Samba4 we have a much
cleaner separation between frontend and backend than we have in
Samba3. This separation is achieved via the NTVFS layer. I would like
to propose that we do this:

 *) assume "internal charset" == "unix charset", like we do now

 *) build a "charset translation" NTVFS module that can be used in
  those less common cases where you wish to use a different charset
  for some shares.

 *) the "charset translation" module would be very small, and would
  take one parametric parameter per instance. That parameter would say
  what charset to translate to. 

 *) the module would be a pass-through module, so you would configure
  it along with any other modules you define for the share, and it
  would filter requests on the way through (just like an anti-virus or
  audit module).

 *) the module would have a performance penalty, but that penalty
  would only be paid by shares that use a charset that is not the same
  as the global unix charset for the server. You can use whatever
  fancy cache schemes you like to try to reduce this performance
  penalty if you think it is worthwhile.

So we would have a share something like this:

[legacy]
	ntvfs module = charset-translate
	translate:charset = EUC
	path = /legacy

Does that sound OK?
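
To make the size of the proposed module concrete, here is a rough
sketch of the one conversion step it would apply to each path on the
way through. This is not the NTVFS interface; translate_name() is a
hypothetical helper built on the POSIX iconv() calls, and the charset
names you can pass depend on what the local iconv implementation
supports.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a Samba function): convert the
 * NUL-terminated name `in` from from_charset to to_charset, writing a
 * NUL-terminated result into `out`. Returns 0 on success, -1 on any
 * failure (unknown charset, invalid input, or output buffer too small).
 */
int translate_name(const char *from_charset, const char *to_charset,
                   const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in; /* iconv() wants char **, input is not modified */
    char *outp = out;
    size_t in_left;
    size_t out_left;
    size_t rc;

    assert(in != NULL && out != NULL);
    if (out_size == 0) {
        return -1;
    }
    in_left = strlen(in);
    out_left = out_size - 1; /* keep one byte for the terminating NUL */

    cd = iconv_open(to_charset, from_charset);
    if (cd == (iconv_t)-1) {
        return -1;
    }

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    if (rc != (size_t)-1) {
        /* flush any pending shift sequence for stateful charsets */
        rc = iconv(cd, NULL, NULL, &outp, &out_left);
    }
    iconv_close(cd);

    if (rc == (size_t)-1) {
        return -1;
    }
    *outp = '\0';
    return 0;
}
```

A real module would call something like this in both directions: names
coming from the client get converted to the share's charset before
they reach the backend, and names returned by the backend get
converted back.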

On another topic, I know you also would like to use UTF-16 as the
internal charset. One of the main problems with doing that is the fact
that many C compilers on unix systems only support null-terminated
8-bit string constants. On Windows they have compiler support for wide
characters but we don't on unix. Without that compiler support it is a
nightmare dealing with all of the string constants we have to deal
with in CIFS. Do a search in Samba for all string constants, and think
about how you would convert all those to UTF-16 within the bounds of
ANSI C. That's one of the main reasons why I want the internal charset
to be ASCII-compatible.
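
As a small illustration of the string constant problem: with an
ASCII-compatible internal charset, a constant like "IPC$" can be
compared with plain strcmp(). With UTF-16 and only 8-bit literals
available, every such constant has to be widened or compared through a
helper at run time. The two helpers below are hypothetical (they are
not Samba functions), and uint16_t from <stdint.h> is assumed to be
available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: widen a 7-bit ASCII string into a
 * NUL-terminated UTF-16 buffer of dst_units 16-bit units.
 * Returns the number of units written (excluding the terminator),
 * or (size_t)-1 if the input is not ASCII or does not fit.
 */
size_t ascii_to_utf16(const char *src, uint16_t *dst, size_t dst_units)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; src[i] != '\0'; i++) {
        if (i + 1 >= dst_units || (unsigned char)src[i] > 0x7f) {
            return (size_t)-1;
        }
        dst[i] = (uint16_t)(unsigned char)src[i];
    }
    if (i >= dst_units) {
        return (size_t)-1; /* no room for the terminator */
    }
    dst[i] = 0;
    return i;
}

/*
 * Hypothetical helper: compare a NUL-terminated UTF-16 string with an
 * 8-bit ASCII constant. Returns 1 if they are equal, 0 otherwise.
 */
int utf16_equals_ascii(const uint16_t *w, const char *ascii)
{
    size_t i;

    for (i = 0; ascii[i] != '\0'; i++) {
        if (w[i] != (uint16_t)(unsigned char)ascii[i]) {
            return 0;
        }
    }
    return w[i] == 0;
}
```

Every place that today reads strcmp(name, "IPC$") == 0 would need to
go through something like utf16_equals_ascii(name, "IPC$") instead,
and the same goes for formatting, logging and parsing code that mixes
constants with internal strings.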

Cheers, Tridge