i18n question.

Sun Mar 7 16:21:40 GMT 2004

Hello all,
From what I've seen of this thread, it seems we do not agree on the
basic reasoning behind the two points of view.

I would like to put a set of questions to the
Japanese-Encodings-Problems-Party out there and try to explain some of
the decisions taken on the "occidental" side of the team.

Please don't be picky about some terms; corrections are welcome.

First of all, let's try to define the problem.
In this phase I would like to ask some questions.
If you want to reply to this message, please first answer my questions
precisely. If a question sounds ambiguous to you, please do not skip
it, but ask me what your concerns are.

I know this seems a bit pedantic as a way to move on, but ours are
_basically_ comprehension problems: I say A, another understands B and
says C, a third sees C and screams that B is not A, and so on ...

Questions:

1. How do you currently see problematic chars (written on the FS with
EUC-JP or CP932) under Windows? Do your Windows workstations use UCS-2,
or are they modified to use something else?

2. I assume UCS-2 is used in Windows clients. Given that, what
conversion problem will ever matter? If you have a problem with Unicode,
you have it in any case, because the client has it and there is nothing
we (Samba) can do. Am I right?

3. At one point someone said that in Japan some systems make
assumptions about the character set used inside the system libraries.
Can anyone confirm this? (With links to docs if possible.)

Clarifications (hope so):

1. At some point I saw this little scheme:

Current Samba 3.0:
  Windows -(Convert)-> Samba           -> Filesystem
    UCS-2              Unix charset       Unix charset

Our suggestion:
  Windows -----------> Samba -(Convert)-> Filesystem
    UCS-2              UCS-2              Unix charset

And later on:

  Windows -----------> Samba -(Convert)-> Filesystem
    UTF-16             UTF-16             Unix charset

  Windows -(Convert)-> Samba -(Convert)-> Filesystem
    UTF-16             UTF-8              Unix charset

I would like to explain why the first scheme is OK and the last three
are not.

First of all, I would like to say that we have thought about these
problems a lot in the team. I remember speaking with Tridge about
exactly the possibility of keeping both the UCS-2 and the unix charset
string in the same structure, and rewriting all our functions to keep
both strings up to date during every manipulation (this was after CIFS
2001, I believe, or during CIFS 2002). We discussed many possibilities,
like UCS-2 internally with conversion at the barriers and such. All
were tempting, but all were finally discarded as unsuitable.

First of all, we have a C-library constraint:
- a string can contain anything except '\0' chars.

Second: much of the Samba code assumes that.
- changing this would require a _considerable_ effort and cause a lot
of trouble.
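
To make the first two points concrete, here is a small sketch in Python
(not Samba code; the helper names are made up for illustration). It
shows that UCS-2/UTF-16 text is full of zero bytes even for a plain
ASCII name, so a routine that treats '\0' as the terminator sees only
the first character, while the usual unix charsets contain no such
bytes.

```python
def c_strlen(raw):
    """Length a C-style routine would see: bytes before the first NUL."""
    end = raw.find(b"\x00")
    return len(raw) if end == -1 else end


def has_embedded_nul(text, charset):
    """True if encoding text in charset yields at least one zero byte."""
    return b"\x00" in text.encode(charset)


name = "report.txt"
for charset in ("utf-16-le", "utf-8", "euc_jp", "cp932"):
    raw = name.encode(charset)
    # Columns: charset, real byte length, length seen by a NUL-terminated
    # routine, whether the encoded form contains a zero byte.
    print(charset, len(raw), c_strlen(raw), has_embedded_nul(name, charset))
```

Only the utf-16-le row reports a C-style length shorter than the real
byte length.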

Third: we have real performance issues when trying to convert on the
fly on the unix side instead of the Windows side.
This is one of the most important points and I would like to show you
why.
Let's go back to the schema used to simplify character flows:

  Windows -----A-----> Samba -----B-----> Filesystem

Now this schema is a liar, and making assumptions based on it is the
source of many problems. Thinking with this schema in mind may lead you
to believe that converting at frontier 'A' is the same as converting at
frontier 'B'. This is plain _wrong_!

The right schema is something like this:

               A                  B
               |                  |
  Windows -----------> Samba -----------> Filesystem (1)
                             -----------> Filesystem (2)
                                   .
                                   .
                             -----------> Filesystem (N)

so the cost of A != the cost of B:
the cost of B is multiplied N times.

This is because a single SMB operation often requires a variable number
of filesystem operations, and that number may be very high.

So translating at point A is not the same as translating at point B.
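
The difference can be sketched by simply counting conversions. This is
an illustrative Python model, not Samba code: CountingConverter,
fake_fs_op and the choice of EUC-JP as unix charset are all assumptions
made for the example, and a real request would also convert names
coming back from the filesystem.

```python
class CountingConverter:
    """Charset converter that counts how many times it is called."""

    def __init__(self, src, dst):
        self.src = src
        self.dst = dst
        self.calls = 0

    def convert(self, raw):
        self.calls += 1
        return raw.decode(self.src).encode(self.dst)


def fake_fs_op(name):
    """Stand-in for stat(), open(), readdir() and so on; does no real I/O."""
    return len(name)


def convert_at_a(wire_name, n_fs_ops):
    """Convert once when the request arrives, then reuse the result."""
    conv = CountingConverter("utf-16-le", "euc_jp")
    unix_name = conv.convert(wire_name)
    for _ in range(n_fs_ops):
        fake_fs_op(unix_name)
    return conv.calls


def convert_at_b(wire_name, n_fs_ops):
    """Keep the wire form internally and convert at every filesystem call."""
    conv = CountingConverter("utf-16-le", "euc_jp")
    for _ in range(n_fs_ops):
        fake_fs_op(conv.convert(wire_name))
    return conv.calls


wire_name = "日本語.txt".encode("utf-16-le")
# One SMB operation that needs 50 filesystem operations.
print(convert_at_a(wire_name, 50), convert_at_b(wire_name, 50))
```

With N = 50 the model performs one conversion at A and fifty at B.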

At this point someone may object that we could solve the B issue by
using a conversion cache, so that at B we always have both the
so-called "internal charset" and the "filesystem charset" string in the
same structure and carry them along.

But let's avoid oversimplification once again, as we would incur other
big penalties:
- you have to build a library of functions that always keep the two
strings in sync whenever you modify them. This often means that you
operate on UCS-2 (UCS-4, whatever) and then have to call iconv to keep
the "filesystem charset" string in sync, which may reintroduce a lot of
conversions in some code paths and slow Samba down unacceptably once
again (to the point of being unusable).
- you have to recode the whole Samba code base, probably reintroducing
lots of bugs. It is a huge amount of work and requires _very_ strong
motivation to be worth it.
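
The first penalty can be sketched as follows. DualString is a
hypothetical structure invented for this example (nothing like it
exists in Samba), with Python's codecs standing in for iconv. Every
modification of the internal form forces a re-conversion of the
filesystem form, so the "cache" saves nothing on code paths that build
or edit strings.

```python
class DualString:
    """Hypothetical string that carries an internal and a filesystem form."""

    def __init__(self, text, fs_charset):
        self.fs_charset = fs_charset
        self.conversions = 0
        self.internal = text.encode("utf-16-le")
        self._resync()

    def _resync(self):
        # Stand-in for an iconv call: rebuild the filesystem form from
        # the internal one so that the two never disagree.
        text = self.internal.decode("utf-16-le")
        self.fs_form = text.encode(self.fs_charset)
        self.conversions += 1

    def append(self, text):
        """Modify the internal form, then pay for a conversion."""
        self.internal += text.encode("utf-16-le")
        self._resync()


path = DualString("share", "euc_jp")
for part in ("/", "日本語", "/", "file.txt"):
    path.append(part)

# One conversion at construction plus one per modification.
print(path.conversions)
```

Building a single path from five pieces already costs five conversions
in this model.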

So, for these reasons, we decided to stick with the unix charset in use
inside Samba. I said "unix charset" and NOT utf8. Yes, you understood
correctly: we are not going to make utf8 a requirement for Samba, not
at all. We prefer to handle strings as opaque, and any code that assumes
a string is ascii, utf8, or anything else is a bug that needs to be
fixed !!!
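
As an illustration of why such assumptions are bugs, here is a sketch in
Python (not Samba code; both function names are made up). In CP932 the
second byte of a two-byte character can have the same value as an ASCII
backslash, so code that scans the raw bytes for '\' as if the string
were ASCII cuts a character in half.

```python
def naive_split(raw):
    """Buggy: treats the bytes as ASCII and splits on every 0x5C byte."""
    return raw.split(b"\\")


def charset_aware_split(raw, charset):
    """Decodes with the configured charset before looking for '\\'."""
    return [part.encode(charset) for part in raw.decode(charset).split("\\")]


path = "dir\\表.txt"
raw = path.encode("cp932")

# The two-byte CP932 form of the kanji ends in 0x5C.
print(raw)
print(len(naive_split(raw)), len(charset_aware_split(raw, "cp932")))
```

The naive version reports three components instead of two, and the
middle one is not a valid CP932 string.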

OK, now assuming that we made the right decision (I'm not saying it is
the right one, I'm saying we assume it is; if you can prove the contrary
we will be happy to listen to you and talk about the consequences), I
would like to ask one last question:

- as we use the unix charset internally AND have to convert to
UTF-16 (UCS-2) on the wire anyway, is there any conversion problem for
Japanese *nix machines? Is there something we do not understand about
Japanese->Unicode conversion that may affect Samba operations (and that
does not occur in Windows clients)?

I assume there is no problem, since UCS-2 has been proposed as the
"internal charset", so I assume conversions to and from Unicode are OK.
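
One way to check this assumption on a given system is a round-trip
test: take file names in the unix charset, convert them to the wire
form and back, and report any name that does not come back identical.
The sketch below does this in Python with made-up helper names. Note
that Python's codec tables are only a stand-in here; they may differ
from the iconv tables Samba actually uses, so results would need to be
confirmed against the real conversion library.

```python
def round_trips(raw, unix_charset, wire_charset="utf-16-le"):
    """True if raw survives unix charset -> wire -> unix charset unchanged."""
    try:
        text = raw.decode(unix_charset)
        wire = text.encode(wire_charset)
        back = wire.decode(wire_charset).encode(unix_charset)
    except UnicodeError:
        # Bytes that cannot be converted at all also count as a problem.
        return False
    return back == raw


def find_problem_names(names, unix_charset):
    """Return the names (as bytes) that fail the round trip."""
    return [raw for raw in names if not round_trips(raw, unix_charset)]


samples = ["日本語.txt".encode("euc_jp"), "テスト".encode("euc_jp")]
print(find_problem_names(samples, "euc_jp"))
```

Running this over a real directory listing, with the charset actually
configured on the server, would show whether any existing file name is
affected.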

Now let's return to the optimization problems: optimizations must never
assume we are handling utf8, and should probably be replaced with a
module-based facility so that each country can easily add its own
optimizations based on the unix charset they choose to use.

regards,
Simo.

-- 
Simo Sorce - simo.sorce at xsec.it
Xsec s.r.l. - http://www.xsec.it
via Garofalo, 39 - 20133 - Milano
mobile: +39 329 328 7702
tel. +39 02 2953 4143 - fax: +39 02 700 442 399