i18n question.

Sat Mar 6 10:42:49 GMT 2004

tridge at samba.org wrote:
>UTF-8 is less of a nightmare than any other solution. 

I agree with "less", but a fixed-length character set is still "better"
than a variable-length one.
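
To make the fixed-length versus variable-length point concrete, here is a
small sketch in Python (chosen only for illustration; this is not Samba
code). It lists how many bytes each character takes in several encodings.
The helper name `widths` and the sample string are made up for this
example.

```python
# Illustration only: bytes needed per character in several encodings.
SAMPLE = "a表"  # one ASCII letter and one kanji (U+8868)

def widths(text, encoding):
    """Return the encoded byte length of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]

if __name__ == "__main__":
    for enc in ("utf-8", "euc_jp", "cp932", "utf-16-le"):
        # Only utf-16-le gives the same width for both characters here.
        print(enc, widths(SAMPLE, enc))
```

For these two characters only the 16-bit encoding has a constant width;
UTF-8, EUC-JP and CP932 all mix 1-byte and multi-byte characters.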

By the way, our (my and probably Kenichi's and Shiro's) point is this:

(1) Samba should fully support several unix charsets (such as CP932
  and EUC-JP) for the present.

  Optimization for ASCII and UTF-8 is OK, but the code should treat
  "unix charset" as variable, not as UTF-8 only.

(2) Separate the internal charset from "unix charset"

  As I mentioned in (1), "unix charset" should be variable, but a
  "variable" charset is not good for coding or for administration.
  Defining a "fixed" internal charset will solve these issues.

(3) Suggest UCS-2 as the "internal charset"
  The internal charset should be one of the Unicode encodings.
  Currently UCS-2 is better than UTF-8, because UCS-2 is the charset
  that Windows sends.

---

I think you will agree with (1).

For (2),

|I'd also like to point out why we will not separate the internal
|charset from the filesystem charset. For each SMB request we make
|more than one system call (ranging from 2 or 3 up to hundreds in some
|cases). This "multiplier effect" is what makes converting to the
|filesystem charset at the wire boundaries so advantageous. 

From the performance point of view, separating the internal charset and
setting it to the same charset as Windows (currently UCS-2) will also
keep performance.

Current Samba 3.0:
  Windows -(Convert)-> Samba           -> Filesystem
    UCS-2              Unix charset       Unix charset

Our suggestion:
  Windows -----------> Samba -(Convert)-> Filesystem
    UCS-2              UCS-2              Unix charset

In addition, with a separate internal charset, code for string
manipulation is easier to write and debug, as Kenichi said. This is a
big merit.

And if you set "unix charset" to something outside the Unicode family
(that is, not UTF-8 or the like), string comparison will be faster and
easier. Currently "unix charset" strings are converted to UCS-2 before
each comparison and back afterwards, and this is expensive.
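
A rough sketch of the difference, in Python and not taken from Samba:
the first function holds names in the unix charset and has to decode
both before comparing; the second holds them as Unicode already. The
function names, the EUC-JP setting and the sample names are assumptions
for illustration, and `str.casefold()` only stands in for whatever case
table a real server would use.

```python
# Illustration only, not Samba code: case-insensitive name comparison.
UNIX_CHARSET = "euc_jp"  # assumed example value of "unix charset"

def equal_via_conversion(a: bytes, b: bytes) -> bool:
    """Names held in the unix charset: decode both, then compare."""
    return a.decode(UNIX_CHARSET).casefold() == b.decode(UNIX_CHARSET).casefold()

def equal_internal(a: str, b: str) -> bool:
    """Names already held as Unicode: compare with no decode step."""
    return a.casefold() == b.casefold()

if __name__ == "__main__":
    # Full-width Latin letters: case folding needs the Unicode tables.
    n1, n2 = "ＡＢＣ.txt", "ａｂｃ.TXT"
    print(equal_via_conversion(n1.encode(UNIX_CHARSET), n2.encode(UNIX_CHARSET)))
    print(equal_internal(n1, n2))
```

Both calls give the same answer; the point is only that the first one
pays for two decodes on every comparison.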

For (3),

I simply suggest using the same charset that Windows uses on the wire.
I think Windows currently still uses UCS-2 on the wire, so I suggest
UCS-2 as Samba's internal charset. UTF-16 is also welcome.

UTF-16 is not fixed length, but as you know it is easier to handle in
programs than UTF-8, and much easier than the legacy Japanese
charsets.
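
One well-known example of why the legacy Japanese charsets are hard to
handle: in CP932 the second byte of some kanji is 0x5C, the same value
as the ASCII backslash. The Python sketch below (illustration only, not
Samba code; both function names are made up) shows a byte-wise scan
finding a backslash that is not there, while a scan over 16-bit code
units does not.

```python
# Illustration only: a byte-wise scan misfires on a CP932 trail byte.
BACKSLASH = 0x5C

def has_backslash_bytes(name: str, charset: str) -> bool:
    """Naive scan: look for byte 0x5C anywhere in the encoded name."""
    return BACKSLASH in name.encode(charset)

def has_backslash_utf16(name: str) -> bool:
    """Scan the name one 16-bit code unit at a time."""
    raw = name.encode("utf-16-le")
    units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]
    return BACKSLASH in units

if __name__ == "__main__":
    name = "表.txt"  # U+8868 is the two bytes 0x95 0x5C in CP932
    print(has_backslash_bytes(name, "cp932"))  # True: false positive
    print(has_backslash_utf16(name))           # False
```

Code working on CP932 bytes must always track whether it is in the
middle of a 2-byte character; code working on 16-bit units does not
have this problem.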

Also, I suggest

(4) Always writing files in UTF-8. By doing this:

(a) It is easy to migrate from legacy encodings to UTF-8. Currently I
  think we must convert the charset of (or re-create) most files when
  changing unix charset.

(b) We can change "unix charset" from SWAT.
  Currently we must convert the encoding of smb.conf and other files
  when changing "unix charset", but there is currently no way to do
  this.

(c) The files are easier to read than UCS-2 (UTF-16), and if we must
  read log files written in a foreign language, we need not work out
  which encoding they are written in.
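
As a rough idea of the conversion that is needed today when "unix
charset" changes, here is a minimal Python sketch. It is not an existing
Samba tool; the function name `reencode` and the sample share name are
made up for illustration.

```python
# Illustration only: re-encode file contents from a legacy charset.
def reencode(data: bytes, old_charset: str, new_charset: str = "utf-8") -> bytes:
    """Decode data from old_charset and encode it again as new_charset."""
    return data.decode(old_charset).encode(new_charset)

if __name__ == "__main__":
    # A made-up smb.conf fragment as it would be stored under EUC-JP.
    legacy = "[共有]\n  comment = 表\n".encode("euc_jp")
    converted = reencode(legacy, "euc_jp")
    print(converted.decode("utf-8"))
```

If the files were always written in UTF-8, this step would not be
needed at all when "unix charset" is changed.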

So I think

Current Samba 3.0:
  Windows -(Convert)-> Samba           -> Filesystem
    UCS-2              Unix charset       Unix charset
                        |
                       files (smb.conf)
                       Unix charset

My suggestion:
  Windows -----------> Samba -(Convert)-> Filesystem
    UCS-2(UTF-16?)     UCS-2(UTF-16?)     Unix charset
                        |
                       files (smb.conf)
                       UTF-8

-----
TAKAHASHI, Motonobu (monyo)                    monyo at home.monyo.com
                                               http://www.monyo.com/