i18n question.
TAKAHASHI Motonobu
monyo at home.monyo.com
Sat Mar 6 10:42:49 GMT 2004
tridge at samba.org wrote:
>UTF-8 is less of a nightmare than any other solution.
I agree with "less", but a fixed-length character set is "better" than
a variable-length one.
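To make the fixed versus variable length distinction concrete, here is a minimal sketch in Python (used only for illustration; Samba itself is C). It prints how many bytes each character of a short sample string occupies. The sample string and the helper name widths are made up for this example, and "utf-16-le" stands in for UCS-2, which is equivalent only for characters in the Basic Multilingual Plane.

```python
# Illustration only: byte width of each character in several encodings.
# "utf-16-le" stands in for UCS-2 (identical for BMP characters).
SAMPLE = "aあ漢"  # hypothetical sample: ASCII letter, hiragana, kanji

def widths(text, encoding):
    """Return the encoded byte length of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]

for encoding in ("utf-8", "cp932", "euc_jp", "utf-16-le"):
    print(encoding, widths(SAMPLE, encoding))
```

Only the last row has the same width for every character; the other three mix one-byte and multi-byte characters.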
By the way, our point (mine, and probably Kenichi's and Shiro's) is that:
(1) Samba should fully support several unix charsets (such as CP932
and EUC-JP) for the present.
Optimization for ASCII and UTF-8 is OK, but the code should treat "unix
charset" as variable, not as UTF-8 only.
(2) Separate the internal charset from "unix charset".
As I mentioned in (1), "unix charset" should be variable, but a
"variable" charset is not good for coding and administration.
Defining a "fixed" internal charset will solve these issues.
(3) Use UCS-2 as the "internal charset".
The internal charset should be one of the Unicode encodings.
Currently UCS-2 is better than UTF-8, because UCS-2 is the charset
sent from Windows.
---
I think you will agree with (1).
For (2),
|I'd also like to point out why we will not separate the internal
|charset from the filesystem charset. For each SMB request we make
|more than one system call (ranging from 2 or 3 up to hundreds in some
|cases). This "multiplier effect" is what makes converting to the
|filesystem charset at the wire boundaries so advantageous.
From a performance point of view, separating the internal charset and
making it the same as the one Windows sends (currently UCS-2) will also
preserve performance.
Current Samba 3.0:
Windows -(Convert)-> Samba         -> Filesystem
UCS-2                Unix charset     Unix charset
Our suggestion:
Windows -----------> Samba -(Convert)-> Filesystem
UCS-2                UCS-2              Unix charset
On the other hand, with a separate internal charset, string-manipulation
code is easy to write and debug, as Kenichi said. This is a big merit.
And if "unix charset" is set to something other than one of the Unicode
encodings (such as UTF-8), string comparison will be faster and easier.
Currently "unix charset" strings are converted to and from UCS-2 before
and after each string comparison, which is expensive.
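The comparison cost mentioned above can be sketched as follows. This is a toy model in Python, not Samba code: same_name_unix and same_name_internal are hypothetical helpers, Python's str stands in for a fixed Unicode internal charset, and simple lowercasing stands in for whatever case rules the server really applies.

```python
def same_name_unix(a: bytes, b: bytes, unix_charset: str) -> bool:
    """Names held in the unix charset: convert both sides, then compare."""
    return a.decode(unix_charset).lower() == b.decode(unix_charset).lower()

def same_name_internal(a: str, b: str) -> bool:
    """Names already held in the internal charset: compare directly."""
    return a.lower() == b.lower()

# Hypothetical file names that differ only in the case of the extension.
name_upper, name_lower = "報告書.TXT", "報告書.txt"

print(same_name_unix(name_upper.encode("cp932"),
                     name_lower.encode("cp932"), "cp932"))
print(same_name_internal(name_upper, name_lower))
```

Both calls give the same answer; the first pays for two conversions per comparison, the second for none. The conversion does not disappear in the suggested layout, it moves to the filesystem boundary.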
For (3),
I simply suggest using the same charset that Windows uses on the wire.
I think Windows currently still uses UCS-2 on the wire, so I suggest
UCS-2 as Samba's internal charset. UTF-16 is also welcome.
UTF-16 is not fixed length, but as you know it is easier for programs
to handle than UTF-8, and much easier than the legacy Japanese
charsets.
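Two of the claims above can be checked with a short sketch (Python, for illustration only; the helper names are hypothetical). First, UTF-16 needs two 16-bit units (a surrogate pair) for characters outside the Basic Multilingual Plane. Second, in a legacy charset such as CP932 the second byte of a two-byte character can fall in the ASCII range, so a byte-level search for the path separator "\" can match in the middle of a character.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units that text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def false_backslash(text: str, encoding: str) -> bool:
    """True if the encoded bytes contain 0x5C although text has no backslash."""
    return "\\" not in text and b"\\" in text.encode(encoding)

print(utf16_units("表"))                # BMP character: a single unit
print(utf16_units("𠮷"))                # U+20BB7, outside the BMP: surrogate pair
print(false_backslash("表", "cp932"))   # second byte of this character is 0x5C
print(false_backslash("表", "utf-8"))   # multi-byte UTF-8 never reuses ASCII bytes
```

A UTF-16 buffer has to be scanned in 16-bit units rather than bytes, but the units of a surrogate pair lie in a reserved range and cannot be mistaken for an ASCII character.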
Also, I suggest
(4) always writing files in UTF-8. With this,
(a) it becomes easy to migrate from legacy encodings to UTF-8. Currently
I think we must convert the charset of (or re-create) most files when
changing unix charset.
(b) we can change "unix charset" from SWAT.
Currently we must convert the encoding of smb.conf and other files when
changing "unix charset", but there is currently no way to do this.
(c) UTF-8 is easier to read than UCS-2 (UTF-16), and if we must read
log files written in a foreign language, we need not think about which
encoding they are written in.
So I think
Current Samba 3.0:
Windows -(Convert)-> Samba         -> Filesystem
UCS-2                Unix charset     Unix charset
                       |
                     files (smb.conf)
                     Unix charset
My suggestion:
Windows -----------> Samba -(Convert)-> Filesystem
UCS-2(UTF-16?)       UCS-2(UTF-16?)     Unix charset
                       |
                     files (smb.conf)
                     UTF-8
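
Point (4)(a), migrating an existing file from a legacy encoding to UTF-8, might look like the following sketch. It is written in Python for illustration only; convert_to_utf8 is a hypothetical helper, not an existing Samba or SWAT tool, and it assumes the caller knows the file's current encoding.

```python
from pathlib import Path

def convert_to_utf8(path, source_encoding):
    """Rewrite the file at path in UTF-8, decoding from source_encoding.

    Raises UnicodeDecodeError, leaving the file untouched, if the
    contents are not valid in source_encoding.
    """
    target = Path(path)
    text = target.read_bytes().decode(source_encoding)
    target.write_bytes(text.encode("utf-8"))
```

For example, convert_to_utf8("smb.conf", "cp932") would need to be run once per file; after that the file contents would no longer depend on "unix charset".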
-----
TAKAHASHI, Motonobu (monyo) monyo at home.monyo.com
http://www.monyo.com/