i18n question.

Sat Mar 6 23:48:11 GMT 2004

Kenichi Okuyama said:
>>> This includes moving from old system to new system.
>>> Since file system ( and tar format ) does not care about
>>> character set, when we make backup, they do no character
>>> conversion. And many system administrator do not wish to try
>>> conversion
> Michael> There are many good tools for converting entire filesystems from one
> Michael> encoding to another. Moving forward this will be the correct solution.
>
> Not always.
>
> Unicode do not fully contain what we had in CP932 nor EUC nor
> JIS. There is 'machine dependent' characters which causes trouble.

So you're claiming you cannot map these to Unicode? If so, then you cannot
use Windows fileservers with Unicode? Do you run entirely in a SHIFT-JIS,
EUC-JP, or CP932 locale?

>>> ( I must say that Unicode do not REALLY fullfil
>>> other Japanese character encoding. It is rarely used, but
>>> most of admin do not wish to bet on their luck ).
> Michael> Why would it be a matter of "luck"? If an illegal character sequence is
> Michael> encountered the conversion utility should report an error. I don't believe
> Michael> this is as serious as you claim either. If you cannot represent your
> Michael> filenames with Unicode you have a much bigger problem.
>
> Currently running system is NOT the ONE AND ONLY thing we need to
> face.
>
> There are backups, there are old datas, which may have chance of
> such trouble-some characters.

So add conversion to your restore procedures.

I would like to know more about these "trouble-some characters". Can you
provide a link in English that describes in detail a case where converting
to Unicode fails due to inadequate charset or character encoding support?
Please provide a byte sequence in CP932 that does not map to Unicode.

> BUT! when the day comes where moving to UTF-8 is THE ONLY CHOICE, it
> is not IT people's fault to MOVE. It is not IT people's fault of
> asking marketing, sales, development and other organization to
>
> "stop using those troublesome characters"
> "you can't use that character anymore"
> "you will not be able to access to such filename"

This sounds like more of a logistical problem than a software one.

>
> The reason comes from outside world. NOW THEY HAVE ENOUGH EXCUSE TO
> FORCE SUCH RULE.

No one is forcing you to do anything. You can continue to use the systems
you have. But IMHO you SHOULD force yourselves to establish a plan and
mandate it countrywide.

> Michael> I don't agree. The optimal encoding is the filesystem encoding
> Michael> and the most favorable filesystem encoding is UTF-8.
>
> Reason?
>
>
> Well what I mean as "Reason?" is.
>
> I don't agree with you about "optimal encoding is the filesystem
> encoding", because I have seen many filesystem using CP932 which
> have problem over '\' treatments.
> # As you know '\' have special meaning for C, which time after time
> # is causing trouble, not only in Samba, but for other programs too.

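If you're iterating over each character in a multibyte sequence it will be
necessary to convert each to a wide character. UTF-8 just happens to be
designed such that you don't have to do this to search for ASCII
characters: every byte of a multibyte UTF-8 sequence is 0x80 or above, so
a byte such as 0x5C ('\') can only ever be the ASCII character itself.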
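To make the '\' point concrete, here is a minimal sketch. The helper name
find_backslash_byte and the two sample arrays are made up for illustration.
It assumes CP932 encodes U+8868 as the bytes 0x95 0x5C (UTF-8 encodes the
same character as 0xE8 0xA1 0xA8). A plain byte search is only trustworthy
on the UTF-8 data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of the first 0x5C ('\') byte in buf,
 * or -1 if there is none. It looks at raw bytes only and never decodes
 * a character. */
long find_backslash_byte(const unsigned char *buf, size_t len)
{
    const unsigned char *p = memchr(buf, 0x5C, len);
    return p ? (long)(p - buf) : -1;
}

/* U+8868 followed by one real backslash, in two encodings. */
const unsigned char utf8_sample[] = { 0xE8, 0xA1, 0xA8, 0x5C };
/* Assumption: CP932 uses 0x95 0x5C for U+8868, so the second byte of the
 * first character collides with ASCII '\'. */
const unsigned char sjis_sample[] = { 0x95, 0x5C, 0x5C };
```

On utf8_sample the search lands on offset 3, the real backslash. On
sjis_sample it lands on offset 1, which is the trail byte of the first
character, while the real backslash sits at offset 2.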

> I do agree with you that one of the most favorable filesystem
> charset is UTF-8, but FS charset is not something we can have
> control over.

UTF-8 is favorable because it is a Unicode encoding that can be used
directly with the filesystem API. My understanding was that Japanese could
be adequately represented using Unicode. I would very much like to see
specific examples where this is not true. Please provide a link so that I
can educate myself. I sincerely want to make my C projects as accessible
as I can.

> Hence, I believe:
> 1) we need internal code to be independent from filesystem charset.
>    Even if we are not using UTF-8 as filesystem charset, we do not
>    want to face those charsets running around inside smbd/nmbd.

If you must continue to use EUC-JP, SHIFT-JIS and CP932 then I recommend
creating a string abstraction that parameterizes all of the string
operations that need to consider the charset or encoding. You could then
link against a different set of routines to get different character
behavior without changing any of the code that calls them.

For example create the common denominator routines and make the necessary
changes throughout the code.

typedef unsigned char tchar;

size_t utf8_size(const tchar *src, const tchar *slim);
size_t utf8_copy(const tchar *src, const tchar *slim,
                tchar *dst, tchar *dlim, int n);
int utf8_next(const tchar *src, const tchar *slim);
int utf8_compare_caseless(const tchar *s1, const tchar *s1lim,
                         const tchar *s2, const tchar *s2lim, int n);
int utf8_path_canon(const tchar *src, const tchar *slim,
                tchar *dst, tchar *dlim, int n);
int utf8_match_wild_caseless
utf8_tombs
...

#define str_size utf8_size
#define str_copy utf8_copy
#define str_next utf8_next
...

Start with just a stub implementation that supports what Samba has now.
This can run just as quickly as the current code. If you can prove that, you
might convince someone to accept the patch. Then you have a huge amount of
flexibility with different implementations of the abstraction. One of the
key functions will be str_next. For an 8-bit encoding this could just be a
macro like:

#define str_next(s) (*(s))

For UTF-8 it would use mbtowc. For SHIFT-JIS you would have to perform the
necessary evaluation to return the next complete character (e.g. '\').
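As a rough sketch of what the SHIFT-JIS side might look like: the names
sjis_char_len and sjis_find_backslash are hypothetical, and the signatures
are my own rather than the str_next prototype above. It assumes the usual
lead byte ranges for double-byte characters (0x81-0x9F and 0xE0-0xFC) and
does not validate trail bytes.

```c
#include <stddef.h>

typedef unsigned char tchar;

/* Hypothetical: byte length of the character starting at src in
 * SHIFT-JIS/CP932. Returns 0 at the limit or if a double-byte character
 * is cut off by slim. Trail bytes are not validated. */
size_t sjis_char_len(const tchar *src, const tchar *slim)
{
    if (src >= slim)
        return 0;
    /* Assumed lead byte ranges for double-byte characters. */
    if ((*src >= 0x81 && *src <= 0x9F) || (*src >= 0xE0 && *src <= 0xFC))
        return (slim - src >= 2) ? 2 : 0;
    return 1; /* ASCII or single-byte katakana */
}

/* Hypothetical: find the first real '\' by stepping one whole character
 * at a time, so a 0x5C trail byte is never mistaken for a backslash. */
const tchar *sjis_find_backslash(const tchar *src, const tchar *slim)
{
    size_t n;

    while ((n = sjis_char_len(src, slim)) != 0) {
        if (n == 1 && *src == 0x5C)
            return src;
        src += n;
    }
    return NULL;
}
```

Given the bytes 0x95 0x5C 0x5C, this skips the first two bytes as one
character and returns a pointer to the third byte. The same stepping logic
is what a str_next for SHIFT-JIS would have to do internally.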

Just a thought.

> P.S. I believe Mike and I are looking at same Heaven's gate.
>      Only, Mike believes gate is right near by.
>      I believe what we have between gate and us is river.
>      So Mike says to run, I says to stop and build bridge.

You need to get a nationalized initiative to ferry your people over to the
side of the river that everyone else is on. I understand you're not happy
with the situation but Unicode is the future for the vast majority of us.

Mike