i18n question.

Michael B Allen mba2000 at ioplex.com
Mon Mar 8 00:19:05 GMT 2004


TAKAHASHI Motonobu said:
>>I would like to know more about these "trouble-some characters". Can you
>>provide a link in English that describes in detail a case where converting
>>to Unicode fails due to inadequate charset or character encoding support?
>>Please provide a byte sequence in CP932 that does not map to Unicode.
>
> |Read CJKV... That's almost only information source I can give you in
> |English ( yes English is the biggest barrier you're giving me ).
>
> FYI. these URLs will probably help you:
>
> http://www.debian.or.jp/~kubota/unicode-symbols-map2.html.en
> http://www.miraclelinux.com/english/technet/samba30/index.html
> http://www.miraclelinux.com/english/technet/samba30/iconv_issues.html

This was very informative. So even though JIS X 0208 is the charset behind both EUC-JP and
Shift-JIS, the result depends on which mapping table is used: the Unicode-supplied tables and
a vendor table such as CP932 can give different Unicode values for the same character.

For example, if you have a legacy filename like ‘my2¢’, the ‘¢’ character might be encoded as
0x81 0x91 in Shift-JIS on disk. Converting this with a filename conversion tool that uses the
Unicode-supplied mapping table (e.g. a mainstream glibc tool) would map this character to
U+00A2 (which in UTF-8 on disk would be 0xC2 0xA2). But after converting the filename, an MS
client will fail to find the file because MS maps ‘¢’ using its CP932 mapping to get U+FFE0.
So U+FFE0 is received on the wire but U+00A2 is read from disk, resulting in a mismatch. Is
this the sort of problem you’re describing?
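
To make the mismatch concrete, here is a rough sketch in Python. It assumes the Shift-JIS byte
pair 0x81 0x91 for the cent sign (JIS X 0208 row 1, cell 81) and uses Python's built-in
"shift_jis" and "cp932" codecs as stand-ins for the Unicode-supplied table and the MS table.
Neither glibc nor Samba is involved; this only illustrates the two mappings.

```python
# Legacy on-disk name "my2¢" in Shift-JIS; 0x81 0x91 is the cent sign.
raw = b"my2\x81\x91"

# Stand-in for a converter using the Unicode-supplied JIS X 0208 table.
from_disk = raw.decode("shift_jis")
# Stand-in for what an MS client produces via its CP932 table.
from_wire = raw.decode("cp932")


def utf8_hex(s):
    """Return the UTF-8 bytes of s as space-separated hex."""
    return " ".join(f"{b:02X}" for b in s.encode("utf-8"))


print(f"shift_jis table: U+{ord(from_disk[-1]):04X}  UTF-8: {utf8_hex(from_disk[-1])}")
print(f"cp932 table:     U+{ord(from_wire[-1]):04X}  UTF-8: {utf8_hex(from_wire[-1])}")
print("names match:", from_disk == from_wire)
```

The same two bytes come out as two different code points, so a lookup by the name the client
sends would not find the name stored on disk.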

However, provided the CP932 mappings were "corrected" (e.g. using a libiconv-1.9.1 plugin) then
the utility would correctly convert ‘¢’ to U+FFE0. Then you have Unicode on the wire
(UCS-2LE/UTF-16LE), Unicode internally (UTF-8), and Unicode on disk (UTF-8). So at least in
this case there would be no mapping problems. It's only when you want to use a legacy encoding
on disk for filenames that the internal string handling runs into problems (e.g. a 2nd byte in
the ASCII range violates one of The Sanity Rules).
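
As an illustration of the 2nd-byte issue, here is a small Python sketch. It assumes the
well-known case of the kanji ‘表’ (U+8868), whose Shift-JIS/CP932 trail byte is 0x5C, the same
value as an ASCII backslash. The filename and the naive split are made up for the example and
are not taken from the Samba code.

```python
# Hypothetical filename; in CP932 the kanji is the byte pair 0x95 0x5C.
name = "表.txt"
legacy = name.encode("cp932")
utf8 = name.encode("utf-8")

# Byte-oriented code that treats 0x5C as a path separator cuts the
# character in half when the name is stored in the legacy encoding.
legacy_parts = legacy.split(b"\\")

# In UTF-8 every byte of a multi-byte character is >= 0x80, so the same
# split leaves the name in one piece.
utf8_parts = utf8.split(b"\\")

print("cp932 bytes:", legacy)
print("split on 0x5C (cp932):", legacy_parts)
print("split on 0x5C (utf-8):", utf8_parts)
```

With the legacy encoding the split yields a stray lead byte and ".txt", which is the kind of
breakage that goes away once the internal and on-disk encoding is UTF-8.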

Mike



More information about the samba-technical mailing list