Internationalisation issues in Samba

Mon Feb 16 05:36:30 GMT 2004

Hi list,

I've spent the last couple of months verifying whether Samba 3 operates
properly with multibyte characters. While most of its functions worked
as intended, there were some cases where functionality was broken.

I've jotted down the characteristics of MB chars and how to deal with
them, so that those who are not familiar with MB programming can refer
to it in future. Please have a look at it, and if you think it is
acceptable, would you consider taking it into CVS as a part of the
Samba developers guide? Thank you.

Regards,

--
Shiro Yamada
shiro at miraclelinux.com

-------------- next part --------------
              Internationalisation Issues in Samba
              ====================================

Shiro Yamada <shiro at miraclelinux.com>

Introduction
------------

There is a significant change from Samba 2.2.x to Samba 3.0.x with regard
to native language support. Samba 2.2.x provides support for a single
language locale through a mechanism called codepages. The newer Samba 3.0.x
supports multiple locales with a single binary.

The basic idea for supporting multiple languages is to use an external library
function called iconv, and let it do all the hard work. Samba just passes in a
string to be converted, and receives the converted string upon completion.

However, while iconv is a convenient way to facilitate support for many
languages at once, there are a few concerns developers have to take care of.
If Samba does not handle MB strings in the right manner, the result can still
be broken, no matter how correctly the string is converted via iconv.

Two main sources of errors may be:
- Not allocating enough buffer space
- Not considering that MB characters may contain ascii code equivalents

This document describes the characteristics of MB encodings and the problems
associated with them.

Resources to be considered
--------------------------

Under Microsoft Windows, MB characters can be used almost everywhere.
If the ultimate objective of Samba is to provide functionality equivalent
to that of Windows, Samba should assume these fields contain MB characters.
Windows is capable of dealing with MB characters in the following resources.

+ File names
+ Directory names
+ (Ordinary) Share names
+ Printer Share names
+ User names
+ Group names
+ NetBIOS names
+ Workgroup names

There are plenty more not listed above, but these are probably the most
important elements. Take special care if your code is dealing with these
components.

Different byte sizes among encodings
------------------------------------

As you may or may not be aware, there are several encodings per language
in the world of MB characters. Samba is flexible enough to provide support
for many encodings. It is capable of converting messages to a Windows-compatible
encoding (dos charset) when talking to Windows clients.

It is worth noting, however, that a character in one encoding may occupy a
different number of bytes from the same character in another encoding. A good
example is conversion to UTF-8, where UTF-8 represents a single character with
3 bytes for some languages. If the source string is made of 2 byte characters,
then the resulting string after the conversion may be longer than the original.

+-------------------------------------------------------------------------+
|                                                                         |
|         :                                                               |
|  size = convert_string(CH_UNIX, CH_UTF8, in_buf, in_bytes,              |
|                        out_buf, sizeof(out_buf) - 1);                   |
|  out_buf[size] = '\0';                                                  |
|         :                                                               |
|                                                                         |
+-------------------------------------------------------------------------+

In the above example, the value of `size' or strlen(out_buf) is not
necessarily the same as strlen(in_buf). You need to allocate extra room for
the converted string to allow for the expansion.
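
The expansion can be reproduced outside Samba. The following is a minimal
sketch, not Samba code: ucs2_to_utf8() and utf8_encode() are hypothetical
helpers written for this illustration, and UCS-2 input is assumed only because
each of its characters is exactly 2 bytes. The destination is expected to be
sized for the worst case of 3 UTF-8 bytes per UCS-2 unit plus a terminator,
and the helper reports failure instead of writing past the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS-2 code unit as UTF-8; returns the bytes written (1 to 3).
 * Surrogate values are not treated specially in this sketch. */
size_t utf8_encode(uint16_t c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}

/* Convert n UCS-2 units to NUL-terminated UTF-8 in out (out_size bytes).
 * Returns the length without the NUL, or (size_t)-1 if out is too small. */
size_t ucs2_to_utf8(const uint16_t *in, size_t n,
                    unsigned char *out, size_t out_size)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        unsigned char tmp[3];
        size_t len = utf8_encode(in[i], tmp);

        /* Keep one byte in reserve for the terminator. */
        if (len > out_size - 1 - used)
            return (size_t)-1;
        for (size_t j = 0; j < len; j++)
            out[used++] = tmp[j];
    }
    out[used] = '\0';
    return used;
}
```

For example, the two UCS-2 units 0x3042 0x3044 occupy 4 bytes but become
6 bytes of UTF-8, so a destination of 3 * n + 1 bytes is always sufficient
here, while one sized like the input is not.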

Different byte size for Upper and Lower case MB chars
-----------------------------------------------------

Some MB characters have upper and lower case forms, and the two forms should
be treated as the same character when case is ignored. Fortunately, Samba can
identify that these two characters represent the same meaning. In fact, Samba
can convert between the two with the functions strlower_m() and strupper_m().

However, not many people are aware that there are some encodings in which an
uppercase MB char differs in size from its lowercase equivalent. Therefore,
if you don't allocate enough space for conversions, the resulting string may
overflow the allocated buffer after strupper_m()/strlower_m() operations.

An example of characters whose size differs between upper and lower case is
the ROMAN NUMERALs in eucJP-ms.

  SMALL ROMAN NUMERAL ONE: 3 bytes
  ROMAN NUMERAL ONE:       2 bytes

Characters containing ascii code equivalent
-------------------------------------------

Let me ask you this simple question first: how do you differentiate a MB
character from two ordinary ascii characters? The answer is that a MB character
starts with a code which no ascii character has. If you refer to `man ascii',
you will see the list of ascii characters with their corresponding code points.

However, things are quite different for the second byte onwards. If an
encoding is designed so that no ascii code (a code point used by an ascii
character) appears in those bytes, then there is no problem. Unfortunately,
not all encodings are like that. Some of their characters contain ascii codes.

Below is the list of CJK encodings, with the ascii codes they contain. This
research covers the CJK encodings only, but your contribution is welcome.

+ CP932 (Japanese)

  Some CP932 characters contain ascii code equivalents in their 2nd byte,
  ranging [40,7E].

    [40,7E] = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

+ UHC (Korean)

  Some UHC characters contain ascii code equivalents in their 2nd byte,
  ranging [41,5A], and [61,7A].

    [41,5A] = ABCDEFGHIJKLMNOPQRSTUVWXYZ
    [61,7A] = abcdefghijklmnopqrstuvwxyz

+ GB18030 (Simplified Chinese)

  Some two-byte GB18030 characters contain ascii code equivalents in their
  2nd byte, ranging [40,7E]. Also, four-byte characters contain ascii codes
  [30,39] in their 2nd and 4th bytes.

    [40,7E] = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    [30,39] = 0123456789

+ Big5 (Traditional Chinese)

  Some Big5 characters contain ascii code equivalents in their 2nd byte,
  ranging [40,7E].

    [40,7E] = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
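
Because a trail byte can look exactly like an ascii character, the only
reliable way to tell what a given byte is, is to walk the string from its
start. The sketch below does this for CP932 only. cp932_is_lead() and
cp932_classify() are hypothetical names made up for this illustration and are
not Samba functions; the lead byte ranges 0x81-0x9F and 0xE0-0xFC are those of
two-byte CP932 characters.

```c
#include <assert.h>
#include <stddef.h>

enum cp932_kind { CP932_SINGLE, CP932_LEAD, CP932_TRAIL };

/* Lead bytes of two-byte CP932 characters: 0x81-0x9F and 0xE0-0xFC. */
int cp932_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Classify the byte at offset pos of the len-byte string s.
 * The walk starts at offset 0 because a byte cannot be classified
 * by looking at its own value alone. */
enum cp932_kind cp932_classify(const unsigned char *s, size_t len, size_t pos)
{
    size_t i = 0;

    assert(s != NULL && pos < len);
    while (i < len) {
        if (cp932_is_lead(s[i]) && i + 1 < len) {
            if (pos == i)
                return CP932_LEAD;
            if (pos == i + 1)
                return CP932_TRAIL;
            i += 2;             /* skip the whole two-byte character */
        } else {
            if (pos == i)
                return CP932_SINGLE;
            i += 1;
        }
    }
    return CP932_SINGLE;        /* not reached when pos < len */
}
```

With the bytes <0x5C><0x83><0x5C>, offset 0 is a real ascii `\', while
offset 2 holds the same value but is the trail byte of a two-byte character.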

Problems with MB characters containing ascii codes
--------------------------------------------------

From the result above, two questions should be apparent:

1. Whether a special ascii character (such as '\') can be identified as a
   part of a MB character or not

Suppose there exists a MB character <0xXX><0x5C>, where each <> represents
a single byte and <0xXX> represents an arbitrary code.
<0x5C> is equivalent to the code point of ascii `\'.

Then, you are provided with a string "\<0xXX><0x5C>", i.e. "<0x5C><0xXX><0x5C>".
What you would like is to replace the former `\' with '/', but leave the latter
`\' unchanged (for your reference, the code point for '/' is <0x2F>).
The resulting value we should expect is <0x2F><0xXX><0x5C>.

A bad example would be to call the all_string_sub() function and pass the test
string "<0x5C><0xXX><0x5C>" directly to it. all_string_sub() takes three
arguments: the first is the string data to be manipulated, the second is the
matching pattern to be replaced, and the third is the replacement string.

+-------------------------------------------------------------------------+
|                                                                         |
|  char *orig = "\\<0xXX><0x5c>\0";   /* <0x5C><0xXX><0x5C> in hex */     |
|  char buff[128];                                                        |
|                                                                         |
|  memset(buff, '\0', sizeof(buff));                                      |
|  safe_strcpy(buff, orig, sizeof(buff)-1);                               |
|  all_string_sub(buff, "\\", "/");                                       |
|                                                                         |
+-------------------------------------------------------------------------+

After calling all_string_sub(), the content of "buff" will be <0x2F><0xXX><0x2F>,
which is different from our original intention. The MB character is malformed
and Samba is no longer able to use the correct string data.

The proper way to solve this is to convert the string to UCS2 first, so that
there is no risk of hitting an ascii code equivalent, then perform the
substitution on the UCS2 string, and finally convert it back to the original
encoding.

+-------------------------------------------------------------------------+
|                                                                         |
|  char *orig = "\\<0xXX><0x5c>\0";   /* <0x5C><0xXX><0x5C> in hex */     |
|  char buff[128];                                                        |
|  smb_ucs2_t *before, *after;                                            |
|                                                                         |
|  memset(buff, '\0', sizeof(buff));                                      |
|  safe_strcpy(buff, orig, sizeof(buff)-1);                               |
|                                                                         |
|  size = push_ucs2_allocate(&before, buff);                              |
|  if (size < 0) {                                                        |
|       ... ERROR HANDLING ...                                            |
|  }                                                                      |
|                                                                         |
|  after = all_string_sub_wa(before, "\\", "/");                          |
|                                                                         |
|  memset(buff, '\0', sizeof(buff));                                      |
|  size = pull_ucs2(NULL, buff, after, sizeof(buff), -1, STR_TERMINATE);  |
|  if (size < 0) {                                                        |
|       ... ERROR HANDLING ...                                            |
|  }                                                                      |
|                                                                         |
|  SAFE_FREE(before);                                                     |
|  SAFE_FREE(after);                                                      |
|                                                                         |
+-------------------------------------------------------------------------+

Because UCS2 is free from the problem of having ascii code equivalents, we
can safely replace a string with another string without having to worry whether
the string to be converted is a valid one or not. After the completion of
the substitution, you need to convert it back to the original encoding.
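
The difference can be seen without any Samba code. In the sketch below,
replace_byte() stands in for a naive byte-wise substitution and replace_unit()
for the same operation on UCS2 code units; both are hypothetical helpers made
up for this illustration. The test data assumes CP932, where <0x83><0x5C> is
a single katakana character whose UCS2 value is 0x30BD; the conversion
between the two forms is hard-coded here rather than performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-wise substitution: every byte equal to `from' is replaced,
 * whether or not it is part of a MB character. */
void replace_byte(unsigned char *s, size_t len,
                  unsigned char from, unsigned char to)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        if (s[i] == from)
            s[i] = to;
    }
}

/* The same substitution on UCS2 code units. Each character is one
 * unit, so a match can never be a fragment of another character. */
void replace_unit(uint16_t *s, size_t len, uint16_t from, uint16_t to)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        if (s[i] == from)
            s[i] = to;
    }
}
```

Applied to the CP932 bytes <0x5C><0x83><0x5C>, replace_byte() yields
<0x2F><0x83><0x2F> and destroys the katakana character. Applied to the UCS2
units 0x005C 0x30BD, replace_unit() yields 0x002F 0x30BD and leaves it intact.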

To summarise, always be sure which encoding you are working with (i.e. convert
a string into unicode when necessary).

2. Whether two MB characters, one with an upper case ascii code in its
   second byte, and the other with a lower case ascii code in its second
   byte, can be identified as two independent characters or not.

Suppose <0xXX> is an arbitrary hex code.
If there exists a MB character <0xXX><0x53> (<0x53> represents `S' in ascii)
and another MB character <0xXX><0x73> (<0x73> represents `s' in ascii), these
two characters should be identified as two different characters.

This may pose problems for string comparison, where the conventional way of
performing a case-insensitive comparison is to convert both strings to either
upper case or lower case first and then compare the two. Because the functions
toupper() and tolower() take one byte at a time, <0xXX><0x53> and <0xXX><0x73>
could be converted into the same byte sequence, hence the string comparison
succeeds even though they represent two independent MB characters.

This implies that toupper() and tolower() should not be applied to a "unix
charset" string directly; instead, the string should be converted to UCS2 and
then strupper_w() or strlower_w() should be used.
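
A minimal sketch of the false match follows. It is not Samba code:
equal_fold_bytes(), fold_unit() and equal_fold_ucs2() are hypothetical helpers,
and fold_unit() only folds ascii letters, as a simplified stand-in for a real
UCS2 case mapping. The test data assumes CP932, where <0x83><0x53> and
<0x83><0x73> are two different katakana characters (UCS2 0x30B4 and 0x30D4),
and assumes the default "C" locale for toupper().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Naive comparison: fold every byte with toupper(), then compare. */
int equal_fold_bytes(const unsigned char *a, const unsigned char *b,
                     size_t len)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < len; i++) {
        if (toupper(a[i]) != toupper(b[i]))
            return 0;
    }
    return 1;
}

/* Simplified case folding of one UCS2 unit: ascii letters only. */
uint16_t fold_unit(uint16_t c)
{
    return (c >= 'a' && c <= 'z') ? (uint16_t)(c - 'a' + 'A') : c;
}

/* Comparison on UCS2 units: folding acts on whole characters only. */
int equal_fold_ucs2(const uint16_t *a, const uint16_t *b, size_t len)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < len; i++) {
        if (fold_unit(a[i]) != fold_unit(b[i]))
            return 0;
    }
    return 1;
}
```

equal_fold_bytes() reports <0x83><0x53> and <0x83><0x73> as equal, because
the trail byte 0x73 is folded to 0x53. equal_fold_ucs2() keeps 0x30B4 and
0x30D4 distinct, while still treating `a' and `A' as equal.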

This functionality should be encapsulated in libraries, and most of the
time calling code should just use these library functions as its
interface. Anyway, it is worth emphasising again that you should
NOT use toupper() or tolower().