Internationalisation issues in Samba

Michael B Allen mba2000 at ioplex.com
Tue Feb 17 00:29:08 GMT 2004


Shiro Yamada said:
> Two main sources of errors may be:
> - Not allocating enough buffer space
> - Not considering MB characters embody ascii code equivalents

Both can be avoided by following one general rule about
multi-byte character encodings:

	It is not correct to calculate the number of characters in the
	string by subtracting a pointer to the beginning of the string
	from a pointer to the end.
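
To make the rule concrete, here is a minimal sketch that assumes the
string is valid UTF-8. The helper name is made up for this example. It
counts characters by skipping continuation bytes, so its result differs
from the pointer difference as soon as a non-ASCII character appears.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count characters in a NUL-terminated string,
 * ASSUMING it is valid UTF-8. Continuation bytes have the bit
 * pattern 10xxxxxx and do not start a new character. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;        /* this byte starts a new character */
    }
    return n;
}
```

For the 6-byte string "a\xC3\xA9\xE6\x97\xA5" strlen reports 6, while
utf8_char_count reports 3.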

Regarding the first item, iconv will generally take care of this. The
catch is you just have to provide a buffer that is large enough. The size
in bytes of the resulting string is the distance the outbuf parameter
to iconv was advanced. If you need to determine the precise size of the
resulting string in advance (e.g. to malloc minimal space for the string)
then using iconv is less than ideal.
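
As a rough sketch of the iconv point, something like the following
returns the size of the result as the distance outbuf was advanced.
The wrapper name and its error convention are my own assumptions, not
anything in Samba.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: convert inlen bytes of `in` from encoding
 * `from` to encoding `to`, writing at most outcap bytes to `out`.
 * Returns the number of bytes written, or (size_t)-1 on failure
 * (unknown encoding, invalid input, or output buffer too small). */
static size_t convert_bytes(const char *from, const char *to,
                            const char *in, size_t inlen,
                            char *out, size_t outcap)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() wants char **, input is only read */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    size_t rc;

    assert(from != NULL && to != NULL && in != NULL && out != NULL);

    cd = iconv_open(to, from);  /* note the order: target first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    /* The size of the result is how far iconv advanced outp. */
    return (size_t)(outp - out);
}
```

The caller still has to guess outcap up front, which is exactly the
catch: the true size is only known after the conversion has run.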

Regarding the second item, when performing a literal iteration over
a string it is conceptually superior to work on characters and not
bytes. For this reason it is usually necessary to convert each multibyte
sequence to a Unicode character before examining it in an iteration. IMO
the best way to handle this problem is to just run the process performing
the string operation in the LC_CTYPE locale of the encoding used and
convert each character to wchar_t directly within the loop performing
the logical operation. This assumes that wchar_t is Unicode of course.
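
To show why bytes are the wrong unit, here is a small sketch that
ASSUMES the string is Shift_JIS, where the second byte of a two-byte
character can be 0x5C, the same value as an ASCII backslash. This is a
hand-rolled scan for illustration only, not the locale-based approach
described above, and the helper names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: Shift_JIS, where a lead byte in 0x81..0x9F or
 * 0xE0..0xFC is followed by exactly one trail byte. */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: replace each backslash *character* with '/',
 * stepping over whole characters. Returns the number replaced. */
static size_t sjis_backslash_to_slash(char *s)
{
    size_t replaced = 0;

    assert(s != NULL);
    while (*s != '\0') {
        if (sjis_is_lead_byte((unsigned char)*s)) {
            if (s[1] == '\0')
                break;          /* truncated character: stop here */
            s += 2;             /* skip lead and trail byte together */
        } else {
            if (*s == '\\') {
                *s = '/';
                replaced++;
            }
            s++;
        }
    }
    return replaced;
}
```

Given the bytes 5C 83 5C 5C, a plain byte-wise replace would rewrite
three bytes and corrupt the two-byte character 83 5C. The scan above
rewrites only the first and last byte.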

> encoding. A good example is conversion to UTF-8, where UTF-8 composes a single
> character with 3 byte for some language. If the source string is constructed
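Technically a UTF-8 sequence can be up to 6 bytes under the original
definition (RFC 2279), although sequences that long are highly unlikely.
RFC 3629 restricts UTF-8 to at most 4 bytes.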

> Different byte size for Upper and Lower case MB chars
> -----------------------------------------------------
>
> Some MB characters are case-sensitive, and they should be treated as the
> same characters. Fortunately Samba can identify that these two characters
> represent the same meaning. In fact, Samba can transpose two with functions
> strlower_m() and strupper_m().
>
> However, not many people are aware that, there are some encodings whose
> uppercase MB char differs in size with its lowercase equivalent. Therefore,
> if you don't allocate enough space for conversions, the resulting string may
> overflow the buffer allocated after strupper_m()/strlower_m() operations.
>
> An example of characters having different size between upper and lower
> cases is, ROMAN NUMERALs in eucJP-ms.
>
>   SMALL ROMAN NUMERAL ONE: 3 bytes
>   ROMAN NUMERAL ONE:       2 bytes

Is this really common? It's painful to add the overhead of a corner
case such as this. Also, it is important to point out that UTF-8 does
not suffer from this phenomenon.

> +-------------------------------------------------------------------------+
> |                                                                         |
> |  char *orig = "\\<0xXX><0x5c>\0";   /* <0x5C><0xXX><0x5C> in hex */     |
> |  char buff[128];                                                        |

You have to use unsigned char[] here. A C implementation is permitted
to fault if you do not.

> The proper way to solve this is to convert the string to UCS2 first so there
> is no risk of having ascii code equivalent, and then perform substitution
> on the basis of UCS2, and finally convert it back to the original string.
>
> +-------------------------------------------------------------------------+
> |                                                                         |
> |  char *orig = "\\<0xXX><0x5c>\0";   /* <0x5C><0xXX><0x5C> in hex */     |
> |  char buff[128];                                                        |
> |  smb_ucs2_t *before, *after;                                            |
> |                                                                         |
> |  memset(buff, '\0', sizeof(buff));                                      |
> |  safe_strcpy(buff, orig, strlen(orig)+1);                               |
> |                                                                         |
> |  size = push_ucs2_allocate(&before, buff);                              |
> |  if (size < 0) {                                                        |
> |       ... ERROR HANDLING ...                                            |
> |  }                                                                      |
> |                                                                         |
> |  after = all_string_sub_wa(before, "\\", "/");                          |
> |                                                                         |
> |  memset(buff, '\0', sizeof(buff));                                      |
> |  size = pull_ucs2(NULL, buff, after, sizeof(buff), -1, STR_TERMINATE);  |
> |  if (size < 0) {                                                        |
> |       ... ERROR HANDLING ...                                            |
> |  }                                                                      |
> |                                                                         |
> |  SAFE_FREE(before);                                                     |
> |  SAFE_FREE(after);                                                      |
> |                                                                         |
> +-------------------------------------------------------------------------+

It would be considerably faster to run the process currently handling the
request in the negotiated locale and use mbrtowc in a loop to convert each
character to wchar_t to perform the '\\' test and '/' replacement. That
way you don't have to allocate an extra buffer or convert the string back.
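
Such a loop might look roughly like the sketch below. It assumes the
caller has already selected the right encoding with
setlocale(LC_CTYPE, ...) and that wchar_t holds Unicode. The function
name and return convention are made up for illustration.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: replace backslash characters with '/' in place,
 * decoding one character at a time with mbrtowc(). The caller is
 * ASSUMED to have set LC_CTYPE to the string's encoding beforehand.
 * Returns the number of replacements, or -1 if the string contains
 * an invalid or truncated multibyte sequence. */
static long mb_backslash_to_slash(char *s)
{
    mbstate_t st;
    size_t left;
    long replaced = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof(st));         /* initial conversion state */
    left = strlen(s);

    while (left > 0) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, left, &st);

        if (len == (size_t)-1 || len == (size_t)-2)
            return -1;                  /* invalid or incomplete character */
        if (len == 0)
            break;                      /* decoded a NUL character */
        /* Only a one-byte backslash can be overwritten in place. */
        if (wc == L'\\' && len == 1) {
            *s = '/';
            replaced++;
        }
        s += len;
        left -= len;
    }
    return replaced;
}
```

Because the byte 0x5C inside a multibyte character is consumed as part
of that character, it is never compared against the backslash on its own.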

Again, this is really all about properly iterating over *characters*
and not bytes.

Another problem with multibyte encodings is that many of the ctype
functions do not work as expected. You mentioned toupper and tolower, but
functions like isprint also operate on single bytes, so in practice they
only work with ASCII. Also consider that, with glibc at least, strcmp
cannot be used for locale-aware comparison either -- you must use strcoll
instead. There are several other such cases.
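
As a small sketch of the strcoll point, a qsort comparator could look
like this. The comparator name is made up, and it assumes the process
has already selected the desired locale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator over an array of `const char *`. strcoll()
 * compares according to the LC_COLLATE category of the current
 * locale; in the default "C" locale it is expected to order
 * strings the same way strcmp() does. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    assert(sa != NULL && sb != NULL);
    return strcoll(*sa, *sb);
}
```

It would be used as qsort(names, count, sizeof(names[0]),
compare_collated) on an array of string pointers.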

Mike



