Internationalisation issues in Samba

Mon Feb 16 10:36:16 GMT 2004

On Mon, 2004-02-16 at 16:36, Shiro Yamada wrote:
> Hi list,
> 
> I've been working on verifying whether Samba 3 operates properly with
> multibyte characters or not, for the last couple of months. While most
> of its functions worked as intended, there were some cases where the
> functionality was broken.
> 
> I've jotted down the characteristics of MB chars and how to deal with
> them, so that those who are not familiar with MB programming can refer
> to it in future. Please have a look at it, and if you think
> it is acceptable, would you consider taking it into CVS as a part of
> Samba developers guide? Thank you.

> Two main sources of errors may be:
> - Not allocating enough buffer space
> - Not considering that MB characters may embody ascii code equivalents

> From the above example, the value of `size' or strlen(out_buf) is not
> necessarily the same as strlen(in_buf). You need to allocate extra room for
> the converted string to allow for the expansion.

We use 'convert_string_allocate' where we have non-fixed buffers to
convert into.  Where we use fixed buffers (such as the fstring/pstring)
then this is always a limit, even without multibyte...
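
To make the expansion concrete, here is a minimal sketch. It is not
Samba's convert_string_allocate(); latin1_to_utf8_alloc() is a
hypothetical helper that converts ISO-8859-1 to UTF-8, chosen only
because the worst case is easy to state: every input byte may become
two output bytes, so the buffer is sized from that bound rather than
from strlen(in).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Convert a NUL-terminated ISO-8859-1 string to newly allocated UTF-8.
 * Each input byte >= 0x80 becomes two output bytes, so the worst case
 * is 2 * strlen(in) + 1 bytes. The caller frees the result. */
char *latin1_to_utf8_alloc(const char *in)
{
    size_t in_len = strlen(in);
    char *out = malloc(2 * in_len + 1);
    char *p = out;
    size_t i;

    if (out == NULL)
        return NULL;

    for (i = 0; i < in_len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            *p++ = (char)c;                     /* ascii is copied as-is */
        } else {
            *p++ = (char)(0xC0 | (c >> 6));     /* lead byte:  110000xx */
            *p++ = (char)(0x80 | (c & 0x3F));   /* trail byte: 10xxxxxx */
        }
    }
    *p = '\0';
    return out;
}
```

With the 4-byte input "caf<0xE9>" the result is the 5-byte string
"caf<0xC3><0xA9>", so a destination sized from strlen(in) would be one
byte short.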

> Different byte size for Upper and Lower case MB chars
> -----------------------------------------------------
> 
> Some MB characters have upper and lower case forms, and the two forms should
> be treated as the same character. Fortunately Samba can identify that the two
> represent the same meaning. In fact, Samba can convert between them with the
> functions strlower_m() and strupper_m().
> 
> However, not many people are aware that there are some encodings in which an
> uppercase MB char differs in size from its lowercase equivalent. Therefore,
> if you don't allocate enough space for conversions, the resulting string may
> overflow the allocated buffer after strupper_m()/strlower_m() operations.
> 
> An example of characters whose upper and lower case forms differ in size is
> the ROMAN NUMERALs in eucJP-ms.
> 
>   SMALL ROMAN NUMERAL ONE: 3 bytes
>   ROMAN NUMERAL ONE:       2 bytes

For this reason, we have strdup_upper() and strdup_lower().
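
A rough sketch of the allocate-and-return idea follows. It is not the
Samba implementation: strdup_upper_sketch() is a made-up name, and the
mapping table is reduced to ascii plus a single hand-picked UTF-8 pair,
U+0250 (bytes C9 90) to U+2C6F (bytes E2 B1 AF), where the upper case
form is one byte longer. I use that pair rather than the eucJP-ms one
quoted above only because I am sure of its byte values.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Illustration only: upper-case a UTF-8 string into a new allocation.
 * Handles ascii plus one pair, C9 90 -> E2 B1 AF, which grows by one
 * byte. The caller frees the result. */
char *strdup_upper_sketch(const char *in)
{
    size_t in_len = strlen(in);
    /* Worst case for this toy mapping: every 2-byte pair becomes 3 bytes. */
    char *out = malloc(in_len + in_len / 2 + 1);
    char *p = out;
    size_t i = 0;

    if (out == NULL)
        return NULL;

    while (i < in_len) {
        unsigned char c = (unsigned char)in[i];
        /* in[i + 1] is at worst the terminating NUL, so this read is safe */
        if (c == 0xC9 && (unsigned char)in[i + 1] == 0x90) {
            memcpy(p, "\xE2\xB1\xAF", 3);
            p += 3;
            i += 2;
        } else {
            *p++ = (c < 0x80) ? (char)toupper(c) : (char)c;
            i++;
        }
    }
    *p = '\0';
    return out;
}
```

The 3-byte input "a<0xC9><0x90>" comes back as the 4-byte string
"A<0xE2><0xB1><0xAF>", which is exactly the case where an in-place
conversion would write past the end of the original buffer.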

> 
> Characters containing ascii code equivalent
> -------------------------------------------
> 
> Let me ask you this simple question first: how do you differentiate a MB
> character from two ordinary ascii characters? The answer is, it starts with
> a code which no ascii character has. If you refer to `man ascii', you will
> see the list of each ascii character, with its corresponding code point.
> 
> However, things are quite different for the second byte or onwards. If an
> encoding is well-designed so that no ascii code (code point used by an ascii
> character) appears in it, then there would be no problem. Unfortunately, not
> all encodings are like that. Some embody ascii codes in some of their
> characters.
> 
> Below is the list of CJK encodings, with the ascii codes they embody. This
> research covers the CJK encodings only, but your contribution
> is welcome.
> 
> + CP932 (Japanese)
> 
>   Some CP932 characters contain ascii code equivalents in their 2nd byte,
>   ranging [40,7E].
>   
>     [40,7E] = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

This makes the character set incompatible with Samba, as a unix
charset.  One of the 'rules' about Samba's unix charset support is that
it must not contain ascii in secondary bytes.  Otherwise, we cannot
correctly use our fast paths.
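
To illustrate why a byte-wise fast path misfires on such a charset,
here is a small sketch. The function names are made up for this mail
and none of this is Samba code. It assumes the CP932 lead-byte ranges
0x81-0x9F and 0xE0-0xFC, and uses the two-byte character <0x83><0x5C>
(katakana SO) as the sample. The first function scans bytes the way a
fast path would; the second skips whole two-byte characters, which is
the extra per-character work the fast paths are there to avoid.

```c
#include <stddef.h>

/* Assumed CP932 lead-byte ranges: such a byte starts a 2-byte character. */
static int cp932_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Byte-wise scan: counts every 0x5C byte, including trail bytes. */
size_t count_backslash_bytes(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (*s == '\\')
            n++;
    }
    return n;
}

/* Charset-aware walk: replaces only genuine backslashes with '/',
 * in place, and returns how many were replaced. */
size_t cp932_backslash_to_slash(char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        if (cp932_is_lead((unsigned char)*s) && s[1] != '\0') {
            s += 2;             /* skip the whole two-byte character */
        } else {
            if (*s == '\\') {
                *s = '/';
                n++;
            }
            s++;
        }
    }
    return n;
}
```

For the bytes <0x5C><0x83><0x5C> the byte-wise scan reports two
backslashes, while the charset-aware walk replaces only the first and
leaves <0x2F><0x83><0x5C>.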

> + UHC (Korean)
> 
>   Some UHC characters contain ascii code equivalents in their 2nd byte,
>   ranging [41,5A], and [61,7A].
> 
>     [41,5A] = ABCDEFGHIJKLMNOPQRSTUVWXYZ
>     [61,7A] = abcdefghijklmnopqrstuvwxyz
> 
> + GB18030 (Simplified Chinese)
> 
>   Some GB18030 characters contain ascii code equivalents in their 2nd byte,
>   ranging [40,7E]. Also, some characters might contain ascii codes [30,39]
>   in their 4th byte.
> 
>     [40,7E] = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
>     [30,39] = 0123456789
> 
> + Big5 (Traditional Chinese)
> 
>   Some Big5 characters contain ascii code equivalents in their 2nd byte,
>   ranging [40,7E].
>   
>     [40,7E] = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
> 
> 
> Problems with MB characters containing ascii codes
> --------------------------------------------------
> 
> From the result above, two questions should be apparent:
> 
> 1. Whether a special ascii (such as '\') could be identified as a part of
>    a MB character or not

Any charset that contains \  (and other ascii characters) outside ascii
is just broken, and samba's breakage is the least of the problems...

> Suppose there exists a MB character <0xXX><0x5C>, where each <> represents
> a single byte and <0xXX> represents an arbitrary code.
> <0x5C> is equivalent to the code point of ascii `\'.
>
> Then, you are provided with a string "\<0xXX><0x5C>" or "<0x5C><0xXX><0x5C>".
> What you would like is to replace the former `\' with '/', but leave the latter
> `\' unchanged (for your reference, the code point for '/' is <0x2F>).
> The resulting value we should expect is <0x2F><0xXX><0x5C>.
> 
> A bad example would be to call the all_string_sub() function and pass the test
> string "<0x5C><0xXX><0x5C>" directly to it. all_string_sub() takes three
> arguments: the first is the string data to be manipulated, the second is
> the matching pattern to be replaced, and the third is the replacement
> string.
> 
> +-------------------------------------------------------------------------+
> |                                                                         |
> |  char *orig = "\\<0xXX><0x5c>\0";   /* <0x5C><0xXX><0x5C> in hex */     |
> |  char buff[128];                                                        |
> |                                                                         |
> |  memset(buff, '\0', sizeof(buff));                                      |
> |  safe_strcpy(buff, orig, sizeof(buff)-1);                               |
> |  all_string_sub(buff, "\\", "/");                                       |
> |                                                                         |
> +-------------------------------------------------------------------------+
> 
> After calling all_string_sub(), the content of "buff" will be <0x2F><0xXX><0x2F>,
> which is not what we intended. The MB character is now malformed
> and Samba is no longer able to use the correct string data.
> 
> The proper way to solve this is to convert the string to UCS2 first so there
> is no risk of having ascii code equivalent, and then perform substitution
> on the basis of UCS2, and finally convert it back to the original string.

Sure, that's the 'right thing'.  It's also *very* expensive.  The reason
that Samba's default and recommended charset is UTF8 is that these
problems were never introduced into UTF8 (the history of UTF8, and the
UTF that preceded it, is well worth reading).
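
The property in question can be checked with a few lines of C. This is
only a sketch with made-up function names, not code from Samba: it
encodes a code point as UTF8 (surrogates are not rejected, to keep it
short) and tests whether any byte of the encoding falls in the ascii
range. For every code point from U+0080 upwards all bytes have the high
bit set, so a byte-wise search for `\' or '/' can never land inside a
multibyte character.

```c
#include <stddef.h>

/* Encode one code point as UTF8. Returns the number of bytes written
 * (1 to 4), or 0 if cp is above U+10FFFF. Surrogates are not checked. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if cp is encodable and no byte of its encoding is below
 * 0x80, that is, no byte can be mistaken for an ascii character. */
int utf8_has_no_ascii_bytes(unsigned long cp)
{
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    size_t i;

    for (i = 0; i < n; i++) {
        if (buf[i] < 0x80)
            return 0;
    }
    return n > 0;
}
```

As a data point, U+30BD (the katakana SO that is <0x83><0x5C> in CP932)
encodes to <0xE3><0x82><0xBD> in UTF8, with no ascii byte in sight.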

Andrew Bartlett

-- 
Andrew Bartlett                                 abartlet at pcug.org.au
Manager, Authentication Subsystems, Samba Team  abartlet at samba.org
Student Network Administrator, Hawker College   abartlet at hawkerc.net
http://samba.org     http://build.samba.org     http://hawkerc.net