[linux-cifs-client] [PATCH 03/10] cifs: add replacement for cifs_strtoUCS_le called cifs_utf16le_to_host

Wed Apr 29 16:19:25 GMT 2009

On Wed, Apr 29, 2009 at 10:58 AM, Jeff Layton <jlayton at redhat.com> wrote:
> On Wed, 29 Apr 2009 10:40:52 -0500
> Shirish Pargaonkar <shirishpargaonkar at gmail.com> wrote:
>
>> On Wed, Apr 29, 2009 at 10:30 AM, Jeff Layton <jlayton at redhat.com> wrote:
>> > On Wed, 29 Apr 2009 10:26:40 -0500
>> > Shirish Pargaonkar <shirishpargaonkar at gmail.com> wrote:
>> >
>> >> On Wed, Apr 29, 2009 at 8:29 AM, Jeff Layton <jlayton at redhat.com> wrote:
>> >> > Add a replacement function for cifs_strtoUCS_le. cifs_utf16le_to_host
>> >> > takes args for the source and destination length so that we can ensure
>> >> > that the function is confined within the intended buffers.
>> >> >
>> >> > Signed-off-by: Jeff Layton <jlayton at redhat.com>
>> >> > ---
>> >> >  fs/cifs/cifs_unicode.c |  121 ++++++++++++++++++++++++++++++++++++++++++++++++
>> >> >  fs/cifs/cifs_unicode.h |    2 +
>> >> >  2 files changed, 123 insertions(+), 0 deletions(-)
>> >> >
>> >> > diff --git a/fs/cifs/cifs_unicode.c b/fs/cifs/cifs_unicode.c
>> >> > index 7d75272..aafaf0d 100644
>> >> > --- a/fs/cifs/cifs_unicode.c
>> >> > +++ b/fs/cifs/cifs_unicode.c
>> >> > @@ -26,6 +26,127 @@
>> >> >  #include "cifs_debug.h"
>> >> >
>> >> >  /*
>> >> > + * cifs_mapchar - convert a little-endian char to proper char in codepage
>> >> > + * @target - where converted character should be copied
>> >> > + * @src_char - 2 byte little-endian source character
>> >> > + * @cp - codepage to which character should be converted
>> >> > + * @mapchar - should character be mapped according to mapchars mount option?
>> >> > + *
>> >> > + * This function handles the conversion of a single character. It is the
>> >> > + * responsibility of the caller to ensure that the target buffer is large
>> >> > + * enough to hold the result of the conversion (at least NLS_MAX_CHARSET_SIZE).
>> >> > + */
>> >> > +static int
>> >> > +cifs_mapchar(char *target, const __le16 src_char, const struct nls_table *cp,
>> >> > +            bool mapchar)
>> >> > +{
>> >> > +       int len = 1;
>> >> > +
>> >> > +       if (!mapchar)
>> >> > +               goto cp_convert;
>> >> > +
>> >> > +       /*
>> >> > +        * BB: Cannot handle remapping UNI_SLASH until all the calls to
>> >> > +        *     build_path_from_dentry are modified, as they use slash as
>> >> > +        *     separator.
>> >> > +        */
>> >> > +       switch (le16_to_cpu(src_char)) {
>> >> > +       case UNI_COLON:
>> >> > +               *target = ':';
>> >> > +               break;
>> >> > +       case UNI_ASTERIK:
>> >> > +               *target = '*';
>> >> > +               break;
>> >> > +       case UNI_QUESTION:
>> >> > +               *target = '?';
>> >> > +               break;
>> >> > +       case UNI_PIPE:
>> >> > +               *target = '|';
>> >> > +               break;
>> >> > +       case UNI_GRTRTHAN:
>> >> > +               *target = '>';
>> >> > +               break;
>> >> > +       case UNI_LESSTHAN:
>> >> > +               *target = '<';
>> >> > +               break;
>> >> > +       default:
>> >> > +               goto cp_convert;
>> >> > +       }
>> >> > +
>> >> > +out:
>> >> > +       return len;
>> >> > +
>> >> > +cp_convert:
>> >> > +       len = cp->uni2char(le16_to_cpu(src_char), target,
>> >> > +                          NLS_MAX_CHARSET_SIZE);
>> >> > +       if (len <= 0) {
>> >> > +               *target = '?';
>> >> > +               len = 1;
>> >> > +       }
>> >> > +       goto out;
>> >> > +}
>> >> > +
>> >> > +/*
>> >> > + * cifs_utf16le_to_host - convert utf16le string to local charset
>> >> > + * @to - destination buffer
>> >> > + * @from - source buffer
>> >> > + * @tolen - destination buffer size (in bytes)
>> >> > + * @fromlen - source buffer size (in bytes)
>> >> > + * @codepage - codepage to which characters should be converted
>> >> > + * @mapchar - should characters be remapped according to the mapchars option?
>> >> > + *
>> >> > + * Convert a little-endian utf16le string (as sent by the server) to a string
>> >> > + * in the provided codepage. The tolen and fromlen parameters are to ensure
>> >> > + * that the code doesn't walk off of the end of the buffer (which is always
>> >> > + * a danger if the alignment of the source buffer is off). The destination
>> >> > + * string is always properly null terminated and fits in the destination
>> >> > + * buffer. Returns the length of the destination string in bytes (including
>> >> > + * null terminator).
>> >> > + */
>> >> > +int
>> >> > +cifs_utf16le_to_host(char *to, const __le16 *from, int tolen, int fromlen,
>> >> > +                    const struct nls_table *codepage, bool mapchar)
>> >> > +{
>> >> > +       int i, charlen, safelen;
>> >> > +       int outlen = 0;
>> >> > +       int nullsize = null_charlen(codepage);
>> >> > +       int fromwords = fromlen / 2;
>> >>
>> >> I think the assumption here is that every character is two bytes.  That is
>> >> correct for the UCS-2 encoding,
>> >> but in UTF-16 a character can take either two or four bytes.
>> >>
>> >
>> > Can you show me a citation? I thought UTF-16 meant a fixed-length 2
>> > byte encoding.
>> >
>>
>> Jeff,
>>
>> http://unicode.org/faq/utf_bom.html#utf16-1
>>
>> I guess we may not encounter some of those (code values) characters,
>> but again we may.
>>
>> >> > +       char tmp[NLS_MAX_CHARSET_SIZE];
>> >> > +
>> >> > +       /*
>> >> > +        * because the chars can be of varying widths, we need to take care
>> >> > +        * not to overflow the destination buffer when we get close to the
>> >> > +        * end of it. Until we get to this offset, we don't need to check
>> >> > +        * for overflow however.
>> >> > +        */
>> >> > +       safelen = tolen - (NLS_MAX_CHARSET_SIZE + nullsize);
>> >>
>> >> Can safelen become negative?  In case of a code value byte stream
>> >> consisting of say two, two byte code values?
>> >>
>> >
>> > Yes. It doesn't matter though. The math where it's checked still works.
>> >
>> >> > +
>> >> > +       for (i = 0; i < fromwords && from[i]; i++) {
>> >> > +               /*
>> >> > +                * check to see if converting this character might make the
>> >> > +                * conversion bleed into the null terminator
>> >> > +                */
>> >> > +               if (outlen >= safelen) {
>> >> > +                       charlen = cifs_mapchar(tmp, from[i], codepage, mapchar);
>> >>
>> >> If mapchar is not set, cifs_mapchar is always going to return 1 (since
>> >> uni2char always returns 1)
>> >> in case of no error.
>> >>
>> >
>> > uni2char does not always return 1. In the case of UTF-8, for instance
>> > it returns the width of the character in bytes that it put in the
>> > destination buffer.
>>
>> Jeff, can you please point me to the file where uni2char is coded thus?
>> I did not find a uni2char function returning more than 1 byte as
>> character length/width;
>> all of them return one, even linux/fs/nls_koi8-u.c  (not sure koi is a
>> Korean charset, I just assumed from koi).
>>
>>
>
> fs/nls/nls_utf8.c
>
> uni2char returns the value returned by utf8_wctomb -- the width of the
> character in bytes. If other charsets are not doing that correctly
> then they are broken.
>
> --
> Jeff Layton <jlayton at redhat.com>
>

I think we may be OK. Most of the nls_*.c charset files return a length of
1 byte from uni2char and char2uni, and that is probably correct for those
charsets, though I do not know enough about all of them.
Some charset files, such as nls_cp932.c and nls_cp936.c, sometimes
return 2 bytes as the character length, so it looks like each charset's
uni2char and char2uni functions return the appropriate character
width for that charset.
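
To make the width question concrete, here is a minimal userspace sketch of
the contract discussed above: a uni2char-style hook returns the number of
bytes it wrote, or a negative value on failure. It is not the kernel's code.
The names toy_utf8_uni2char and toy_mapchar are hypothetical, the UTF-8
encoder covers only single 16-bit code units, and rejecting lone surrogates
is an assumption made here to show where the two-versus-four byte UTF-16
concern would surface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a charset's uni2char hook producing UTF-8.
 * Returns the number of bytes written to out, or -1 on failure.
 * Assumption: one 16-bit code unit in, so surrogate halves are rejected.
 */
static int toy_utf8_uni2char(uint16_t uni, unsigned char *out, int boundlen)
{
	assert(out != NULL);

	if (uni >= 0xD800 && uni <= 0xDFFF)
		return -1;	/* half of a surrogate pair, not a character */

	if (uni < 0x80) {		/* 1-byte sequence (ASCII) */
		if (boundlen < 1)
			return -1;
		out[0] = (unsigned char)uni;
		return 1;
	}
	if (uni < 0x800) {		/* 2-byte sequence */
		if (boundlen < 2)
			return -1;
		out[0] = (unsigned char)(0xC0 | (uni >> 6));
		out[1] = (unsigned char)(0x80 | (uni & 0x3F));
		return 2;
	}
	if (boundlen < 3)		/* 3-byte sequence */
		return -1;
	out[0] = (unsigned char)(0xE0 | (uni >> 12));
	out[1] = (unsigned char)(0x80 | ((uni >> 6) & 0x3F));
	out[2] = (unsigned char)(0x80 | (uni & 0x3F));
	return 3;
}

/*
 * Hypothetical helper mirroring the '?' fallback in the patch's
 * cifs_mapchar: a failed conversion becomes a single '?' byte.
 * The caller must provide at least one byte of space in target.
 */
static int toy_mapchar(unsigned char *target, uint16_t uni, int boundlen)
{
	int len = toy_utf8_uni2char(uni, target, boundlen);

	if (len <= 0) {
		*target = '?';
		len = 1;
	}
	return len;
}
```

With a 4-byte buffer, U+0041 converts to one byte, U+00E9 to two, and
U+20AC to three, which is why the caller cannot assume a width of 1. A lone
surrogate such as 0xD83D falls through to the '?' fallback in this sketch.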