[linux-cifs-client] [PATCH 03/10] cifs: add replacement for cifs_strtoUCS_le called cifs_utf16le_to_host

Wed Apr 29 15:48:40 GMT 2009

On Wed, 29 Apr 2009 11:30:23 -0400
Jeff Layton <jlayton at redhat.com> wrote:

> > > +/*
> > > + * cifs_utf16le_to_host - convert utf16le string to local charset
> > > + * @to - destination buffer
> > > + * @from - source buffer
> > > + * @tolen - destination buffer size (in bytes)
> > > + * @fromlen - source buffer size (in bytes)
> > > + * @codepage - codepage to which characters should be converted
> > > + * @mapchar - should characters be remapped according to the mapchars option?
> > > + *
> > > + * Convert a little-endian utf16le string (as sent by the server) to a string
> > > + * in the provided codepage. The tolen and fromlen parameters are to ensure
> > > + * that the code doesn't walk off of the end of the buffer (which is always
> > > + * a danger if the alignment of the source buffer is off). The destination
> > > + * string is always properly null terminated and fits in the destination
> > > + * buffer. Returns the length of the destination string in bytes (including
> > > + * null terminator).
> > > + */
> > > +int
> > > +cifs_utf16le_to_host(char *to, const __le16 *from, int tolen, int fromlen,
> > > +                    const struct nls_table *codepage, bool mapchar)
> > > +{
> > > +       int i, charlen, safelen;
> > > +       int outlen = 0;
> > > +       int nullsize = null_charlen(codepage);
> > > +       int fromwords = fromlen / 2;
> > 
> > I think the assumption here is that code values are two bytes. I think
> > that is correct for the UCS-2 encoding, but in UTF-16 a code value
> > can be either two or four bytes.
> > 
> 
> Can you show me a citation? I thought UTF-16 meant a fixed-length 2
> byte encoding.
> 

Ahh ok, I'm wrong here. Simo straightened me out...

"Characters" in UTF-16 are a multiple of 16 bits: one 16-bit word, or
two (a surrogate pair) for anything outside the basic plane. So we *can*
have multiword chars (similar to how UTF-8 works). The problem we have
though is that the nls code in the kernel isn't set up to deal with that.
So I don't think we can really do much at this point other than to treat
each 16-bit word as a character. If it's untranslatable to the
local charset, we'll just call it a '?' and move on.

Given that, I'll probably rename these functions to have _ucs2le_
instead of _utf16le_ since that's more accurate.

The good news is that this does not materially change how the buffer
sizing works here, and that's the immediate problem that we're trying
to solve with these patches.
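
To make the "treat each 16-bit word as a character" idea concrete, here
is a rough userspace sketch. It is not the kernel function and does not
use the nls code: it assumes plain ASCII as the target charset, and the
names get_le16 and words_to_ascii are made up for illustration. Each
half of a surrogate pair simply comes out as its own '?'.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one little-endian 16-bit word from raw bytes. */
static uint16_t get_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Hypothetical sketch, assuming an ASCII target charset: convert each
 * 16-bit word to one byte, writing '?' for any word with no ASCII
 * equivalent (this includes both halves of a surrogate pair).
 * The output is always null terminated and never exceeds tolen bytes.
 * Returns the number of bytes written, including the null terminator
 * (0 if tolen is 0).
 */
static size_t words_to_ascii(char *to, size_t tolen,
			     const unsigned char *from, size_t fromlen)
{
	size_t fromwords = fromlen / 2;	/* a trailing odd byte is ignored */
	size_t i, out = 0;

	if (tolen == 0)
		return 0;

	/* Always leave room for the terminator. */
	for (i = 0; i < fromwords && out + 1 < tolen; i++) {
		uint16_t w = get_le16(from + 2 * i);

		if (w == 0)	/* stop at a terminator in the source */
			break;
		to[out++] = (w < 0x80) ? (char)w : '?';
	}
	to[out++] = '\0';
	return out;
}
```

Since every source word yields exactly one output byte here, the
destination never needs more than fromlen / 2 + 1 bytes in this
ASCII-only sketch. A real multibyte target charset would need more
room per word.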

-- 
Jeff Layton <jlayton at redhat.com>