[linux-cifs-client] [PATCH] cifs: Fix insufficient memory allocation for nativeFileSystem field

Thu Apr 9 15:09:37 GMT 2009

On Thu, 2009-04-09 at 10:40 -0400, Jeff Layton wrote:
> On Thu, 09 Apr 2009 19:59:13 +0530
> Suresh Jayaraman <sjayaraman at suse.de> wrote:
> 
> > Steve French wrote:
> > > On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman at suse.de> wrote:
> > >> Jeff Layton wrote:
> > >>> On Mon, 06 Apr 2009 22:33:09 +0530
> > >>> Suresh Jayaraman <sjayaraman at suse.de> wrote:
> > >>>
> > >>>> Steve French wrote:
> > >>>>> I don't think that we should be using these size assumptions
> > >>>>> (multiples of UCS stringlen).  A new UCS helper function should be
> > >>>>> created that calculates how much memory would be needed for a
> > >>>>> converted string - and we need to use this before we do the malloc and
> > >>>>> string conversion.  In effect a strlen and strnlen function that takes
> > >>>>> a target code page argument.  For strings that will never be more than
> > >>>>> a hundred bytes this may not be needed, and we can use the length
> > >>>>> assumption, but since mallocs in kernel can be so expensive I would
> > >>>>> rather calculate the actual string length needed for the target.
> > >>>> Ah, ok. I thought of writing a little function based on
> > >>>> cifs_strncpy_to_host() and adding a comment like below:
> > >>>>
> > >>>> /* UniStrnlen() returns length in 16 bit Unicode characters
> > >>>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8
> > >>>>  * character can be up to 8 bytes maximum, so we need to
> > >>>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the
> > >>>>  * UTF-8 string */
> > >>>>
> > >>> I think you'll have to basically do the conversion twice. Walk the
> > >>> string once and convert each character determine its length and then
> > >>> discard it. Get the total and allocate that many bytes (plus the null
> > >> Thanks for explaining. It seems adding a new UCS helper that computes
> > >> length in bytes like the below would be good enough and make use of it
> > >> to compute length for memory allocation.
> > >>
> > >>> termination), and do the conversion again into the buffer.
> > >> Do we still need this conversion again?
> > >>
> > >>
> > >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
> > >> index 14eb9a2..0396bdc 100644
> > >> --- a/fs/cifs/cifs_unicode.h
> > >> +++ b/fs/cifs/cifs_unicode.h
> > >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
> > >>  }
> > >>
> > >>  /*
> > >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
> > >> + */
> > >> +static inline size_t
> > >> +UniStrnlenBytes(const unsigned char *str, int maxlen)
> > >> +{
> > >> +       size_t nbytes = 0;
> > >> +       wchar_t *uni;
> > >> +
> > >> +       while (*str++) {
> > >> +               /* convert each char, find its length and add to nbytes */
> > >> +               if (char2uni(str, maxlen, uni) > 0)
> > >> +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
> > >> +       }
> > >> +       return nbytes;
> > >> +}
> > >> +
> > >> +/*
> > >>
> > >> We would still be needing the version (UniStrnlen) that returns length
> > >> in characters also.
> > >>
> > >>> I'm not truly convinced this is really necessary though. You have to
> > >>> figure that kmalloc is a power-of-two allocator. If you kmalloc 17
> > >>> bytes, you get 32 anyway. You'll probably end up using roughly the same
> > >>> amount of memory that you would have had you just estimated the size.
> > > 
> > > Shaggy made the comment that the string length calculation probably
> > > won't matter (exact size vs. estimate) for most cases in cifs since
> > > small allocations off the slab are fairly fast and it doesn't hurt to
> > > overallocate by this amount.  Although for the typical cases a
> > > Unicode string usually will shrink when converted to UTF-8 obviously
> > > we have to allow for the maximum size conversion.
> > > 
> > > 
> > 
> > OTOH, felix-suse at fefe.de pointed me to utf-8 man page:
> > 
> > * UTF-8 encoded UCS characters may be up to six bytes long, however the
> > Unicode standard specifies no characters above 0x10ffff, so Unicode
> > characters can only be up to four bytes long in UTF-8.
> 
> Don't they mean 3 bytes there?

Nope, Suresh is right: RFC 3629 restricted the Unicode range so that
effectively all valid characters are represented within 4 bytes.

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So in theory 4 bytes are enough. I guess you then have to make sure the
conversion routines used respect this limit, and you have to think about
how to handle cases where invalid sequences come in.
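
To make the table above concrete, here is a minimal userspace sketch of
an exact length calculation. It is not code from the cifs tree: the name
utf16_to_utf8_len is hypothetical, the input is assumed to be host-endian
16-bit units, and an unpaired surrogate is assumed to cost 3 bytes (as if
it were replaced by U+FFFD).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes needed to store a UTF-16 string
 * as UTF-8, not counting the terminating NUL. Scans at most maxunits
 * 16-bit units and stops early at a zero unit.
 */
static size_t utf16_to_utf8_len(const uint16_t *s, size_t maxunits)
{
    size_t nbytes = 0;
    size_t i = 0;

    while (i < maxunits && s[i] != 0) {
        uint16_t u = s[i];

        if (u < 0x80) {
            nbytes += 1;                /* 0000-007F */
        } else if (u < 0x800) {
            nbytes += 2;                /* 0080-07FF */
        } else if (u >= 0xD800 && u <= 0xDBFF &&
                   i + 1 < maxunits &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            nbytes += 4;                /* surrogate pair: 10000-10FFFF */
            i++;                        /* consume the low surrogate too */
        } else {
            /* Rest of the BMP. Assumption: an unpaired surrogate is
             * also budgeted at 3 bytes. */
            nbytes += 3;
        }
        i++;
    }
    return nbytes;
}
```

Note that a 4-byte sequence always comes from two 16-bit units, so under
these assumptions a single unit never needs more than 3 bytes of UTF-8.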

Aside:
I think Windows servers will return you "invalid" characters as UTF-16
sequences. For example, in their NFS server, MS uses some of the space
to map characters that are invalid for Windows in filenames but valid
over NFS; if your cifs client is used to access shares that are also
exported via NFS on Windows, I believe the server may send you these
characters. IIRC, at least in some versions, we have also seen that
their file system layer did not validate UTF-16 sequences (because
initially it was UCS2, where all values are "valid").

So in talking with Windows servers you must think about how to deal with
these deviations from the standards (yeah, fun). Of course one way to
deal with it is to just throw an error and deem the sequence invalid.
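
The "just throw an error" option could look roughly like the check
below. This is only a sketch: utf16_is_well_formed is a hypothetical
name, the input is assumed to be host-endian 16-bit units, and the only
thing checked is surrogate pairing.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: return 1 if the nunits 16-bit units in s form
 * well-formed UTF-16 (every high surrogate is followed by a low
 * surrogate, and no low surrogate stands alone), 0 otherwise.
 */
static int utf16_is_well_formed(const uint16_t *s, size_t nunits)
{
    size_t i;

    for (i = 0; i < nunits; i++) {
        uint16_t u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate. */
            if (i + 1 >= nunits ||
                s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return 0;
            i++;                        /* skip the low surrogate */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            return 0;
        }
    }
    return 1;
}
```

A caller would reject the string (for example with an error return)
when this yields 0, instead of trying to convert it.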

> > Going by this, length * 2 (original code) might still be sufficient?
> > 
> 
> I think the safest thing is still to just calculate the exact lengths of
> the buffer before allocating. It's hard to imagine that it'll have
> significant performance impact.

UTF-16 -> UTF-8 can be expensive in some cases; it really depends on how
critical the code path that needs the conversion is.
It may make sense to allow the conversion routines to do the memory
allocation on their own.
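
As an illustration of a conversion routine that allocates for itself,
here is a userspace sketch using malloc (kernel code would use kmalloc
instead). None of these names exist in cifs: next_cp, cp_len and
utf16_to_utf8_alloc are hypothetical, the input is assumed to be
host-endian, and unpaired surrogates are assumed to be replaced by
U+FFFD rather than rejected.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode one code point from s[*i..n) and advance *i past it.
 * Assumption: an unpaired surrogate decodes as U+FFFD. */
static uint32_t next_cp(const uint16_t *s, size_t n, size_t *i)
{
    uint32_t u = s[(*i)++];

    if (u >= 0xD800 && u <= 0xDBFF && *i < n &&
        s[*i] >= 0xDC00 && s[*i] <= 0xDFFF) {
        uint32_t lo = s[(*i)++];
        return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
    }
    if (u >= 0xD800 && u <= 0xDFFF)
        return 0xFFFD;
    return u;
}

/* UTF-8 length in bytes of one code point (at most 4 per RFC 3629). */
static size_t cp_len(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* Convert n UTF-16 units into a newly allocated, NUL-terminated UTF-8
 * string. The caller frees the result. Returns NULL if malloc fails. */
static char *utf16_to_utf8_alloc(const uint16_t *s, size_t n)
{
    size_t i, len = 0;
    unsigned char *out, *p;

    /* Pass 1: compute the exact output length. */
    for (i = 0; i < n; )
        len += cp_len(next_cp(s, n, &i));

    out = malloc(len + 1);
    if (!out)
        return NULL;

    /* Pass 2: encode into the exactly sized buffer. */
    p = out;
    for (i = 0; i < n; ) {
        uint32_t cp = next_cp(s, n, &i);

        switch (cp_len(cp)) {
        case 1:
            *p++ = (unsigned char)cp;
            break;
        case 2:
            *p++ = (unsigned char)(0xC0 | (cp >> 6));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        case 3:
            *p++ = (unsigned char)(0xE0 | (cp >> 12));
            *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        default:
            *p++ = (unsigned char)(0xF0 | (cp >> 18));
            *p++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        }
    }
    *p = '\0';
    return (char *)out;
}
```

This walks the string twice, which is the cost being weighed above; the
benefit is that callers no longer have to guess a buffer size.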

> If we later find that it does then we can look at optimizing those cases
> for speed instead of size, but at least at that point we're working
> with code that has the buffers sufficiently sized.

Makes sense.

Simo.

-- 
Simo Sorce
Samba Team GPL Compliance Officer <simo at samba.org>
Principal Software Engineer at Red Hat, Inc. <simo at redhat.com>