[linux-cifs-client] [RFC][PATCH] cifs: add helper to simplify handling of unicode strings and use it

Sun Apr 12 14:23:29 GMT 2009

Suresh Jayaraman schrieb:
> Based on the recent discussions on opencoded, inconsistent allocation of
> memory needed for a unicode string conversion, the consensus is to
> calculate the memory needed exactly instead of using size assumptions
> and to consolidate some code that allocates memory and does conversion
> in a helper function and use it widely.
>
> This patch attempts to do the same. Please note that non-unicode cases
> has not been converted now as they are less error-prone than unicode
> cases. I've only compile tested this patch and yet to test it
> completely, but posting this little premature patch as there are lot
> of discussions happening around this code. This patch requires careful
> review.
>
> Thanks to Jeff, Steve and Simo for useful suggestions.
>
> Signed-off-by: Suresh Jayaraman <sjayaraman at suse.de>
> ---
>   

First of all, it should be clear what means "Unicode" when we talk of 
"Unicode".

There are variants with a fixed character length and a variable 
character length.

When talking about Unicode for CIFS, there is meant UCS-2 or UTF-16.
UCS-2 is fixed with 2 bytes per character.
UTF-16 can be 2 bytes or 4 bytes long.

That is the _input_.

And now the conversion to the native codepage on the system is done.

If the NLS-Codepage is e.g. iso-8859-1, all characters are 1 byte long
The first 128 characters are like ASCII, and the upper 128 characters 
are specific.

In that case we could theoretically take the length from Unistrlen and 
use this value and handle it as if it were bytes.

But now we are at the point. Imagine, the system uses UTF-8.

UTF-8, which is a form of Unicode, can be one byte long, or it can be 4 
bytes long (or 2 or 3 bytes). 5-8 bytes are not used now.
We do not know before.
We can say, a UTF-8 string is at least the number of characters long (in 
bytes), and at maximum 4 times the number of characters (worst case).

I'm not familiar with the linux kernel - is there a way to determine the 
name of the NLS that is set on the system?
>  fs/cifs/cifs_unicode.h |   22 ++++++++++++++++++++
>  fs/cifs/cifsproto.h    |    2 +
>  fs/cifs/cifssmb.c      |   38 +++++++++++++++++++++++++++++++++++
>  fs/cifs/connect.c      |   51 +++++++++++++++--------------------------------
>  fs/cifs/sess.c         |   36 +++++++++++----------------------
>  5 files changed, 90 insertions(+), 59 deletions(-)
>
> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
> index 14eb9a2..43c0438 100644
> --- a/fs/cifs/cifs_unicode.h
> +++ b/fs/cifs/cifs_unicode.h
> @@ -34,6 +34,7 @@
>  #include <asm/byteorder.h>
>  #include <linux/types.h>
>  #include <linux/nls.h>
> +#include "cifspdu.h"
>  
>  #define  UNIUPR_NOLOWER		/* Example to not expand lower case tables */
>  
> @@ -63,6 +64,7 @@ int cifs_strfromUCS_le(char *, const __le16 *, int, const struct nls_table *);
>  int cifs_strtoUCS(__le16 *, const char *, int, const struct nls_table *);
>  #endif
>  
> +#define UNISTR_MAX_SIZE	(MAX_PATHCONF * (NLS_MAX_CHARSET_SIZE + 1))
>  /*
>   * UniStrcat:  Concatenate the second string to the first
>   *
> @@ -159,6 +161,26 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
>  }
>  
>  /*
> + * UniStrnlenBytes: Return the length in bytes
> + */
>   
The length in bytes in NLS format. That should be pointed out here.

> +static inline int
> +UniStrnlenBytes(const char *str, int maxlen, const struct nls_table *codepage)
> +{
> +	int rc;
> +	wchar_t uni;
> +	size_t nbytes = 0;
> +	char buf[UNISTR_MAX_SIZE];
>
>   
Nice, but what if src is longer? buf would overflow, doesn't it?

> +	while (*str++) {
> +		rc = codepage->char2uni(str, maxlen, &uni);
>   
Uni2char
> +		if (rc == -1)
> +			return -EINVAL;
> +		nbytes += utf8_wcstombs(buf, &uni, maxlen);
>   
This function internally uses uni2char, so one time should be enough.
> +	}
> +	return nbytes;
> +}
> +
> +/*
>   * UniStrncat:  Concatenate length limited string
>   */
>  static inline wchar_t *
> diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
> index 4167716..d154ed1 100644
> --- a/fs/cifs/cifsproto.h
> +++ b/fs/cifs/cifsproto.h
> @@ -383,4 +383,6 @@ extern int CIFSSMBSetPosixACL(const int xid, struct cifsTconInfo *tcon,
>  		const struct nls_table *nls_codepage, int remap_special_chars);
>  extern int CIFSGetExtAttr(const int xid, struct cifsTconInfo *tcon,
>  			const int netfid, __u64 *pExtAttrBits, __u64 *pMask);
> +extern int cifs_nls_to_str(char **dst, const char *src, const int maxlen,
> +			   int *plen, const struct nls_table *nls_codepage);
>  #endif			/* _CIFSPROTO_H */
> diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
> index bc09c99..963f4f9 100644
> --- a/fs/cifs/cifssmb.c
> +++ b/fs/cifs/cifssmb.c
> @@ -81,6 +81,44 @@ static struct {
>  #endif /* CONFIG_CIFS_WEAK_PW_HASH */
>  #endif /* CIFS_POSIX */
>  
> +
> +/*
> + * Calculates, allocates memory for nls and converts it to character string.
> + * Note: caller is responsible for freeing dst if function returned 0.
> + * returns:
> + * 	on success - 0
> + * 	on failure - errno
> + */
> +int
> +cifs_nls_to_str(char **dst, const char *src, const int maxlen, int *plen,
> +		       const struct nls_table *nls_codepage)
>   
Should it be called cifs_utf16_to_nls ?
> +{
> +	size_t nbytes;
> +
> +	*plen = UniStrnlen((wchar_t *)src, maxlen);
> +	nbytes = UniStrnlenBytes(src, maxlen, nls_codepage);
> +
> +	/* We look for obvious messed up bcc or strings in response so we do
> +	 * not go off the end since (at least) WIN2K and * Windows XP have a
> +	 * major bug in not null terminating last Unicode string in response
> +	 */
> +	kfree(*dst);
> +
> +	*dst = kzalloc(nbytes + 2, GFP_KERNEL);
> +	if (!*dst)
> +		goto err_exit;
>
>   
Is src null-terminated here?
> +	cifs_strfromUCS_le(*dst, (__le16 *)src, *plen, nls_codepage);
> +	(*dst)[*plen] = 0;
> +	(*dst)[*plen + 1] = 0; /* harmless for ASCII case, needed for Unicode */
>   
This ugly piece of code should be replaced.
In ASCII (and all it's compatibles like ISO-8859, UTF-8!), a 00-octet 
means end of string.
Why should we use two bytes, when only one is needed?
Perhaps we can modify cifs_strfromUCS_le so that it leaves the 
null-termination and converts it to native.
Another common question: Is it possible to have UTF-16 as NLS?
Otherwise an NLS-String is always terminated with only one zero-byte.

Peter