[SCM] Samba Shared Repository - branch v3-6-test updated

Michael Adam obnox at samba.org
Thu Nov 4 09:46:57 MDT 2010


The branch, v3-6-test has been updated
       via  e5226a3 librpc/ndr: use new strlen_m_ext_term() in ndr_charset_length(): fix bug #7594
       via  726b927 lib/util/charset/charconv: clarify comments in next_codepoint_convenience_ext()
       via  2b00a19 lib/util/charset/util_unistr: clarify the comment header for strlen_m().
       via  c3d4655 lib/util/charset/util_unistr: add strlen_m_ext_term() - variant of strlen_m_ext() counting terminator
       via  8a9fbf5 lib/util/charset/util_unistr: add strlen_m_ext that takes input and output charset
       via  4a4a4c2 lib/charcnv/util_unistr: add next_codepoint_ext() that accepts input charset.
       via  00a1630 lib/charset/charcnv: rename a parameter of next_codepoint_convenience_ext() for clarity
       via  8288220 lib/charset/charcnv: add next_codepoint_convenience_ext() that accepts input charset.
       via  7c6894e s3:util_str: add strlen_m_ext_term() - variant of strlen_m_ext() counting terminator
       via  aae362e s3:lib/util_str: add strlen_m_ext() that takes input and output charset
       via  efd990b s3:lib/util_str: clarify the comment header for strlen_m().
       via  7b99de4 s3:lib/charcnv: clarify comments in next_codepoint_ext()
       via  93dde94 s3:lib/charcnv: rename a parameter for clarity in next_codepoint_ext()
       via  db05686 s3:lib/charcnv: reformat comments in next_codepoint_ext()
       via  190d0b2 s3:lib/charcnv: add next_codepoint_ext() that accepts input charset.
       via  9775047 util/charset: remove a duplicate comment.
      from  0690434 s3: Align nttrans replies the same way Windows does it

http://gitweb.samba.org/?p=samba.git;a=shortlog;h=v3-6-test


- Log -----------------------------------------------------------------
commit e5226a3bd34eafae8870fde8bf4d30ea246bd963
Author: Michael Adam <obnox at samba.org>
Date:   Sun Oct 31 02:04:25 2010 +0200

    librpc/ndr: use new strlen_m_ext_term() in ndr_charset_length(): fix bug #7594
    
    This fixes the calculation of needed space for destination unicode charset.
    
    Autobuild-User: Michael Adam <obnox at samba.org>
    Autobuild-Date: Wed Nov  3 23:28:07 UTC 2010 on sn-devel-104

commit 726b927c67c8e3f607eefc335b2ba81fa38537a9
Author: Michael Adam <obnox at samba.org>
Date:   Sun Oct 31 08:21:41 2010 +0100

    lib/util/charset/charconv: clarify comments in next_codepoint_convenience_ext()
    
    Give the unicode U+<hexnumber> notation of the codepoints
    referred to in the comments. Also reformat the comments somewhat.

commit 2b00a19598b22df0e0413698c4cc1e332ecd1592
Author: Michael Adam <obnox at samba.org>
Date:   Sun Oct 31 08:02:17 2010 +0100

    lib/util/charset/util_unistr: clarify the comment header for strlen_m().

commit c3d46552602f450be24f08b8ff4d88911b22ec78
Author: Michael Adam <obnox at samba.org>
Date:   Sun Oct 31 02:02:16 2010 +0200

    lib/util/charset/util_unistr: add strlen_m_ext_term() - variant of strlen_m_ext() counting terminator

commit 8a9fbf594a3d1ab9a4e6efb663ae7e58f8213592
Author: Michael Adam <obnox at samba.org>
Date:   Sat Oct 30 02:03:02 2010 +0200

    lib/util/charset/util_unistr: add strlen_m_ext that takes input and output charset
    
    The function calculates the number of units (8 or 16-bit, depending
    on the destination charset) that would be needed to convert the
    input string, which is expected to be in src_charset encoding,
    to the dst_charset (which should be a unicode charset).

commit 4a4a4c2c1a242d5c067a40e2f38ee4a5993b29bf
Author: Michael Adam <obnox at samba.org>
Date:   Sun Oct 31 02:18:46 2010 +0100

    lib/charcnv/util_unistr: add next_codepoint_ext() that accepts input charset.
    
    next_codepoint() takes a string in CH_UNIX encoding and returns the
    unicode codepoint of the next (possibly multibyte) character of the
    input string.
    
    The new next_codepoint_ext() function adds the encoding of the input
    string as a parameter. next_codepoint() now only calls next_codepoint_ext()
    with CH_UNIX as the src_charset argument.

commit 00a163072e6162b0d50139a94356ef2f3256307d
Author: Michael Adam <obnox at samba.org>
Date:   Mon Nov 1 15:53:43 2010 +0100

    lib/charset/charcnv: rename a parameter of next_codepoint_convenience_ext() for clarity

commit 8288220921e50a72ecc3d03dd7e6b176e8353062
Author: Michael Adam <obnox at samba.org>
Date:   Fri Oct 29 22:06:05 2010 +0200

    lib/charset/charcnv: add next_codepoint_convenience_ext() that accepts input charset.
    
    next_codepoint_convenience() takes a string in CH_UNIX encoding and returns the
    unicode codepoint of the next (possibly multibyte) character of the
    input string.
    
    The new next_codepoint_convenience_ext() function adds the encoding of the input
    string as a parameter. next_codepoint_convenience() now only calls
    next_codepoint_convenience_ext() with CH_UNIX as the src_charset argument.

commit 7c6894ee7d4e88d698c4139a67b1d7898a07f765
Author: Michael Adam <obnox at samba.org>
Date:   Sun Oct 31 02:02:16 2010 +0200

    s3:util_str: add strlen_m_ext_term() - variant of strlen_m_ext() counting terminator

commit aae362e6cf5ff26486463d4a08469071d4f2ea65
Author: Michael Adam <obnox at samba.org>
Date:   Sat Oct 30 02:03:02 2010 +0200

    s3:lib/util_str: add strlen_m_ext() that takes input and output charset
    
    The function calculates the number of units (8 or 16-bit, depending
    on the destination charset) that would be needed to convert the
    input string, which is expected to be in src_charset encoding,
    to the dst_charset (which should be a unicode charset).

commit efd990b581940e98c5128f4d163db4c1f6a8ec29
Author: Michael Adam <obnox at samba.org>
Date:   Fri Oct 29 22:21:47 2010 +0200

    s3:lib/util_str: clarify the comment header for strlen_m().

commit 7b99de4501e3f221262712f1575f97599d7bbcba
Author: Michael Adam <obnox at samba.org>
Date:   Fri Oct 29 22:11:30 2010 +0200

    s3:lib/charcnv: clarify comments in next_codepoint_ext()
    
    (giving the unicode U+<hexnumber> notation of the codepoints
     referred to in the comments)

commit 93dde9415b50a9a0afeb570848ba553a452849ff
Author: Michael Adam <obnox at samba.org>
Date:   Mon Nov 1 15:42:21 2010 +0100

    s3:lib/charcnv: rename a parameter for clarity in next_codepoint_ext()

commit db05686db7891c25209b9ed019c8d5eda28dd527
Author: Michael Adam <obnox at samba.org>
Date:   Mon Nov 1 15:42:21 2010 +0100

    s3:lib/charcnv: reformat comments in next_codepoint_ext()

commit 190d0b28f452ecc427de52da3a6469af3e225488
Author: Michael Adam <obnox at samba.org>
Date:   Fri Oct 29 22:06:05 2010 +0200

    s3:lib/charcnv: add next_codepoint_ext() that accepts input charset.
    
    next_codepoint() takes a string in CH_UNIX encoding and returns the
    unicode codepoint of the next (possibly multibyte) character of the
    input string.
    
    The new next_codepoint_ext() function adds the encoding of the input
    string as a parameter. next_codepoint() now only calls next_codepoint_ext()
    with CH_UNIX as the src_charset argument.

commit 977504757270e069b6221ab559830d4a29005812
Author: Michael Adam <obnox at samba.org>
Date:   Fri Oct 29 20:50:28 2010 +0200

    util/charset: remove a duplicate comment.
    
    This seems to have been copied twice from source3/ code.

-----------------------------------------------------------------------
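
The unit counting that the log describes for strlen_m_ext() and
strlen_m_ext_term() can be sketched in isolation as follows. This is only
an illustration, not the Samba code: it works on already-decoded
codepoints instead of calling next_codepoint_ext(), and the names
sketch_charset, units_for_codepoint() and units_for_codepoints() are
hypothetical. It uses the standard UTF-8 boundary of 0x10000 for the
three-byte range, whereas the hunk as posted below compares against 0x1000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Samba's charset_t; only two targets are modelled. */
enum sketch_charset { SKETCH_UTF16, SKETCH_UTF8 };

/*
 * Number of code units needed for one codepoint:
 * 16-bit units for UTF-16, 8-bit units for UTF-8.
 * Like the patch, this only checks ranges, not codepoint validity.
 */
static size_t units_for_codepoint(uint32_t c, enum sketch_charset dst)
{
	switch (dst) {
	case SKETCH_UTF16:
		/* Codepoints above the BMP need a surrogate pair. */
		return (c < 0x10000) ? 1 : 2;
	case SKETCH_UTF8:
		if (c < 0x80) {
			return 1;
		} else if (c < 0x800) {
			return 2;
		} else if (c < 0x10000) {
			return 3;
		}
		return 4;
	}
	/* Unknown target: assume one unit per codepoint. */
	return 1;
}

/*
 * Sum the units for n codepoints. If count_terminator is non-zero,
 * add one unit for the terminator, as strlen_m_ext_term() does.
 */
static size_t units_for_codepoints(const uint32_t *cps, size_t n,
				   enum sketch_charset dst,
				   int count_terminator)
{
	size_t count = 0;
	size_t i;

	assert(cps != NULL || n == 0);

	for (i = 0; i < n; i++) {
		count += units_for_codepoint(cps[i], dst);
	}
	return count + (count_terminator ? 1 : 0);
}
```

For the codepoints U+0041, U+00E9, U+20AC and U+1F600 this gives 5 units
for UTF-16 and 10 units for UTF-8, or 6 and 11 when the terminator is
counted.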

Summary of changes:
 lib/util/charset/charcnv.c     |   66 ++++++++++++++++++++++----------
 lib/util/charset/charset.h     |   13 ++++--
 lib/util/charset/util_unistr.c |   77 +++++++++++++++++++++++++++++++++-----
 librpc/ndr/ndr_string.c        |    4 +-
 source3/include/proto.h        |    6 +++
 source3/lib/charcnv.c          |   81 ++++++++++++++++++++++++++-------------
 source3/lib/util_str.c         |   77 ++++++++++++++++++++++++++++++++------
 7 files changed, 248 insertions(+), 76 deletions(-)
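
The charcnv.c hunks below convert the next character to UTF-16LE and then
compute the codepoint from the resulting 2 or 4 bytes. A minimal sketch of
that last step, following the surrogate-pair scheme of RFC 2781, might
look like this. It is not the Samba implementation: decode_utf16le() and
SKETCH_INVALID_CODEPOINT are hypothetical names, and the bytes are read by
hand instead of through the SVAL() macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error marker, standing in for Samba's INVALID_CODEPOINT. */
#define SKETCH_INVALID_CODEPOINT 0xFFFFFFFFu

/* Read one little-endian 16-bit unit from two bytes. */
static uint32_t read_u16le(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8);
}

/*
 * Turn a UTF-16LE sequence of 2 or 4 bytes into a codepoint.
 * A 4-byte sequence must be a high surrogate (0xD800..0xDBFF)
 * followed by a low surrogate (0xDC00..0xDFFF).
 */
static uint32_t decode_utf16le(const uint8_t *buf, size_t len)
{
	assert(buf != NULL);

	if (len == 2) {
		/* Single 16-bit unit: the value is the codepoint. */
		return read_u16le(buf);
	}
	if (len == 4) {
		uint32_t w1 = read_u16le(buf);
		uint32_t w2 = read_u16le(buf + 2);

		if (w1 < 0xD800 || w1 > 0xDBFF ||
		    w2 < 0xDC00 || w2 > 0xDFFF) {
			return SKETCH_INVALID_CODEPOINT;
		}
		/* 10 payload bits from each unit, offset by 0x10000. */
		return 0x10000 + (((w1 & 0x3FF) << 10) | (w2 & 0x3FF));
	}
	return SKETCH_INVALID_CODEPOINT;
}
```

As an example, the bytes 3D D8 00 DE (the units 0xD83D, 0xDE00) decode to
U+1F600, and the bytes E9 00 decode to U+00E9.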


Changeset truncated at 500 lines:

diff --git a/lib/util/charset/charcnv.c b/lib/util/charset/charcnv.c
index e9f6ab0..f8aeea3 100644
--- a/lib/util/charset/charcnv.c
+++ b/lib/util/charset/charcnv.c
@@ -373,17 +373,25 @@ _PUBLIC_ bool convert_string_talloc_convenience(TALLOC_CTX *ctx,
 	return true;
 }
 
-/*
-  return the unicode codepoint for the next multi-byte CH_UNIX character
-  in the string
 
-  also return the number of bytes consumed (which tells the caller
-  how many bytes to skip to get to the next CH_UNIX character)
-
-  return INVALID_CODEPOINT if the next character cannot be converted
-*/
-_PUBLIC_ codepoint_t next_codepoint_convenience(struct smb_iconv_convenience *ic, 
-				    const char *str, size_t *size)
+/**
+ * Return the unicode codepoint for the next character in the input
+ * string in the given src_charset.
+ * The unicode codepoint (codepoint_t) is an unsinged 32 bit value.
+ *
+ * Also return the number of bytes consumed (which tells the caller
+ * how many bytes to skip to get to the next src_charset-character).
+ *
+ * This is implemented (in the non-ascii-case) by first converting the
+ * next character in the input string to UTF16_LE and then calculating
+ * the unicode codepoint from that.
+ *
+ * Return INVALID_CODEPOINT if the next character cannot be converted.
+ */
+_PUBLIC_ codepoint_t next_codepoint_convenience_ext(
+			struct smb_iconv_convenience *ic,
+			const char *str, charset_t src_charset,
+			size_t *bytes_consumed)
 {
 	/* it cannot occupy more than 4 bytes in UTF16 format */
 	uint8_t buf[4];
@@ -394,24 +402,27 @@ _PUBLIC_ codepoint_t next_codepoint_convenience(struct smb_iconv_convenience *ic
 	char *outbuf;
 
 	if ((str[0] & 0x80) == 0) {
-		*size = 1;
+		*bytes_consumed = 1;
 		return (codepoint_t)str[0];
 	}
 
-	/* we assume that no multi-byte character can take
-	   more than 5 bytes. This is OK as we only
-	   support codepoints up to 1M */
+	/*
+	 * we assume that no multi-byte character can take more than 5 bytes.
+	 * This is OK as we only support codepoints up to 1M (U+100000)
+	 */
 	ilen_orig = strnlen(str, 5);
 	ilen = ilen_orig;
 
-	descriptor = get_conv_handle(ic, CH_UNIX, CH_UTF16);
+	descriptor = get_conv_handle(ic, src_charset, CH_UTF16);
 	if (descriptor == (smb_iconv_t)-1) {
-		*size = 1;
+		*bytes_consumed = 1;
 		return INVALID_CODEPOINT;
 	}
 
-	/* this looks a little strange, but it is needed to cope
-	   with codepoints above 64k */
+	/*
+	 * this looks a little strange, but it is needed to cope with
+	 * codepoints above 64k (U+1000) which are encoded as per RFC2781.
+	 */
 	olen = 2;
 	outbuf = (char *)buf;
 	smb_iconv(descriptor, &str, &ilen, &outbuf, &olen);
@@ -421,7 +432,7 @@ _PUBLIC_ codepoint_t next_codepoint_convenience(struct smb_iconv_convenience *ic
 		smb_iconv(descriptor,  &str, &ilen, &outbuf, &olen);
 		if (olen == 4) {
 			/* we didn't convert any bytes */
-			*size = 1;
+			*bytes_consumed = 1;
 			return INVALID_CODEPOINT;
 		}
 		olen = 4 - olen;
@@ -429,7 +440,7 @@ _PUBLIC_ codepoint_t next_codepoint_convenience(struct smb_iconv_convenience *ic
 		olen = 2 - olen;
 	}
 
-	*size = ilen_orig - ilen;
+	*bytes_consumed = ilen_orig - ilen;
 
 	if (olen == 2) {
 		return (codepoint_t)SVAL(buf, 0);
@@ -446,6 +457,21 @@ _PUBLIC_ codepoint_t next_codepoint_convenience(struct smb_iconv_convenience *ic
 }
 
 /*
+  return the unicode codepoint for the next multi-byte CH_UNIX character
+  in the string
+
+  also return the number of bytes consumed (which tells the caller
+  how many bytes to skip to get to the next CH_UNIX character)
+
+  return INVALID_CODEPOINT if the next character cannot be converted
+*/
+_PUBLIC_ codepoint_t next_codepoint_convenience(struct smb_iconv_convenience *ic,
+				    const char *str, size_t *size)
+{
+	return next_codepoint_convenience_ext(ic, str, CH_UNIX, size);
+}
+
+/*
   push a single codepoint into a CH_UNIX string the target string must
   be able to hold the full character, which is guaranteed if it is at
   least 5 bytes in size. The caller may pass less than 5 bytes if they
diff --git a/lib/util/charset/charset.h b/lib/util/charset/charset.h
index bd08f7e..92ea730 100644
--- a/lib/util/charset/charset.h
+++ b/lib/util/charset/charset.h
@@ -39,11 +39,6 @@ typedef enum {CH_UTF16LE=0, CH_UTF16=0, CH_UNIX, CH_DISPLAY, CH_DOS, CH_UTF8, CH
 
 typedef uint16_t smb_ucs2_t;
 
-/*
- * SMB UCS2 (16-bit unicode) internal type.
- * smb_ucs2_t is *always* in little endian format.
- */
-
 #ifdef WORDS_BIGENDIAN
 #define UCS2_SHIFT 8
 #else
@@ -125,6 +120,9 @@ struct smb_iconv_convenience;
 #define strupper(s) strupper_m(s)
 
 char *strchr_m(const char *s, char c);
+size_t strlen_m_ext(const char *s, charset_t src_charset, charset_t dst_charset);
+size_t strlen_m_ext_term(const char *s, charset_t src_charset,
+			 charset_t dst_charset);
 size_t strlen_m_term(const char *s);
 size_t strlen_m_term_null(const char *s);
 size_t strlen_m(const char *s);
@@ -173,10 +171,15 @@ ssize_t iconv_talloc(TALLOC_CTX *mem_ctx,
 
 extern struct smb_iconv_convenience *global_iconv_convenience;
 
+codepoint_t next_codepoint_ext(const char *str, charset_t src_charset,
+			       size_t *size);
 codepoint_t next_codepoint(const char *str, size_t *size);
 ssize_t push_codepoint(char *str, codepoint_t c);
 
 /* codepoints */
+codepoint_t next_codepoint_convenience_ext(struct smb_iconv_convenience *ic,
+			    const char *str, charset_t src_charset,
+			    size_t *size);
 codepoint_t next_codepoint_convenience(struct smb_iconv_convenience *ic, 
 			    const char *str, size_t *size);
 ssize_t push_codepoint_convenience(struct smb_iconv_convenience *ic, 
diff --git a/lib/util/charset/util_unistr.c b/lib/util/charset/util_unistr.c
index 520ce05..4105474 100644
--- a/lib/util/charset/util_unistr.c
+++ b/lib/util/charset/util_unistr.c
@@ -249,11 +249,12 @@ _PUBLIC_ char *alpha_strcpy(char *dest, const char *src, const char *other_safe_
 }
 
 /**
- Count the number of UCS2 characters in a string. Normally this will
- be the same as the number of bytes in a string for single byte strings,
- but will be different for multibyte.
-**/
-_PUBLIC_ size_t strlen_m(const char *s)
+ * Calculate the number of units (8 or 16-bit, depending on the
+ * destination charset), that would be needed to convert the input
+ * string which is expected to be in in src_charset encoding to the
+ * destination charset (which should be a unicode charset).
+ */
+_PUBLIC_ size_t strlen_m_ext(const char *s, charset_t src_charset, charset_t dst_charset)
 {
 	size_t count = 0;
 	struct smb_iconv_convenience *ic = get_iconv_convenience();
@@ -273,18 +274,68 @@ _PUBLIC_ size_t strlen_m(const char *s)
 
 	while (*s) {
 		size_t c_size;
-		codepoint_t c = next_codepoint_convenience(ic, s, &c_size);
-		if (c < 0x10000) {
+		codepoint_t c = next_codepoint_convenience_ext(ic, s, src_charset, &c_size);
+		s += c_size;
+
+		switch (dst_charset) {
+		case CH_UTF16LE:
+		case CH_UTF16BE:
+		case CH_UTF16MUNGED:
+			if (c < 0x10000) {
+				count += 1;
+			} else {
+				count += 2;
+			}
+			break;
+		case CH_UTF8:
+			/*
+			 * this only checks ranges, and does not
+			 * check for invalid codepoints
+			 */
+			if (c < 0x80) {
+				count += 1;
+			} else if (c < 0x800) {
+				count += 2;
+			} else if (c < 0x1000) {
+				count += 3;
+			} else {
+				count += 4;
+			}
+			break;
+		default:
+			/*
+			 * non-unicode encoding:
+			 * assume that each codepoint fits into
+			 * one unit in the destination encoding.
+			 */
 			count += 1;
-		} else {
-			count += 2;
 		}
-		s += c_size;
 	}
 
 	return count;
 }
 
+_PUBLIC_ size_t strlen_m_ext_term(const char *s, const charset_t src_charset,
+				  const charset_t dst_charset)
+{
+	if (!s) {
+		return 0;
+	}
+	return strlen_m_ext(s, src_charset, dst_charset) + 1;
+}
+
+/**
+ * Calculate the number of 16-bit units that would be needed to convert
+ * the input string which is expected to be in CH_UNIX encoding to UTF16.
+ *
+ * This will be the same as the number of bytes in a string for single
+ * byte strings, but will be different for multibyte.
+ */
+_PUBLIC_ size_t strlen_m(const char *s)
+{
+	return strlen_m_ext(s, CH_UNIX, CH_UTF16LE);
+}
+
 /**
    Work out the number of multibyte chars in a string, including the NULL
    terminator.
@@ -992,6 +1043,12 @@ _PUBLIC_ bool convert_string_talloc(TALLOC_CTX *ctx,
 											 allow_badcharcnv);
 }
 
+_PUBLIC_ codepoint_t next_codepoint_ext(const char *str, charset_t src_charset,
+					size_t *size)
+{
+	return next_codepoint_convenience_ext(get_iconv_convenience(), str,
+					      src_charset, size);
+}
 
 _PUBLIC_ codepoint_t next_codepoint(const char *str, size_t *size)
 {
diff --git a/librpc/ndr/ndr_string.c b/librpc/ndr/ndr_string.c
index 2e04633..cc849bf 100644
--- a/librpc/ndr/ndr_string.c
+++ b/librpc/ndr/ndr_string.c
@@ -719,11 +719,11 @@ _PUBLIC_ uint32_t ndr_charset_length(const void *var, charset_t chset)
 	case CH_UTF16LE:
 	case CH_UTF16BE:
 	case CH_UTF16MUNGED:
-		return strlen_m_term((const char *)var);
+	case CH_UTF8:
+		return strlen_m_ext_term((const char *)var, CH_UNIX, chset);
 	case CH_DISPLAY:
 	case CH_DOS:
 	case CH_UNIX:
-	case CH_UTF8:
 		return strlen((const char *)var)+1;
 	}
 
diff --git a/source3/include/proto.h b/source3/include/proto.h
index d3d4b20..58abc58 100644
--- a/source3/include/proto.h
+++ b/source3/include/proto.h
@@ -468,6 +468,8 @@ size_t pull_string_talloc_fn(const char *function,
 			size_t src_len,
 			int flags);
 size_t align_string(const void *base_ptr, const char *p, int flags);
+codepoint_t next_codepoint_ext(const char *str, charset_t src_charset,
+			       size_t *bytes_consumed);
 codepoint_t next_codepoint(const char *str, size_t *size);
 
 /* The following definitions come from lib/clobber.c  */
@@ -1586,6 +1588,10 @@ char *strnrchr_m(const char *s, char c, unsigned int n);
 char *strstr_m(const char *src, const char *findstr);
 void strlower_m(char *s);
 void strupper_m(char *s);
+size_t strlen_m_ext(const char *s, const charset_t src_charset,
+		    const charset_t dst_charset);
+size_t strlen_m_ext_term(const char *s, const charset_t src_charset,
+			 const charset_t dst_charset);
 size_t strlen_m(const char *s);
 size_t strlen_m_term(const char *s);
 size_t strlen_m_term_null(const char *s);
diff --git a/source3/lib/charcnv.c b/source3/lib/charcnv.c
index 9ac9930..3b6dfc5 100644
--- a/source3/lib/charcnv.c
+++ b/source3/lib/charcnv.c
@@ -1793,17 +1793,23 @@ size_t align_string(const void *base_ptr, const char *p, int flags)
 	return 0;
 }
 
-/*
-  Return the unicode codepoint for the next multi-byte CH_UNIX character
-  in the string. The unicode codepoint (codepoint_t) is an unsinged 32 bit value.
-
-  Also return the number of bytes consumed (which tells the caller
-  how many bytes to skip to get to the next CH_UNIX character).
-
-  Return INVALID_CODEPOINT if the next character cannot be converted.
-*/
+/**
+ * Return the unicode codepoint for the next character in the input
+ * string in the given src_charset.
+ * The unicode codepoint (codepoint_t) is an unsinged 32 bit value.
+ *
+ * Also return the number of bytes consumed (which tells the caller
+ * how many bytes to skip to get to the next src_charset-character).
+ *
+ * This is implemented (in the non-ascii-case) by first converting the
+ * next character in the input string to UTF16_LE and then calculating
+ * the unicode codepoint from that.
+ *
+ * Return INVALID_CODEPOINT if the next character cannot be converted.
+ */
 
-codepoint_t next_codepoint(const char *str, size_t *size)
+codepoint_t next_codepoint_ext(const char *str, charset_t src_charset,
+			       size_t *bytes_consumed)
 {
 	/* It cannot occupy more than 4 bytes in UTF16 format */
 	uint8_t buf[4];
@@ -1813,41 +1819,46 @@ codepoint_t next_codepoint(const char *str, size_t *size)
 	size_t olen;
 	char *outbuf;
 
+	/* fastpath if the character is ASCII */
 	if ((str[0] & 0x80) == 0) {
-		*size = 1;
+		*bytes_consumed = 1;
 		return (codepoint_t)str[0];
 	}
 
-	/* We assume that no multi-byte character can take
-	   more than 5 bytes. This is OK as we only
-	   support codepoints up to 1M */
+	/*
+	 * We assume that no multi-byte character can take more than
+	 * 5 bytes. This is OK as we only support codepoints up to 1M (U+100000)
+	 */
 
 	ilen_orig = strnlen(str, 5);
 	ilen = ilen_orig;
 
-        lazy_initialize_conv();
+	lazy_initialize_conv();
 
-        descriptor = conv_handles[CH_UNIX][CH_UTF16LE];
+	descriptor = conv_handles[src_charset][CH_UTF16LE];
 	if (descriptor == (smb_iconv_t)-1 || descriptor == (smb_iconv_t)0) {
-		*size = 1;
+		*bytes_consumed = 1;
 		return INVALID_CODEPOINT;
 	}
 
-	/* This looks a little strange, but it is needed to cope
-	   with codepoints above 64k which are encoded as per RFC2781. */
+	/*
+	 * This looks a little strange, but it is needed to cope
+	 * with codepoints above 64k (U+10000) which are encoded as per RFC2781.
+	 */
 	olen = 2;
 	outbuf = (char *)buf;
 	smb_iconv(descriptor, &str, &ilen, &outbuf, &olen);
 	if (olen == 2) {
-		/* We failed to convert to a 2 byte character.
-		   See if we can convert to a 4 UTF16-LE byte char encoding.
-		*/
+		/*
+		 * We failed to convert to a 2 byte character.
+		 * See if we can convert to a 4 UTF16-LE byte char encoding.
+		 */
 		olen = 4;
 		outbuf = (char *)buf;
 		smb_iconv(descriptor,  &str, &ilen, &outbuf, &olen);
 		if (olen == 4) {
 			/* We didn't convert any bytes */
-			*size = 1;
+			*bytes_consumed = 1;
 			return INVALID_CODEPOINT;
 		}
 		olen = 4 - olen;
@@ -1855,16 +1866,17 @@ codepoint_t next_codepoint(const char *str, size_t *size)
 		olen = 2 - olen;
 	}
 
-	*size = ilen_orig - ilen;
+	*bytes_consumed = ilen_orig - ilen;
 
 	if (olen == 2) {
 		/* 2 byte, UTF16-LE encoded value. */
 		return (codepoint_t)SVAL(buf, 0);
 	}
 	if (olen == 4) {
-		/* Decode a 4 byte UTF16-LE character manually.
-		   See RFC2871 for the encoding machanism.
-		*/
+		/*
+		 * Decode a 4 byte UTF16-LE character manually.
+		 * See RFC2871 for the encoding machanism.
+		 */
 		codepoint_t w1 = SVAL(buf,0) & ~0xD800;
 		codepoint_t w2 = SVAL(buf,2) & ~0xDC00;
 
@@ -1877,6 +1889,21 @@ codepoint_t next_codepoint(const char *str, size_t *size)
 }
 
 /*
+  Return the unicode codepoint for the next multi-byte CH_UNIX character
+  in the string. The unicode codepoint (codepoint_t) is an unsinged 32 bit value.
+
+  Also return the number of bytes consumed (which tells the caller
+  how many bytes to skip to get to the next CH_UNIX character).
+
+  Return INVALID_CODEPOINT if the next character cannot be converted.
+*/
+
+codepoint_t next_codepoint(const char *str, size_t *size)
+{
+	return next_codepoint_ext(str, CH_UNIX, size);
+}
+
+/*
   push a single codepoint into a CH_UNIX string the target string must
   be able to hold the full character, which is guaranteed if it is at
   least 5 bytes in size. The caller may pass less than 5 bytes if they
diff --git a/source3/lib/util_str.c b/source3/lib/util_str.c
index 449b5d1..508050d 100644
--- a/source3/lib/util_str.c
+++ b/source3/lib/util_str.c
@@ -1454,12 +1454,14 @@ void strupper_m(char *s)
 }
 
 /**
- Count the number of UCS2 characters in a string. Normally this will
- be the same as the number of bytes in a string for single byte strings,
- but will be different for multibyte.
-**/
+ * Calculate the number of units (8 or 16-bit, depending on the
+ * destination charset), that would be needed to convert the input
+ * string which is expected to be in in src_charset encoding to the
+ * destination charset (which should be a unicode charset).
+ */
 
-size_t strlen_m(const char *s)
+size_t strlen_m_ext(const char *s, const charset_t src_charset,
+		    const charset_t dst_charset)
 {
 	size_t count = 0;
 
@@ -1478,20 +1480,71 @@ size_t strlen_m(const char *s)
 
 	while (*s) {
 		size_t c_size;
-		codepoint_t c = next_codepoint(s, &c_size);
-		if (c < 0x10000) {
-			/* Unicode char fits into 16 bits. */
+		codepoint_t c = next_codepoint_ext(s, src_charset, &c_size);
+		s += c_size;
+
+		switch (dst_charset) {
+		case CH_UTF16LE:
+		case CH_UTF16BE:
+		case CH_UTF16MUNGED:
+			if (c < 0x10000) {
+				/* Unicode char fits into 16 bits. */
+				count += 1;
+			} else {
+				/* Double-width unicode char - 32 bits. */
+				count += 2;
+			}
+			break;
+		case CH_UTF8:
+			/*
+			 * this only checks ranges, and does not


-- 
Samba Shared Repository

