strchr_m() problem (Re: Integrate i18n SWAT)

Andrew Tridgell tridge at valinux.com
Wed Aug 15 11:50:49 GMT 2001


Monyo,

Sorry for the slow reply (again).

> It seems that strchr_m() does not support multibyte codepages.
> For example, under SJIS/EUC, a string consists of 1byte chars and
> 2byte chars. If you use iconv() against those multibyte codepage,
> before calling iconv(), you have to separate ASCII, 1byte chars except
> ASCII, 2byte chars from the string because the convert rule is
> different.

Here is the current function:

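char *strchr_m(const char *s, char c)
{
	wpstring ws;
	pstring s2;
	smb_ucs2_t *p;

	/* convert the whole string to ucs2 */
	push_ucs2(NULL, ws, s, sizeof(ws), STR_TERMINATE);
	/* find the target in the ucs2 string */
	p = strchr_wa(ws, c);
	if (!p) return NULL;
	/* truncate at the target, leaving only the characters before it */
	*p = 0;
	/* convert that prefix back; its byte length is the offset in s */
	pull_ucs2_pstring(s2, ws);
	return (char *)(s+strlen(s2));
}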

The string will indeed contain 1-byte and 2-byte chars before it is
converted to ucs2 by push_ucs2(), but after that the variable "ws"
will contain only 2-byte ucs2 chars.
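To illustrate why the search is done on the ucs2 form rather than on the
raw bytes, here is a small sketch of the classic Shift-JIS case: the
two-byte character 0x95 0x5C has a trail byte equal to the ASCII
backslash. The names sjis_sample and naive_offset are made up for this
example and are not Samba functions.

```c
#include <string.h>

/* Shift-JIS bytes: the two-byte character 0x95 0x5C, followed by a
   real ASCII backslash (0x5C). Three bytes, two characters. */
static const char sjis_sample[] = "\x95\x5c\\";

/* Hypothetical helper: byte offset of the first byte equal to c,
   or -1 if no byte matches. This is a plain byte-wise search. */
static long naive_offset(const char *s, char c)
{
	const char *p = strchr(s, c);
	return p ? (long)(p - s) : -1;
}
```

A byte-wise search for '\\' in sjis_sample stops at offset 1, in the
middle of the two-byte character, while the real backslash sits at
offset 2.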

Why do you need to separate the ASCII and 2-byte chars before calling
push_ucs2() (which just calls iconv)? The character set being
converted from must be ASCII compatible (i.e. all 7-bit chars are the
same as ASCII), and iconv will handle any 8-bit 1-byte chars. No?

The function strchr_wa() then looks for the target character in the
ucs2 string ws, which tells us how many characters into the string the
target is. The ucs2 string is truncated at that point, so that only the
characters before the target are left, and those are converted back to
the original character set. The length (in bytes) of the result tells
us where the target character is in the original string.
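The steps above can be sketched in a self-contained form. This is only
an illustration under stated assumptions, not the Samba code: toy_push()
and toy_pull() are hypothetical stand-ins for push_ucs2() and
pull_ucs2_pstring(), and they use a made-up Shift-JIS-like mapping
(lead byte and trail byte packed into one 16-bit unit) instead of real
ucs2 via iconv. The only property that matters here is that the
conversion round-trips exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_UNITS 256	/* longer inputs are silently truncated in this sketch */

/* Assumption: Shift-JIS-style lead byte ranges. */
static int is_lead(unsigned char b)
{
	return (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xfc);
}

/* Stand-in for push_ucs2(): one 16-bit unit per character.
   Returns the number of units written, excluding the terminator. */
static size_t toy_push(uint16_t *dst, size_t max, const char *src)
{
	const unsigned char *s = (const unsigned char *)src;
	size_t n = 0;

	while (*s && n + 1 < max) {
		if (is_lead(s[0]) && s[1]) {
			dst[n++] = (uint16_t)((s[0] << 8) | s[1]);
			s += 2;
		} else {
			dst[n++] = *s++;
		}
	}
	dst[n] = 0;
	return n;
}

/* Stand-in for pull_ucs2_pstring(): the inverse of toy_push().
   Returns the number of bytes written, excluding the NUL. */
static size_t toy_pull(char *dst, size_t max, const uint16_t *src)
{
	size_t n = 0;

	for (; *src && n + 2 < max; src++) {
		if (*src > 0xff)
			dst[n++] = (char)(*src >> 8);
		dst[n++] = (char)(*src & 0xff);
	}
	dst[n] = '\0';
	return n;
}

/* Same shape as strchr_m(): convert, search, truncate, convert back,
   and use the byte length of the prefix as the offset into s. */
static char *toy_strchr_m(const char *s, char c)
{
	uint16_t ws[MAX_UNITS];
	char s2[2 * MAX_UNITS];
	size_t i;

	assert(s != NULL);
	toy_push(ws, MAX_UNITS, s);
	for (i = 0; ws[i] != 0 && ws[i] != (unsigned char)c; i++)
		;
	if (ws[i] == 0)
		return NULL;	/* not found (also the result for c == 0) */
	ws[i] = 0;		/* keep only the characters before the target */
	return (char *)(s + toy_pull(s2, sizeof(s2), ws));
}
```

With the bytes 0x95 0x5C 0x5C 'x', toy_strchr_m(s, '\\') returns s + 2,
the real backslash, where a plain strchr() would return s + 1. The
approach depends on the prefix converting back to exactly the bytes it
came from; if the round trip is not exact, the computed offset is wrong.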

I *think* that the logic behind this function is correct, but there
may be a bug. Can you see a flaw in the logic, or can you give me a
specific example that fails?

btw, if this logic is indeed flawed then the whole basis of the new
character set system in the head branch could be at risk. I'm using
this kind of assumption as the basic mechanism for replacing all the
explicit knowledge of character sets that the old ja patch used.

I just hope this is a simple bug, and not a logic flaw :)

Cheers, Tridge



