interesting fact about StrCaseCmp

Michael B. Allen miallen at eskimo.com
Tue Feb 18 03:46:58 GMT 2003


On Tue, 18 Feb 2003 11:35:32 +1100
Martin Pool <mbp at samba.org> wrote:

> On 18 Feb 2003, Andrew Bartlett <abartlet at samba.org> wrote:
> 
> > Possibly only for long strings?  But then that is probably
> > micro-optimization.  
> 
> If we really cared about optimizing this function, then we would
> compare character-by-character rather than converting both strings to
> uppercase first.  This is a bit hard for some weird encodings I know,
> but it ought to be possible to do it in charcnv.c.

Actually you got me thinking and it's not all that hard. In fact I think
there are a lot of good optimizations you can make in this function. For
example, you only have to convert to wide characters if *both* strings
are at a multibyte sequence. If only one of the two bytes has the high
bit set, they cannot match even caselessly (ignoring rare cases such as
U+017F, long s, which uppercases to ASCII 'S'), so the plain byte
comparison will return a difference.

Here's some rough code. I didn't even try to compile this.

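#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

int
utf8casecmp(const char *str1, size_t sn1, const char *str2, size_t sn2)
{
    size_t n1, n2;
    wchar_t ucs1, ucs2;
    mbstate_t ps1, ps2;
    unsigned char uc1, uc2;

    memset(&ps1, 0, sizeof(ps1));
    memset(&ps2, 0, sizeof(ps2));
    while (sn1 > 0 && sn2 > 0) {
        if ((*str1 & 0x80) && (*str2 & 0x80)) {     /* both multibyte */
            /* size_t is unsigned, so test for the (size_t)-1 (invalid)
             * and (size_t)-2 (incomplete) return values explicitly */
            n1 = mbrtowc(&ucs1, str1, sn1, &ps1);
            n2 = mbrtowc(&ucs2, str2, sn2, &ps2);
            if (n1 == (size_t)-1 || n1 == (size_t)-2 ||
                    n2 == (size_t)-1 || n2 == (size_t)-2) {
                perror("mbrtowc");
                return -1;
            }
            /* only fold case when the code points differ as-is */
            if (ucs1 != ucs2 &&
                  (ucs1 = towupper(ucs1)) != (ucs2 = towupper(ucs2))) {
                return ucs1 < ucs2 ? -1 : 1;
            }
            sn1 -= n1; str1 += n1;
            sn2 -= n2; str2 += n2;
        } else {                          /* neither or one multibyte */
            /* cast first: passing a negative char to toupper is undefined */
            uc1 = toupper((unsigned char)*str1);
            uc2 = toupper((unsigned char)*str2);
            if (uc1 != uc2) {
                return uc1 < uc2 ? -1 : 1;
            } else if (uc1 == '\0') {
                return 0;
            }
            sn1--; str1++;
            sn2--; str2++;
        }
    }
    /* a length limit was reached with no difference found */
    return 0;
}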

Note that this assumes you're running in a UTF-8 locale, since mbrtowc
decodes according to the current locale. I don't know how you handle
locales. If you can't rely on that, you'll need to swap out the mbrtowc
calls. But I think the algorithm is sound.
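
If the locale can't be relied on, one way to swap out mbrtowc would be
a small hand-rolled decoder. What follows is only a minimal sketch: the
name utf8_decode and its return convention are made up for illustration
and are not taken from Samba or the C library. It decodes one UTF-8
sequence and rejects truncated, overlong and surrogate encodings.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a Samba or libc function): decode one UTF-8
 * sequence from s, reading at most n bytes, into *cp.  Returns the
 * number of bytes consumed, or 0 if the input is empty, truncated or
 * malformed.
 */
static size_t
utf8_decode(const char *s, size_t n, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t c, min;

    if (n == 0)
        return 0;
    if (p[0] < 0x80) {                      /* plain ASCII */
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {     /* 2-byte sequence */
        len = 2; c = p[0] & 0x1F; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {     /* 3-byte sequence */
        len = 3; c = p[0] & 0x0F; min = 0x800;
    } else if ((p[0] & 0xF8) == 0xF0) {     /* 4-byte sequence */
        len = 4; c = p[0] & 0x07; min = 0x10000;
    } else {
        return 0;           /* stray continuation byte or invalid lead */
    }
    if (len > n)
        return 0;                           /* truncated input */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    /* reject overlong forms, out-of-range values and surrogates */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}
```

The decoded value is a Unicode code point, so the case folding still
needs a mapping that understands code points. Passing it to towupper
only works where wchar_t holds Unicode values, which is an assumption
about the platform and not something the C standard guarantees.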

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived. 

