interesting fact about StrCaseCmp

Tue Feb 18 03:46:58 GMT 2003

On Tue, 18 Feb 2003 11:35:32 +1100
Martin Pool <mbp at samba.org> wrote:

> On 18 Feb 2003, Andrew Bartlett <abartlet at samba.org> wrote:
> 
> > Possibly only for long strings?  But then that is probably
> > micro-optimization.  
> 
> If we really cared about optimizing this function, then we would
> compare character-by-character rather than converting both strings to
> uppercase first.  This is a bit hard for some weird encodings I know,
> but it ought to be possible to do it in charcnv.c.

Actually you got me thinking, and it's not all that hard. In fact I think
there are a lot of good optimizations you can make in this function. For
example, you only have to convert to wide characters if *both* strings
are at a multibyte sequence. If only one byte has the high bit set, the
two cannot possibly match, even caselessly, so the uc1 != uc2 test
returns right away.

Here's some rough code. I didn't even try to compile this.

#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

int
utf8casecmp(const char *str1, size_t sn1, const char *str2, size_t sn2)
{
    size_t n1, n2; 
    wchar_t ucs1, ucs2;
    mbstate_t ps1, ps2;
    unsigned char uc1, uc2;

    memset(&ps1, 0, sizeof(ps1));
    memset(&ps2, 0, sizeof(ps2));
    while (sn1 > 0 && sn2 > 0) {
        if ((*str1 & 0x80) && (*str2 & 0x80)) {     /* both multibyte */
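            /* mbrtowc reports errors as (size_t)-1 or (size_t)-2; size_t
             * is unsigned, so a "< 0" test would never fire */
            n1 = mbrtowc(&ucs1, str1, sn1, &ps1);
            n2 = mbrtowc(&ucs2, str2, sn2, &ps2);
            if (n1 == (size_t)-1 || n1 == (size_t)-2 ||
                    n2 == (size_t)-1 || n2 == (size_t)-2) {
                perror("mbrtowc");
                return -1;
            }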
            if (ucs1 != ucs2 &&
                  (ucs1 = towupper(ucs1)) != (ucs2 = towupper(ucs2))) {
                return ucs1 < ucs2 ? -1 : 1; 
            }       
            sn1 -= n1; str1 += n1;
            sn2 -= n2; str2 += n2;
        } else {                          /* neither or one multibyte */
            /* cast first: toupper is undefined for negative char values */
            uc1 = toupper((unsigned char)*str1);
            uc2 = toupper((unsigned char)*str2);
            if (uc1 != uc2) {
                return uc1 < uc2 ? -1 : 1; 
            } else if (uc1 == '\0') { 
                return 0;
            }       
            sn1--; str1++; 
            sn2--; str2++; 
        }       
    }
    return 0;
}

Note this assumes you're running in a UTF-8 locale, since mbrtowc
decodes according to the current LC_CTYPE. I don't know how you handle
locales. Otherwise you'll need to switch out the mbrtowc calls. But I
think the algorithm is sound.
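
If the locale can't be relied on, one way to switch out mbrtowc might
look like the sketch below. utf8_decode is a hypothetical helper (not
an existing Samba or libc function) that decodes a single UTF-8
sequence by hand, so it does not depend on LC_CTYPE at all. It returns
0 on bad input rather than (size_t)-1, which is an assumption of this
sketch and not mbrtowc's convention.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most n
 * bytes) into *cp. Returns the number of bytes consumed, or 0 if the
 * sequence is invalid or truncated. */
static size_t
utf8_decode(const char *s, size_t n, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t c, min;

    if (n == 0)
        return 0;
    if (p[0] < 0x80) {                      /* plain ASCII */
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {     /* 2-byte sequence */
        len = 2; c = p[0] & 0x1F; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {     /* 3-byte sequence */
        len = 3; c = p[0] & 0x0F; min = 0x800;
    } else if ((p[0] & 0xF8) == 0xF0) {     /* 4-byte sequence */
        len = 4; c = p[0] & 0x07; min = 0x10000;
    } else {
        return 0;           /* stray continuation or invalid lead byte */
    }
    if (n < len)
        return 0;                           /* truncated input */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    /* reject overlong forms, surrogates and values past U+10FFFF */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}
```

This only replaces the decoding step. The case folding itself would
still go through towupper or some table of your own, and towupper
depends on the locale too.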

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived.