utf8 vs ucs2

Ryo Kawahara rkawa at lbe.co.jp
Tue May 22 11:24:27 GMT 2001


Hello everyone.

From: Andrew Tridgell <tridge at samba.org>
Subject: Re: utf8 vs ucs2
Date: Mon, 21 May 2001 17:35:24 -0700 (PDT)

> The only sane way I can see to do strupper/strlower/strcasecmp and the
> wildcard matching is directly in ucs2. Unfortunately a lot of SMB
> calls rely on one of these functions, so we will be constantly
> converting utf8<->ucs2. That is acceptable for an intermediate step of
> a long term plan, but I would hate to see us doing that for ever. It
> would just be too slow.

I agree with using UCS2 as Samba's internal character encoding, although
I'm not an i18n/M17N specialist.
If there is any plan (even in the far future) to rewrite string
manipulation to use UCS2, I recommend encapsulating the string
manipulation code in "member functions" (i.e. methods, as in Java etc.),
like this:

typedef struct sambastring_tag
{
	uint16* buffer;	/* string data in the internal encoding (UCS2) */
	int length;
	...
} sambastring;

/* conversion between the internal encoding and the OS / wire encodings */
void sambastring_get_os_encode(sambastring* obj, char* out);
void sambastring_put_os_encode(sambastring* obj, char* in);
void sambastring_get_wire_encode(sambastring* obj, uint16* out);
void sambastring_put_wire_encode(sambastring* obj, uint16* in);

/* string operations that never expose the internal encoding */
int sambastring_cmp(sambastring* obj, sambastring* compared);
int sambastring_cpy(sambastring* obj, sambastring* to);
int sambastring_upper(sambastring* obj);
int sambastring_lower(sambastring* obj);
...

(This is only a sketch that I came up with in my head.)
With this approach we can change the internal character encoding easily,
because all the relevant character manipulation is confined to those
member functions.
We can also cache some string properties (such as the character count)
in the structure.
IMO the Samba team already uses this method for the loadparm struct,
and it seems to be robust.
The disadvantage is that we would have to rewrite a lot of code.
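
To make the idea a bit more concrete, here is a minimal sketch of how a
few of those member functions might look. The bodies are my assumptions,
not existing Samba code: uint16_t from <stdint.h> stands in for uint16,
plain ASCII stands in for the "OS encoding", case mapping covers only
a-z, sambastring_free is a hypothetical helper, and
sambastring_put_os_encode returns an int (unlike the void prototype
above) so that allocation failure can be reported.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: the internal encoding (UCS2) is visible only here. */
typedef struct sambastring_tag {
    uint16_t *buffer;  /* UCS2 code units */
    int length;        /* cached character count */
} sambastring;

/* Fill obj from a NUL-terminated ASCII string (stand-in for the OS
 * encoding). Returns 0 on success, -1 on allocation failure. */
int sambastring_put_os_encode(sambastring *obj, const char *in)
{
    assert(obj != NULL && in != NULL);
    size_t n = strlen(in);
    uint16_t *buf = malloc((n ? n : 1) * sizeof *buf);
    if (buf == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)in[i];
    obj->buffer = buf;
    obj->length = (int)n;
    return 0;
}

/* Hypothetical helper: release the buffer owned by obj. */
void sambastring_free(sambastring *obj)
{
    assert(obj != NULL);
    free(obj->buffer);
    obj->buffer = NULL;
    obj->length = 0;
}

/* Upper-case in place. ASCII only; a real version needs Unicode tables. */
int sambastring_upper(sambastring *obj)
{
    assert(obj != NULL);
    for (int i = 0; i < obj->length; i++) {
        if (obj->buffer[i] >= 'a' && obj->buffer[i] <= 'z')
            obj->buffer[i] = (uint16_t)(obj->buffer[i] - ('a' - 'A'));
    }
    return 0;
}

/* Compare code unit by code unit; returns <0, 0 or >0 like strcmp. */
int sambastring_cmp(const sambastring *obj, const sambastring *compared)
{
    assert(obj != NULL && compared != NULL);
    int n = obj->length < compared->length ? obj->length : compared->length;
    for (int i = 0; i < n; i++) {
        if (obj->buffer[i] != compared->buffer[i])
            return obj->buffer[i] < compared->buffer[i] ? -1 : 1;
    }
    if (obj->length == compared->length)
        return 0;
    return obj->length < compared->length ? -1 : 1;
}
```

A caller only ever touches the sambastring_* functions, so switching
buffer to another encoding later would change these bodies but not the
callers.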

> Or maybe someone knows a clever plan for doing
> strupper/strlower/strcasecmp directly in utf8 quickly? Have any other
> projects managed that? If there is a way then that would change things
> dramatically. 

I have no idea about this, but if we use the structure above, we can
switch from UCS2 to UTF8 relatively easily once a good algorithm is found.

> > Another area that needs attention is localization - right now
> > none of the SAMBA client code is localized.
> 
> indeed, that does need doing, but I'd like to keep it as a completely
> separate issue. One of the problems we have had with addressing string
> handling in the past is that we have tried to solve all the problems
> at once, and that was too hard. I see string handling in Samba as
> having 3 problems:
> 
> 1) incorrect behaviour in SMB for multi-byte languages (eg. right now
>    our wildcard code is broken for multi-byte)
> 
> 2) dynamic string allocation rather than pstring/fstring
> 
> 3) localisation of messages

For 3), we can use a gettext-like library to localize output messages.
In the Japanese version of Samba, SWAT output is already displayed in
Japanese (we can change the language easily if a message catalog is prepared!).
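
As a rough illustration of what "gettext-like" means here, the sketch
below looks messages up in a catalog and falls back to the original
English text when no translation exists. It is not the real gettext API
and not the code used in the Japanese Samba; the names catalog_entry,
catalog_ja, catalog_select and tr are all hypothetical, and the two
catalog entries are just examples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical catalog entry: English message id -> translated text. */
struct catalog_entry {
    const char *msgid;
    const char *msgstr;
};

/* Example Japanese catalog (UTF-8 source encoding assumed). */
const struct catalog_entry catalog_ja[] = {
    { "Password", "パスワード" },
    { "Status",   "状態" },
};
#define CATALOG_JA_COUNT (sizeof catalog_ja / sizeof catalog_ja[0])

/* Currently selected catalog; none selected means "no translation". */
static const struct catalog_entry *active_catalog = NULL;
static size_t active_count = 0;

/* Switch the output language by selecting another catalog. */
void catalog_select(const struct catalog_entry *entries, size_t count)
{
    active_catalog = entries;
    active_count = count;
}

/* Return the translation of msgid, or msgid itself if none is found. */
const char *tr(const char *msgid)
{
    assert(msgid != NULL);
    for (size_t i = 0; i < active_count; i++) {
        if (strcmp(active_catalog[i].msgid, msgid) == 0)
            return active_catalog[i].msgstr;
    }
    return msgid;
}
```

Output code would then wrap each message as tr("Password"), and adding
a language only requires preparing another catalog.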

///////////////////////////////////////////////////////////////
// Ryo KAWAHARA
// website: http://www3.lbe.co.jp/~rkawa/
///////////////////////////////////////////////////////////////



