utf8 vs ucs2

Mon May 21 23:13:28 GMT 2001

Sensible readers will have skipped the little flamewar between Kenichi
and me, so I thought I'd start a separate thread to discuss the plan
that I am currently working on to properly support ucs2. 

It has become pretty obvious over the last couple of years that we
need to add proper support for multi-byte languages and ucs2 in
Samba. This is not just because Samba is currently quite broken for
multi-byte languages; it's also because more and more parts of
Microsoft's implementation force ucs2 even when the server negotiates
ascii, so we have to handle it even for English.

The big question is whether we convert to utf8 internally or
ucs2. I'll ignore 4 byte unicode for reasons discussed previously.

The advantage of ucs2 is that it is what MS use internally, so we have
potentially zero conversions to do from the wire (neglecting the parts
of the protocol that are ascii only). The disadvantage of ucs2 is that
it would break just about every function in Samba. Just try changing
char* to uint16* in Samba and see how far you get trying to compile
it. I have looked at doing this conversion and my conclusion is that
trying a direct conversion to ucs2 would be a massive job that I am
pretty sure would never get completed. We need a way to get there from
where we are now while keeping the code running at all the
intermediate steps. That's where the cunning plan comes in.

So, here is the plan:

Basics: We will have 4 string formats. They are called "wire",
"internal", "unicode (ucs2)" and "os". The wire format is determined
on a per-packet basis, so the parser needs to know what it is. Internal
format will initially be utf8, but read on for the long term
plan. ucs2 is as used by MS in the main SMB protocol (i.e. Intel byte
order). "os" format is whatever the OS uses for the filesystem.

Step 1: change all parsing code to use a single set of functions to
convert from wire string format to internal string format.

Step 2: change all vfs functions to accept internal format and convert
to "os" format before passing to the OS.

Step 3: use iconv for string conversion in the wire converter and vfs.

Step 4: Remove all unix_to_dos() and dos_to_unix() calls. That also
means we can remove all our existing codepage code and all of the
codepages. We need to keep our ucs2 flags table (so we can do strupper
and strlower) but the rest can go.

Step 5: Write utf8<->ucs2 conversion functions (initially using iconv)

Step 6: In functions that depend on character size (mostly wildcard
code) convert to ucs2 on entry to the function and convert back to
utf8 on exit. This will be *slow*.

Step 7: Incrementally convert more functions to use ucs2 internally,
with ucs2<->utf8 conversion on entry and exit of each function.

Step 8: When two functions that have been converted call each other
they can pass ucs2 directly, bypassing the conversion.

Step 9: Initially "islands" of ucs2 appear in the code (first in the
wildcard code), then these islands spread. When they cover most of the
code, we change the internal format to ucs2 and instead do
ucs2<->utf8 conversion only on those functions that are not yet
converted.

Step 10: We are now completely ucs2 converted. Party.
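
To make Steps 5 and 6 a bit more concrete, here is a rough sketch of
what a converted "island" might look like. None of these names exist
in Samba; utf8_to_ucs2(), ucs2_match() and wildcard_match_utf8() are
made up for illustration. It assumes an iconv that knows "UTF-8" and
"UCS-2LE", and a little-endian host so the ucs2 units can be compared
directly. The matcher is a toy, not the real wildcard code.

```c
#include <iconv.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical Step 5 helper: convert a NUL-terminated utf8 string to
 * a newly allocated, zero-terminated ucs2 (little-endian) string using
 * iconv. Returns NULL on failure; the caller frees the result. */
static uint16_t *utf8_to_ucs2(const char *src)
{
    size_t in_left = strlen(src);
    /* Each utf8 byte yields at most one ucs2 unit, plus a terminator. */
    size_t out_size = (in_left + 1) * sizeof(uint16_t);
    size_t out_left = out_size - sizeof(uint16_t);
    uint16_t *dest = calloc(1, out_size);
    char *in = (char *)src;
    char *out = (char *)dest;
    iconv_t cd;

    if (dest == NULL) return NULL;
    cd = iconv_open("UCS-2LE", "UTF-8");
    if (cd == (iconv_t)-1) { free(dest); return NULL; }
    if (iconv(cd, &in, &in_left, &out, &out_left) == (size_t)-1) {
        iconv_close(cd);
        free(dest);
        return NULL;
    }
    iconv_close(cd);
    return dest;
}

/* Toy matcher working on ucs2 units: '?' matches exactly one
 * character and '*' matches any run of characters. */
static int ucs2_match(const uint16_t *pat, const uint16_t *name)
{
    if (*pat == 0) return *name == 0;
    if (*pat == '*')
        return ucs2_match(pat + 1, name) ||
               (*name != 0 && ucs2_match(pat, name + 1));
    if (*name == 0) return 0;
    if (*pat != '?' && *pat != *name) return 0;
    return ucs2_match(pat + 1, name + 1);
}

/* Step 6 style wrapper: callers still pass utf8, the conversion
 * happens on entry and the temporary ucs2 copies are freed on exit. */
int wildcard_match_utf8(const char *pattern, const char *name)
{
    uint16_t *p = utf8_to_ucs2(pattern);
    uint16_t *n = utf8_to_ucs2(name);
    int ret = (p != NULL && n != NULL) ? ucs2_match(p, n) : 0;

    free(p);
    free(n);
    return ret;
}
```

The point of going via ucs2 is that '?' then consumes one character
rather than one byte, so "caf?" matches a four character name even
when the last character takes two bytes in utf8. The two allocations
and conversions per call are exactly the cost Step 6 warns about, and
what Step 8 removes when one converted function calls another.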

So what has been done? I have done most of Step 1 in head already (see
the srvstr_*() functions). As a result, smbd in the head branch now
negotiates and correctly talks ucs2 on the wire. A nice side effect of
this change is that Samba supports long share names with NT (that
problem was caused by NT not handling long share names with ascii
servers). The srvstr_*() functions still need to be made to use iconv,
and I need to change lots of strcpy() calls in the pure ascii code to
use srvstr_push_ascii() and srvstr_pull_ascii() in preparation for
dropping our current codepage support.

I have also completely converted the cli_*() client side SMB code to
use clistr_*() functions, so smbclient (and all our other utilities)
can now talk ucs2 to servers. That change was included in the 2.2.x
code, so if you use smbclient in 2.2.1 to connect to NT then it will
be talking unicode.

Iverg has been looking at steps 2 and 3. The rest need doing when the
time comes.

The big disadvantage of this plan is that until step 9 we will be
making smbd *much* slower. Luckily this can be offset by doing some of
step 8 on timing critical paths. We gain back all our speed on step 9.

Step 4 is also nasty, but necessary. It will initially break
everything but English. To make this work I plan on inventing a new
"nasty" string format which will be deliberately broken for English,
and add that to iconv. Then for testing we can use that format
internally. That way we can test multi-byte functionality without
learning a multi-byte language.
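
The "nasty" format is not specified anywhere yet, so the following is
only one guess at what such an encoding could look like. It spreads
every character over two bytes with the high bit set, so any code
that assumes one byte per character, or that searches for plain ascii
bytes, breaks straight away even on English names. The function names
are invented, and hooking this into iconv is not shown.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "nasty" encoding: each input byte becomes two bytes,
 * one per nibble, both with the high bit set. No output byte is NUL
 * or plain ascii, so the result is still a valid C string.
 * Returns the encoded length, or (size_t)-1 if dest is too small. */
size_t nasty_encode(const char *ascii, char *dest, size_t dest_len)
{
    size_t n = strlen(ascii);
    size_t i;

    if (dest_len < 2 * n + 1) return (size_t)-1;
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)ascii[i];
        dest[2 * i]     = (char)(0x80 | (c >> 4));
        dest[2 * i + 1] = (char)(0x80 | (c & 0x0F));
    }
    dest[2 * n] = '\0';
    return 2 * n;
}

/* Inverse of nasty_encode(). Returns the decoded length, or
 * (size_t)-1 on an odd-length input or a too-small dest. */
size_t nasty_decode(const char *nasty, char *dest, size_t dest_len)
{
    size_t n = strlen(nasty);
    size_t i;

    if (n % 2 != 0 || dest_len < n / 2 + 1) return (size_t)-1;
    for (i = 0; i < n / 2; i++) {
        unsigned char hi = (unsigned char)nasty[2 * i];
        unsigned char lo = (unsigned char)nasty[2 * i + 1];
        dest[i] = (char)(((hi & 0x0F) << 4) | (lo & 0x0F));
    }
    dest[n / 2] = '\0';
    return n / 2;
}
```

With something like this as the internal format, a stray strlen() or
strchr(name, '.') on an internal string gives an obviously wrong
answer for a name like "abc", which is the kind of breakage we want
the test runs to expose.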

The big advantage of this plan is that the only code that needs to
know about client string formats and the absolute mess that SMB makes
of them is the parsing code, and that can just call srvstr_*() with
the right flags to get the work done. The "logic" code in smbd (which
is most of the code) only has to deal with the internal formats. We no
longer have the problem of "I wonder what string format this char* is
in?" that we have now.

Up to now we have handled internationalisation by adding dos_to_unix()
and unix_to_dos() calls wherever we found that something was
broken. That is what led to our major string handling problem, as we
mixed the string format handling into the main smb logic. You can also
never get it right that way, because with SMB the format of a string
is determined by bits in the packet header, which you don't have
access to when you are deep in some utility function. That's why
the decision must be moved to the parsing code.
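
As an illustration of keeping that decision in the parser, here is a
small sketch of a pull routine that looks at the Flags2 word from the
SMB header and always hands utf8 to the caller. wire_pull() is a
made-up name and is not the real srvstr_pull() interface. The sketch
skips alignment handling, converts by hand instead of using iconv, and
simply rejects non-ascii bytes in the ascii case, since those would
need a codepage conversion.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FLAGS2_UNICODE_STRINGS 0x8000  /* bit in the SMB header Flags2 */

/* Hypothetical parser-side helper: pull a terminated wire string into
 * a utf8 internal buffer. Returns the number of bytes written, not
 * counting the terminator, or -1 on error. */
int wire_pull(uint16_t flags2, const uint8_t *src, size_t src_len,
              char *dest, size_t dest_len)
{
    size_t i, o = 0;

    if (dest_len == 0) return -1;

    if (!(flags2 & FLAGS2_UNICODE_STRINGS)) {
        /* ascii on the wire: copy bytes up to the terminator */
        for (i = 0; i < src_len && src[i] != 0; i++) {
            if (src[i] >= 0x80) return -1;  /* needs a codepage */
            if (o + 1 >= dest_len) return -1;
            dest[o++] = (char)src[i];
        }
    } else {
        /* little-endian ucs2 on the wire: emit each unit as utf8 */
        for (i = 0; i + 1 < src_len; i += 2) {
            uint16_t c = (uint16_t)(src[i] | (src[i + 1] << 8));
            if (c == 0) break;
            if (c < 0x80) {
                if (o + 1 >= dest_len) return -1;
                dest[o++] = (char)c;
            } else if (c < 0x800) {
                if (o + 2 >= dest_len) return -1;
                dest[o++] = (char)(0xC0 | (c >> 6));
                dest[o++] = (char)(0x80 | (c & 0x3F));
            } else {
                if (o + 3 >= dest_len) return -1;
                dest[o++] = (char)(0xE0 | (c >> 12));
                dest[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
                dest[o++] = (char)(0x80 | (c & 0x3F));
            }
        }
    }
    dest[o] = '\0';
    return (int)o;
}
```

Everything after this call sees the same internal format whichever
way the client encoded the packet, so no utility function further
down ever needs to see the header.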

Another interesting area is string handling in the RPC code. That will
require a set of functions similar to srvstr_*() but not quite the
same. In particular, the RPC code needs to cope with ucs2 in either
big or little endian format (it is negotiated as part of RPC, but is
not negotiated in the main SMB code). So we will probably create a set
of rpcstr_*() functions. Luckily the RPC code already has similar
functions in all the right places (it has always needed them as it is
always ucs2).
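
The endian part of that is small. A sketch of the kind of helper an
rpcstr-style function might sit on top of is below. The name
rpc_pull_ucs2() and the big_endian argument are invented here; how
the negotiated byte order actually reaches the string code is left
open.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read n ucs2 units from an RPC buffer into host
 * order uint16_t values, honouring the byte order negotiated for the
 * call. Works the same on big and little endian hosts because it
 * assembles each unit from individual bytes. */
void rpc_pull_ucs2(const uint8_t *src, size_t n, int big_endian,
                   uint16_t *dest)
{
    size_t i;

    for (i = 0; i < n; i++) {
        uint8_t b0 = src[2 * i];
        uint8_t b1 = src[2 * i + 1];
        dest[i] = big_endian ? (uint16_t)((b0 << 8) | b1)
                             : (uint16_t)((b1 << 8) | b0);
    }
}
```

The main SMB string code would only ever need the little-endian
branch, which is why the two sets of functions end up similar but not
quite the same.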

Cheers, Tridge