the big character set change

Wed Jul 4 09:18:21 GMT 2001

I've just committed the changes which attempt to bring some sanity to
the handling of character sets in Samba. Much of this work was done by
Igor Vergeichik <iverg at mail.ru> and I would like to thank Igor and his
company (ApplianceWare Inc) for their extremely useful contribution.

A few weeks back I sent a message which outlined the basic plan for
converting Samba to use utf-8 and eventually ucs-2 internally. Since
then I have changed the plans a bit, and my commit is based on the new
plans, so pay attention to the following if you plan on committing any
code into Samba in the near future!

In my previous posted plan I proposed that we should have 4 character
sets in Samba. They were called "internal", "ucs2", "dos" and
"unix". I proposed that "internal" should be utf-8 initially, and move
to ucs-2 at some future date. I have now realised that the split
between "internal" and "unix" is not really necessary and is very
error prone. I spent most of last week trying to get this split right
by erecting barriers between the internals of Samba and all OS calls
and found it was *very* difficult to get 100% right (think carefully
about all the *printf functions and especially their return values for
an example of what is hard).

I have now decided that we will have 3 character sets. They are called
"dos", "unix" and "ucs2". 

The "dos" character set is the one used by clients on the wire when
they are not using unicode for a string. This will also be used for
the encryption routines that work on 8-bit password fields. You can
control what character set this is using the "dos charset" option in
smb.conf. It defaults to ASCII and while it could in theory be
multi-byte there are many fixed-width 8.3 fields in the protocol that
will use this character set and a multi-byte character set will
probably break for those functions.

The "unix" character set is the one used by all internal strings in
Samba. It defaults to ASCII but can be a multi-byte character set. I
expect that some systems will set this to UTF8 and others will set it
to BIG5, SJIS or one of the European character sets. The only
restrictions on this character set are:
 * it must be null-terminated, which means that no character other
   than the terminator may contain a 0 byte.
 * it must be "C compatible", so that string constants in C don't need
   conversion to be used in this character set.
 * when you uppercase or lowercase a character in this set it must not
   become longer than the original character.

I checked and both UTF-8 and BIG5 do obey these restrictions, but SJIS
fails the 3rd restriction for a single ucs2 character (number 0x345,
the "COMBINING GREEK YPOGEGRAMMENI"). If you use "unix charset = SJIS"
then don't use that character in Samba :)
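
As a rough illustration of the third restriction, here is a minimal
sketch that checks it for UTF-8 only. The helper names utf8_len and
case_fits are made up for this example and are not part of Samba. The
only character data assumed is that 0x345 uppercases to 0x399 (GREEK
CAPITAL LETTER IOTA).

```c
#include <assert.h>

/* Bytes UTF-8 needs for one code point in the ucs2 range (hypothetical helper). */
static int utf8_len(unsigned int cp)
{
	assert(cp <= 0xFFFF);	/* ucs2 only covers 16-bit code points */
	if (cp < 0x80)
		return 1;
	if (cp < 0x800)
		return 2;
	return 3;
}

/* Returns 1 if the case-converted character is no longer than the original. */
static int case_fits(unsigned int orig, unsigned int cased)
{
	return utf8_len(cased) <= utf8_len(orig);
}
```

Both 0x345 and 0x399 need two bytes in UTF-8, so
case_fits(0x345, 0x399) returns 1. A similar check for another
character set would need that set's own encoding lengths.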

Inside Samba all input and output of strings to/from buffers which
will be transmitted on the wire should go via a pull_ or push_
function. See lib/charcnv.c for a full list of these functions and
look at the existing code for examples of their use. Make sure you are
careful to use the STR_TERMINATE flag when it is needed.
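
To give a feel for what a push_ style function has to do, here is a
simplified stand-in. It is NOT the code in lib/charcnv.c: the name
sketch_push_ucs2, the flag SKETCH_STR_TERMINATE and the argument order
are all invented for this example, and it assumes a 7-bit ASCII source
string and little-endian ucs2 on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_STR_TERMINATE 1	/* hypothetical flag, modelled on STR_TERMINATE */

/*
 * Hypothetical helper: copy a 7-bit ASCII string into a wire buffer as
 * little-endian ucs2. Returns the number of bytes written, or (size_t)-1
 * if the result does not fit or src contains a non-ASCII byte.
 */
static size_t sketch_push_ucs2(unsigned char *dest, size_t dest_len,
			       const char *src, int flags)
{
	size_t n, need, i;

	assert(dest != NULL && src != NULL);
	n = strlen(src);
	need = 2 * n + ((flags & SKETCH_STR_TERMINATE) ? 2 : 0);
	if (need > dest_len)
		return (size_t)-1;

	for (i = 0; i < n; i++) {
		unsigned char c = (unsigned char)src[i];
		if (c >= 0x80)
			return (size_t)-1;	/* this sketch handles ASCII only */
		dest[2 * i] = c;	/* low byte first */
		dest[2 * i + 1] = 0;
	}
	if (flags & SKETCH_STR_TERMINATE) {
		/* a ucs2 terminator is two zero bytes, not one */
		dest[2 * n] = 0;
		dest[2 * n + 1] = 0;
	}
	return need;
}
```

With the terminate flag set, "ab" takes 6 bytes on the wire rather
than 4, which is the kind of length difference that makes forgetting
the flag easy to get wrong.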

For Windows clients Samba will negotiate ucs2 on the wire, but will
convert strings to the "unix" character set for internal
storage. There are some places (mostly in the rpc code) that will
internally store strings in ucs-2 and this will probably become more
common over the next few months.

For functions that have to look at strings at the byte level Iverg and
I have been recoding them to first convert to ucs2 as this is *much*
easier to do byte level accesses on. See for example lib/ms_fnmatch.c
for how to do this sort of conversion. The problem with this approach
is that it is slow, but we will have to put up with that for now until
we have some accelerators in place, or special case code for 7 bit
environments.
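
A small sketch of why byte-level access is easier in ucs2. It is not
taken from lib/ms_fnmatch.c; the function and array names are invented
for this example. The one character fact it relies on is that the
Shift-JIS encoding of U+8868 is the two bytes 0x95 0x5C, whose second
byte has the same value as '\'.

```c
#include <assert.h>
#include <stddef.h>

/* Count bytes equal to ch in an 8-bit string. Not multi-byte safe. */
static size_t count_byte(const unsigned char *s, size_t len, unsigned char ch)
{
	size_t i, n = 0;

	assert(s != NULL);
	for (i = 0; i < len; i++)
		if (s[i] == ch)
			n++;
	return n;
}

/* Count 16-bit units equal to ch in a ucs2 string (host byte order). */
static size_t count_ucs2(const unsigned short *s, size_t len, unsigned short ch)
{
	size_t i, n = 0;

	assert(s != NULL);
	for (i = 0; i < len; i++)
		if (s[i] == ch)
			n++;
	return n;
}

/* The same single character in both forms (example data). */
static const unsigned char sjis_example[] = { 0x95, 0x5C };
static const unsigned short ucs2_example[] = { 0x8868 };
```

Scanning sjis_example for '\' reports one match even though the string
contains no backslash, while scanning ucs2_example reports none,
because every ucs2 unit is a whole character.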

There are lots of remaining issues that we have not dealt with yet, in
particular:

- name mangling is broken with multi-byte character sets
- there are still lots of places that do byte-level access to strings
  and are not multi-byte safe
- there are probably some places where dos strings leak into unix
  strings

I've been using a special character set called "weird" to find these
issues. "weird" is like 7-bit ascii but encodes some particular
characters in a very strange way. Right now I have it encoding "q" as
"^q^". I have also been experimenting with encoding characters as
things like "\\.ZZ.\\^" to try to break the code that looks for '\'
and '.' characters in strings without taking care with multi-byte
(these characters can occur as elements of a multi-byte string).  To
use "weird" just set "unix charset = weird" and watch the fireworks.
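
For anyone curious what such an encoding looks like in code, here is a
minimal sketch of the "q" to "^q^" mapping described above. It is an
illustration only, not the actual "weird" converter; the function name
sketch_weird_encode and its return convention are invented for this
example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: copy 7-bit text from src to dest, writing each
 * 'q' as the three bytes "^q^". Returns the encoded length (excluding
 * the terminating 0), or (size_t)-1 if dest is too small.
 */
static size_t sketch_weird_encode(char *dest, size_t dest_len, const char *src)
{
	size_t out = 0;

	assert(dest != NULL && src != NULL);
	for (; *src; src++) {
		size_t need = (*src == 'q') ? 3 : 1;

		/* leave room for the terminator as well */
		if (out + need + 1 > dest_len)
			return (size_t)-1;
		if (*src == 'q') {
			dest[out++] = '^';
			dest[out++] = 'q';
			dest[out++] = '^';
		} else {
			dest[out++] = *src;
		}
	}
	if (out + 1 > dest_len)
		return (size_t)-1;
	dest[out] = '\0';
	return out;
}
```

Encoding "quiz" this way gives "^q^uiz", so any code that assumes one
byte per character will see the string grow from 4 to 6 bytes.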

I fully expect that the head branch is now broken in lots of
ways, especially for multi-byte languages, but I think that we at least
now have an infrastructure where these problems are fixable. The patch
removed over ten thousand lines of code from Samba and you can't do
that without breaking something.

That's it for now. Hack on it, test it and see if it works for you.

Cheers, Tridge