CH_DISPLAY and gettext

Thu Jun 23 07:04:27 MDT 2011

Hi Andrew,

Andrew Bartlett wrote:
> Just a note to the list to say that I worked things out with Michael
> over IRC, and we agreed to remove CH_DISPLAY, and consider if we should
> make new internal persistent databases UTF8.

Yeah, thanks for the good discussion!
You finally talked me down. ;-)

No, seriously, I no longer have any real objections to the
patches. They do fix the issues in a pragmatic and reasonable
way (for now), but I want to summarize my main points here
so that they do not get lost:

1) The existence of CH_DISPLAY was not the source of the problems
   observed. It just complicated the situation. Its implementation
   was ok, the only bug being the default initialization to "ASCII".

2) The actual bug was that CH_UNIX as well as UTF8 could appear
   as an input charset for d_printf. This was fixed by your last
   commit changing the output encoding of gettext in "net" to
   CH_UNIX (instead of UTF8). At the current point I am willing
   to believe (but I will check it if I find the time) that there
   are no other sources of UTF8 strings appearing in d_printf
   or debug messages.
   So now the input of d_printf and DEBUG should be CH_UNIX only.
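
To illustrate point 2, here is a minimal sketch of what goes wrong
when an output routine assumes one input charset but is handed
strings in two. Python codecs stand in for Samba's charset
conversion; print_sketch, ISO-8859-1 as the unix charset and UTF-8
as the terminal charset are assumptions made for the illustration,
not the actual d_printf code.

```python
UNIX_CHARSET = "iso8859-1"   # assumed value of the configured unix charset
TERM_CHARSET = "utf-8"       # assumed charset of the terminal


def print_sketch(raw: bytes) -> bytes:
    """Hypothetical stand-in for an output routine that treats every
    input string as UNIX_CHARSET and re-encodes it for the terminal."""
    return raw.decode(UNIX_CHARSET).encode(TERM_CHARSET)


text = "Gr\u00f6\u00dfe"  # a word with two non-ASCII characters

# Input really is in the unix charset: the output is correct.
from_unix = print_sketch(text.encode(UNIX_CHARSET))

# Input is UTF-8 (e.g. a translated message): every multi-byte
# sequence is misread byte by byte and the output is double-encoded.
from_utf8 = print_sketch(text.encode("utf-8"))
```

With a single input charset the routine only has to get one
conversion right, which is what the commit mentioned above achieves.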

I have some points of criticism with CH_UNIX used as charset to
internally store strings (file names, user names, etc) in memory
as well as in databases. I am sure that there have been very good
reasons for introducing CH_UNIX as internal encoding in the past,
but I am questioning this anyway:

1) This loses information too early!
   The mapping Unicode --> CH_UNIX is potentially lossy.
   E.g. if I use ASCII or some latin/iso charset, then some characters
   will not be displayable. Maybe even unmarshalling will fail
   so users will not be available, depending on the value of CH_UNIX.

2) Storing our internal databases (s3 eg: group mapping, passdb)
   in CH_UNIX is a very bad thing: This encoding might be changed
   by the administrators and the databases are not converted
   automatically. Neither is the file system, but there is convmv
   for that. For the internal DBs there is not even a conversion
   tool. I have to check which other databases are stored in
   which encoding, especially in samba4.

   I have had to do quite cumbersome manual db repairs because
   of this problem more than once already. This was really bad!
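
Both points can be shown with a small sketch. Python codecs stand in
for Samba's charset conversion here, and the sample name and the
charsets chosen are assumptions for illustration only.

```python
name = "J\u00f6rg\u20ac"  # hypothetical name: o-umlaut plus a euro sign

# Point 1: Unicode -> a narrow unix charset is lossy.
# A replacing conversion silently turns characters into '?'.
lossy = name.encode("ascii", errors="replace")

# A strict conversion fails outright: the euro sign is not in Latin-1.
try:
    name.encode("iso8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Point 2: a record written while the unix charset was Latin-1 ...
stored = "J\u00f6rg".encode("iso8859-1")

# ... can no longer be read once the administrator switches to UTF-8.
try:
    stored.decode("utf-8")
    readable_after_change = True
except UnicodeDecodeError:
    readable_after_change = False
```

In the opposite direction (written as UTF-8, read as Latin-1) the
read does not even fail; it quietly returns garbage, which is harder
to notice.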

In order to fix #2, there are two options:

a) Change the dbs (individually) to convert from internal
   representation to UTF8 (or UTF16 maybe), before storing.

b) Change samba to internally store everything in UTF8
   and then write out the DBs unchanged.
   For every target that needs a special encoding (like
   the file system needing CH_UNIX), we'd then need to convert
   before accessing the target (like I detailed in my
   previous emails).
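
The difference between the two options is where the conversion
happens. A rough sketch, with a dict standing in for a database and
Python codecs for the charset conversion; all function names and the
Latin-1 unix charset are hypothetical:

```python
UNIX_CHARSET = "iso8859-1"  # assumed value of the configured unix charset


# Option (a): strings live in memory as UNIX_CHARSET bytes and every
# database converts at its own boundary.
def db_store_a(db: dict, key: str, value_unix: bytes) -> None:
    db[key] = value_unix.decode(UNIX_CHARSET).encode("utf-8")


def db_fetch_a(db: dict, key: str) -> bytes:
    return db[key].decode("utf-8").encode(UNIX_CHARSET)


# Option (b): strings live in memory as UTF-8, the database stores
# them unchanged, and only targets that need the unix charset convert.
def db_store_b(db: dict, key: str, value_utf8: bytes) -> None:
    db[key] = value_utf8


def to_filesystem_name(value_utf8: bytes) -> bytes:
    return value_utf8.decode("utf-8").encode(UNIX_CHARSET)


name = "J\u00f6rg"
db_a, db_b = {}, {}
db_store_a(db_a, "user", name.encode(UNIX_CHARSET))
db_store_b(db_b, "user", name.encode("utf-8"))
```

Both variants leave the same UTF-8 bytes on disk. They differ in
what the rest of the code sees in memory.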

In either case we also need an encoding conversion tool for each
such database, since afaik we cannot reliably autodetect
the encoding of the stored data.
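
A short sketch of why detection is unreliable and what such a tool
would have to look like. convert_db is a hypothetical helper, not an
existing Samba tool, and a dict again stands in for a database.

```python
def convert_db(db: dict, from_charset: str, to_charset: str = "utf-8") -> dict:
    """Hypothetical conversion helper: the source charset has to be
    supplied by the administrator, it is not guessed from the data."""
    return {k: v.decode(from_charset).encode(to_charset)
            for k, v in db.items()}


# The same stored bytes decode without error under both charsets,
# so the data alone does not tell us which one was used.
stored = b"J\xc3\xb6rg"
as_utf8 = stored.decode("utf-8")        # 4 characters
as_latin1 = stored.decode("iso8859-1")  # 5 characters, also "valid"

# Converting a database that was written with a Latin-1 unix charset.
old_db = {"user": "J\u00f6rg".encode("iso8859-1")}
new_db = convert_db(old_db, from_charset="iso8859-1")
```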

In order to fix #1 though, option (b) is the only possible way.

So my wish would be to convert all of samba to use UTF8
internally (I'd be ready to discuss a different unicode
charset like UTF16), and convert to CH_UNIX for the necessary
communication interfaces with the outside.

I hope this makes my argument a little clearer.

Cheers - Michael
