CH_DISPLAY and gettext
obnox at samba.org
Thu Jun 23 07:04:27 MDT 2011
Andrew Bartlett wrote:
> Just a note to the list to say that I worked things out with Michael
> over IRC, and we agreed to remove CH_DISPLAY, and consider if we should
> make new internal persistent databases UTF8.
Yeah, thanks for the good discussion!
You finally talked me down. ;-)
No, seriously, I did not have any real objections any more
about the patches. They do fix the issues in a pragmatic and
reasonable way (for now), but I have to make the following
summary of my main points, so it does not get lost:
1) The existence of CH_DISPLAY was not the source of the problems
observed. It just complicated the situation. Its implementation
was ok, the only bug being the default initialization to "ASCII".
2) The actual bug was that CH_UNIX as well as UTF8 could appear
as an input charset for d_printf. This was fixed by your last
commit changing the output encoding of gettext in "net" to
CH_UNIX (instead of UTF8). At the current point I am willing
to believe (but I will check it if I find the time) that there
are no other sources of UTF8 strings appearing in d_printf
or debug messages.
So now the input of d_printf and DEBUG should be CH_UNIX only.
I have some points of criticism with CH_UNIX used as charset to
internally store strings (file names, user names, etc) in memory
as well as in databases. I am sure that there have been very good
reasons for introducing CH_UNIX as internal encoding in the past,
but I am questioning this anyways:
1) This loses information too early!
The mapping Unicode --> CH_UNIX is potentially lossy.
E.g. if I use ASCII or some latin/iso charset, then some characters
will not be displayable. Maybe even unmarshalling will fail
so users will not be available, depending on the value of CH_UNIX.
2) Storing our internal databases (s3, e.g. group mapping, passdb)
in CH_UNIX is a very bad thing: This encoding might be changed
by the administrators and the databases are not converted
automatically. Neither is the file system, but there is convmv
for this. For the internal DBs there is not even a
conversion tool. I have to look which other databases are
stored in which encoding, especially in samba4.
I have had to do quite cumbersome manual db repair due to this
problem more than once already. This was really bad!
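To illustrate both points, here is a minimal Python sketch. It is
not Samba code: the function names are made up for illustration,
and Python codec names stand in for whatever CH_UNIX is configured
to. It shows a name that cannot be mapped to a narrow charset, and
what happens when bytes stored under one charset are read back
after the charset setting has been changed.

```python
# Illustration only: Python codecs stand in for the configured CH_UNIX
# charset. None of these helpers exist in Samba.

def to_unix_charset(name: str, unix_charset: str) -> bytes:
    """Strict conversion; raises UnicodeEncodeError if a character has no mapping."""
    return name.encode(unix_charset)


def is_representable(name: str, unix_charset: str) -> bool:
    """True if every character of name can be mapped to unix_charset."""
    try:
        to_unix_charset(name, unix_charset)
        return True
    except UnicodeEncodeError:
        return False


def reread_after_charset_change(name: str, old_charset: str, new_charset: str) -> str:
    """Store name under old_charset, then read the same bytes as new_charset."""
    stored = name.encode(old_charset)
    # Undecodable bytes become U+FFFD instead of raising, to make the damage visible.
    return stored.decode(new_charset, errors="replace")


if __name__ == "__main__":
    print(is_representable("Müller", "iso-8859-1"))
    print(is_representable("Müller", "ascii"))
    print(reread_after_charset_change("Müller", "iso-8859-1", "utf-8"))
```

The strict variant corresponds to the "unmarshalling will fail"
case; a replacing conversion would instead silently alter the name.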
In order to fix #2, there are two options:
a) Change the dbs (individually) to convert from internal
representation to UTF8 (or UTF16 maybe), before storing.
b) Change samba to internally store everything in UTF8
and then write out the DBs unchanged.
For every target that needs a special encoding (like
the file system needing CH_UNIX), we'd then need to convert
before accessing the target (like I detailed in my
In either case we also need an encoding conversion tool for each
such database, since afaik we can not reliably autodetect
the encoding of the stored data.
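A rough Python sketch of why the tool has to be told the source
encoding (again not Samba code; convert_record and
plausible_charsets are hypothetical names): the same stored bytes
decode without error under more than one single-byte charset, each
giving a different string, so validity alone cannot pick the right
one.

```python
# Illustration only: hypothetical helpers, not an existing Samba tool.

def convert_record(raw: bytes, from_charset: str, to_charset: str = "utf-8") -> bytes:
    """Re-encode one stored value; from_charset must be supplied by the admin."""
    return raw.decode(from_charset).encode(to_charset)


def plausible_charsets(raw: bytes, candidates: list) -> list:
    """Return every candidate charset under which raw decodes without error."""
    result = []
    for charset in candidates:
        try:
            raw.decode(charset)
        except UnicodeDecodeError:
            continue
        result.append(charset)
    return result


if __name__ == "__main__":
    raw = b"caf\xe9 \xa4"
    # More than one candidate survives, so the bytes alone are ambiguous.
    print(plausible_charsets(raw, ["utf-8", "iso-8859-1", "iso-8859-15"]))
    print(raw.decode("iso-8859-1") == raw.decode("iso-8859-15"))
```

So the conversion would have to take the old charset as an explicit
parameter, the way convmv does for file names.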
In order to fix #1 though, option (b) is the only possible way.
So my wish would be to convert all of samba to use UTF8
internally (I'd be ready to discuss a different unicode
charset like UTF16), and convert to CH_UNIX for the necessary
communication interfaces with the outside.
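As a sketch of what I mean (Python for brevity, all names
hypothetical, a dict standing in for a database): values stay UTF8
in memory and in the DB, and conversion happens only at the edge.
A charset that cannot represent a name then fails at that one
interface, while the stored value stays intact.

```python
# Illustration only: a dict stands in for an internal database and
# Python codec names stand in for the configured CH_UNIX value.

INTERNAL = "utf-8"


def from_outside(raw: bytes, unix_charset: str) -> bytes:
    """Convert bytes arriving in the unix charset to the internal UTF8 form."""
    return raw.decode(unix_charset).encode(INTERNAL)


def to_outside(internal: bytes, unix_charset: str) -> bytes:
    """Convert an internal UTF8 value for a target that needs the unix charset."""
    return internal.decode(INTERNAL).encode(unix_charset)


if __name__ == "__main__":
    db = {}
    # Stored unchanged in UTF8, independent of any charset setting.
    db["user"] = "\u0141ukasz".encode(INTERNAL)

    print(to_outside(db["user"], "iso-8859-2"))
    try:
        to_outside(db["user"], "iso-8859-1")
    except UnicodeEncodeError:
        print("not representable in iso-8859-1, stored value untouched")
```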
I hope this makes my argument a little clearer.
Cheers - Michael