CH_DISPLAY and gettext

Wed Jun 22 06:37:50 MDT 2011

Hi Andrew (and others),

Andrew Bartlett wrote:
> I've been looking closely at the implementation of internationalisation
> in Samba, and I'm rather confused about how it is expected to work
> except in a UTF8 locale.
> 
> [...]
> 
> 'net' and 'pam_winbind' are internationalised with libintl/gettext,
> with .mo files being installed as part of make install (except in the
> waf builds - a bug). 
> 
> SWAT [...]
>
> Finally, most of Samba uses d_printf(), which causes strings to be
> converted from UTF8 (the source format) to CH_DISPLAY.
> 
> My concern is about the combination of these two elements.  When a
> string is internationalised into (say) German, the messages are placed
> in a .mo file as UTF8.  
> 
> When we read file-names to display from a remote server however, these
> strings are in unix charset. 
> 
> Then, when we d_printf() these strings, we convert them into CH_DISPLAY,
> based on the system locale or the LANG environment variable.
> 
> The trouble is, what is the source charset, where CH_DISPLAY is not
> CH_UNIX? 

I think the problem is not what happens when CH_DISPLAY != CH_UNIX,
since these are currently on opposite sides of the
conversion, but what happens when CH_UNIX != UTF8. In that case
the input of the conversion call is a mix of messages
potentially containing multibyte characters in UTF8 and file
names in CH_UNIX (latin1, some Japanese encoding, ...).
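
To make the mixing problem concrete, here is a minimal Python sketch
(not Samba code). The message text, the file name and the choice of
latin1 as "unix charset" are made up for illustration. It shows that
no single source charset decodes both parts of such a buffer correctly.

```python
# Illustration only: a UTF8 message mixed with a file name in a
# hypothetical non-UTF8 "unix charset" (latin1 here).
message = "Datei gelöscht: ".encode("utf-8")   # e.g. a gettext translation
filename = "übung.txt".encode("latin-1")       # name as stored on disk

mixed = message + filename   # the input of the single conversion call


def try_decode(data, charset):
    """Return the decoded text, or None if data is invalid in charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


# Treating everything as UTF8 fails on the latin1 file name:
# the byte 0xfc is not a valid UTF8 sequence.
as_utf8 = try_decode(mixed, "utf-8")

# Treating everything as latin1 "works", but the UTF8 message is
# garbled: the two bytes of "ö" become two separate characters.
as_latin1 = try_decode(mixed, "latin-1")
```

Here as_utf8 ends up as None, and as_latin1 contains the file name
intact but a mangled message.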

Note that these UTF8 characters in messages do not only occur
with messages internationalized via gettext, but also when, for
instance, user names with umlauts are retrieved from a domain
controller. These are converted by winbindd from the Windows
encoding to UTF8 internally.

By the way, I observed a bug that is triggered by your
recent changes: the wbinfo tool does not call lp_load(), so it
does not read the config and does not initialize iconv.
So when a name with multibyte characters arrives in wbinfo
(e.g. via wbinfo -u or wbinfo -s <SID>), this UTF8 string
is handed to convert_string_talloc(), which initializes the
iconv handle (global_iconv_handle) with the current default
values of dos-charset = ASCII, unix-charset = UTF-8 and
display-charset = ASCII. So wbinfo's d_printf stumbles
over illegal multibyte characters when trying to convert to
ASCII. The attached patch changes the default initialization
of the display charset to "LOCALE", which fixes the bug for me.
Maybe this can be applied while we are still looking for a
better overall solution.
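
The failure mode can be reproduced in a few lines of Python. This is
only an analogy, not the Samba code path: convert_for_display() and
locale_display_charset() are hypothetical helpers, and the user name
is made up. A strict conversion of a UTF8 name to ASCII fails, while
a charset that can represent the umlaut (such as the one taken from
the locale) succeeds.

```python
import locale


def convert_for_display(utf8_bytes, display_charset):
    """Strictly convert a UTF8 byte string to the display charset."""
    return utf8_bytes.decode("utf-8").encode(display_charset)


def locale_display_charset():
    """Charset of the current locale, as a stand-in for "LOCALE"."""
    return locale.getpreferredencoding(False)


name = "Müller".encode("utf-8")   # made-up user name with an umlaut

# With display charset ASCII the conversion fails on the umlaut.
try:
    convert_for_display(name, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With a charset that can represent the umlaut it succeeds.
latin1_result = convert_for_display(name, "iso-8859-1")
```

Whether locale_display_charset() returns something useful depends on
the environment the tool is started in, which is exactly the point of
taking the default from the locale instead of hardcoding ASCII.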

Coming back to the separation between CH_UNIX and the internal
UTF8 encoding: I think this is the actual problem we are facing.
Looking at it from a very high level, what I think we should do
is keep all strings as UTF8 internally, including file names.
We already do this for everything apart from file names (if not,
I would consider that a bug). For file names on
non-UTF8 systems, this would of course mean that we would have
to convert between CH_UNIX and UTF8 before/after syscalls
that handle file names, such as open and friends. On a "standard"
system of today that uses UTF8, these conversions would be skipped.
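
A rough Python sketch of what such a conversion layer could look
like follows. It is an illustration under assumptions, not a proposal
for the actual Samba code: all function names (is_utf8, to_unix,
from_unix, create_file, list_files) are hypothetical, and file names
are handled as byte strings.

```python
import codecs
import os


def is_utf8(charset):
    """True if the charset name is an alias of UTF-8."""
    return codecs.lookup(charset).name == "utf-8"


def to_unix(name_utf8, unix_charset):
    """UTF8 bytes -> file system charset, applied before a syscall."""
    if is_utf8(unix_charset):
        return name_utf8               # conversion skipped
    return name_utf8.decode("utf-8").encode(unix_charset)


def from_unix(name_unix, unix_charset):
    """File system charset -> UTF8 bytes, applied after a syscall."""
    if is_utf8(unix_charset):
        return name_unix               # conversion skipped
    return name_unix.decode(unix_charset).encode("utf-8")


def create_file(directory, name_utf8, unix_charset):
    """open()-like wrapper: callers only ever pass UTF8 names."""
    path = os.path.join(os.fsencode(directory),
                        to_unix(name_utf8, unix_charset))
    with open(path, "wb"):
        pass


def list_files(directory, unix_charset):
    """readdir()-like wrapper: callers only ever see UTF8 names."""
    names = os.listdir(os.fsencode(directory))
    return sorted(from_unix(n, unix_charset) for n in names)
```

Everything above the wrappers sees UTF8 only; the unix charset is
confined to the two small conversion functions.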

Alternatively, we could keep the file names in CH_UNIX and convert
them to UTF8 before all d_printf and debug message output calls.
I personally like the all-strings-are-utf8-internally approach
much better. VFS modules that replace the basic syscalls will
have to be very careful here to do the proper conversions.

Anyhow, the current status with messages composed from a mix
of CH_UNIX and UTF8 components is simply buggy for non-UTF8
systems.

Whether (and when) we can reach a state where we drop CH_UNIX and
always do UTF8 is unclear to me. At least currently, it does
not seem to be a real option.

Coming back to the question of removing CH_DISPLAY:
As I said, to my understanding this is secondary to the
CH_UNIX vs. UTF8 problem. CH_DISPLAY adds a way to have the
shell locale differ from the charset used in the file system,
which is rarely useful. It was said in another mail, though, that
this separation is still widely used in many Japanese setups. So
maybe we cannot remove CH_DISPLAY?

I personally would be ok with removing it, but then there are
two options:

1. replace CH_DISPLAY with CH_UNIX, i.e. always do output in
   the same charset as the one the file system is using.

2. replace CH_DISPLAY with UTF8, i.e. always output in the
   charset used internally and do no conversion ourselves.

While option #1 may seem the more natural choice, option #2
offers the greatest flexibility (apart from the advantage
of not losing information about the characters):

Following an idea that Metze told me, one could use some sort of
display wrapper (maybe hooking onto fds 0,1,2 ...) that would
do the desired conversion from UTF8 to whatever encoding
_after_ the d_printf (or similar) output has happened.
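
As a sketch of how such a wrapper might behave, here is a small
Python illustration. The class name DisplayWrapper and its parameters
are hypothetical, and a real implementation would hook onto the file
descriptors rather than wrap a Python object. The point shown is that
the wrapper has to decode incrementally, since a multibyte UTF8
character can be split across two writes.

```python
import codecs
import io


class DisplayWrapper:
    """Re-encode a UTF8 byte stream to another charset on the way out."""

    def __init__(self, sink, display_charset, errors="replace"):
        self.sink = sink
        self.display_charset = display_charset
        self.errors = errors
        # Incremental decoder: it buffers an incomplete multibyte
        # sequence until the remaining bytes arrive.
        self.decoder = codecs.getincrementaldecoder("utf-8")()

    def write(self, utf8_bytes):
        text = self.decoder.decode(utf8_bytes)
        self.sink.write(text.encode(self.display_charset, self.errors))
        return len(utf8_bytes)


sink = io.BytesIO()
out = DisplayWrapper(sink, "iso-8859-1")

data = "Benutzer: Müller\n".encode("utf-8")
split = data.index(b"\xc3") + 1    # cut inside the two-byte "ü"
out.write(data[:split])
out.write(data[split:])

result = sink.getvalue()           # same text, now in latin1
```

Characters that the target charset cannot represent are replaced with
"?" in this sketch. That loss then happens only at the very last
step, while everything before the wrapper keeps the full UTF8 text.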

So my idea is this (and I would be glad to help implement it
as much as my time permits):

1. always use UTF8 internally, also for file names
2. hence add conversion from / to CH_UNIX before/after some syscalls
   (if CH_UNIX != UTF8)
3. I don't yet see that we really have a problem with
   CH_DISPLAY, but if we decide to remove CH_DISPLAY then we need
   to decide on one of the options above.

This would mean that we could continue using the gettext macros
for internationalization, provided all input is correctly in UTF8.
We would then use d_printf to convert the output to CH_DISPLAY (or
CH_UNIX), or leave it as UTF8, depending on the decision above, and
in any case obtain correct results. For the syscalls, the file names
would have to be converted between CH_UNIX and UTF8 if these are
different.

What have I missed in this analysis?

Cheers - Michael

> (1) We could say that the source charset is UTF8, in which case UNIX
> charset filenames would be wrong. 
> (2) We could say the source charset is UNIX, but then the gettext
> message will be wrong.
> (3) We could assume that the internationalised message format string is
> DISPLAY, but the arguments are UNIX, but then we would have to 'ban'
> using N_() to translate %s arguments.  Hopefully we never put a unix
> (not C) string directly into d_printf() in this case. 
> 
> Perhaps someone with a longer background in this area might be able to
> help me untangle this mess?
> 
> In my patches to use a common d_printf() I originally implemented (1),
> but attach a patch to fix that up, and one to do (2).  I attempted (and
> failed) to implement (3).
> 
> Or, is it simply not practical to actually have 'display
> charset'/CH_DISPLAY not equal 'unix charset'/CH_UNIX, and we should
> simply remove the parameter?
> 
> Thanks,
> 
> Andrew Bartlett
> -- 
> Andrew Bartlett                                http://samba.org/~abartlet/
> Authentication Developer, Samba Team           http://samba.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-lib-util-charset-initialize-the-display-charset-as-L.patch
Type: text/x-patch
Size: 1137 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20110622/91854848/attachment.bin>