CH_DISPLAY and gettext

Wed Jun 22 15:42:28 MDT 2011

I have to correct myself on one point:
It is not true that the names obtained from the DC
(via LSA calls, say) are converted to UTF8;
they are actually converted into CH_UNIX.
(ndr_push_charset and ndr_pull_charset convert from
and to CH_UNIX, respectively.)

But the main reasoning is still valid.
We blend strings from CH_UNIX and UTF8 into
the same string that should be processed
further for output. And so forth.
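
To make the blending concrete, here is a minimal sketch. It assumes
CH_UNIX is ISO-8859-1 and uses a hypothetical format_message() as a
stand-in for d_printf(); neither is_valid_utf8() nor format_message()
exists in Samba. It only illustrates that a UTF-8 message with a
CH_UNIX name pasted in is no longer a well-formed UTF-8 string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Structural UTF-8 check: a lead byte must be followed by the right
 * number of continuation bytes. It does not reject every overlong or
 * surrogate form, but it is enough to spot stray Latin-1 bytes.
 */
bool is_valid_utf8(const char *str)
{
	const unsigned char *s = (const unsigned char *)str;
	size_t len, i = 0;

	assert(str != NULL);
	len = strlen(str);
	while (i < len) {
		size_t need, k;

		if (s[i] < 0x80) need = 0;
		else if (s[i] >= 0xC2 && s[i] <= 0xDF) need = 1;
		else if (s[i] >= 0xE0 && s[i] <= 0xEF) need = 2;
		else if (s[i] >= 0xF0 && s[i] <= 0xF4) need = 3;
		else return false;	/* not a legal lead byte */

		if (len - i - 1 < need) return false;	/* truncated sequence */
		for (k = 1; k <= need; k++)
			if ((s[i + k] & 0xC0) != 0x80) return false;
		i += need + 1;
	}
	return true;
}

/*
 * Hypothetical stand-in for d_printf(): the format string is the
 * German "Datei geloescht: %s" with o-umlaut encoded as UTF-8, as a
 * UTF-8 message catalogue would supply it. Returns false on truncation.
 */
bool format_message(char *out, size_t outlen, const char *name)
{
	int n = snprintf(out, outlen, "Datei gel\xc3\xb6scht: %s", name);

	return n >= 0 && (size_t)n < outlen;
}
```

Formatting the name "M\xc3\xbcller.txt" (UTF-8) gives a string that
passes is_valid_utf8(), while "M\xfcller.txt" (the same name in
ISO-8859-1) gives one that fails it, so any later conversion from UTF8
has to choke on it or garble it.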

Cheers - Michael

Michael Adam wrote:
> Hi Andrew (and others),
> 
> Andrew Bartlett wrote:
> > I've been looking closely at the implementation of internationalisation
> > in Samba, and I'm rather confused about how it is expected to work
> > except in a UTF8 locale.
> > 
> > [...]
> > 
> > 'net' and 'pam_winbind' are internationalised with libintl/gettext,
> > with .mo files being installed as part of make install (except in the
> > waf builds - a bug). 
> > 
> > SWAT [...]
> >
> > Finally, most of Samba uses d_printf(), which causes strings to be
> > converted from UTF8 (the source format) to CH_DISPLAY.
> > 
> > My concern is about the combination of these two elements.  When a
> > string is internationalised into (say) German, the messages are placed
> > in a .mo file as UTF8.  
> > 
> > When we read file-names to display from a remote server however, these
> > strings are in unix charset. 
> > 
> > Then, when we d_printf() these strings, we convert them into CH_DISPLAY,
> > based on the system locale or the LANG environment variable.
> > 
> > The trouble is, what is the source charset, where CH_DISPLAY is not
> > CH_UNIX? 
> 
> I think the problem is not what happens when CH_DISPLAY != CH_UNIX
> since these are currently both on different sides of the
> conversion, but what happens when CH_UNIX != UTF8. Because then
> the input of the conversion call is a mix of messages
> potentially containing multibyte characters in UTF8 and file
> names in CH_UNIX (latin1, some Japanese encoding, ...).
> 
> Note that these UTF8-characters in messages do not only occur
> with internationalized messages via gettext but also when for
> instance user names with umlauts are retrieved from a domain
> controller. These are converted by winbindd from the windows
> encoding to UTF8 internally.
> 
> There is, by the way, a bug that I observed, triggered by your
> recent changes: the wbinfo tool does not call lp_load(), so it
> does not read the config and does not initialize iconv.
> So when a name with multibyte characters arrives in wbinfo
> (e.g. via wbinfo -u  or wbinfo -s <SID>), then this UTF8-string
> is handed to convert_string_talloc(), which initializes the
> iconv handle (global_iconv_handle) with the current default
> values of dos-charset = ASCII, unix-charset = UTF-8 and
> display-charset = ASCII. So wbinfo's d_printf stumbles
> over illegal multibyte characters when trying to convert to
> ASCII. The attached patch changes the default initialization
> of the display charset to "LOCALE", fixing the bug for me.
> Maybe this can be applied while we are still looking for a
> better overall solution.
> 
> Coming back to the separation between CH_UNIX and the internal
> UTF8 encoding: I think this is the actual problem we are facing.
> Looking from a very high perspective, what I think we should do
> is to keep all strings as UTF8 internally, including filenames.
> We do this already for everything apart from filenames (if not, then
> I would consider this a bug). For filenames from
> non-UTF8-systems, this would of course mean that we would have
> to convert between CH_UNIX and UTF8 before/after syscalls
> that handle file names such as open and friends. For a "standard"
> system of today that uses UTF8, these conversions would be skipped.
> 
> Alternatively, we could keep the file names in CH_UNIX and convert
> them to UTF8 before all d_printf and debug message output calls.
> I personally like the all-strings-are-utf8-internally approach
> much better. VFS modules that replace the basic syscalls will
> have to be very careful here to do the proper conversions.
> 
> Anyhow, the current status with messages composed from a mix
> of CH_UNIX and UTF8 components is simply buggy for non-UTF8
> systems.
> 
> Whether (and when) we can reach a state where we drop CH_UNIX and
> always do UTF8 is unclear to me. At least currently, it does
> not seem to be a real option.
> 
> Coming back to the question of removing CH_DISPLAY:
> As I said, to my understanding this is secondary to the
> CH_UNIX vs. UTF8 problem. CH_DISPLAY adds a way to have the
> shell locale different from the charset used in the file system.
> Which is rarely useful. It was said in another mail that this
> separation is still widely used in many Japanese setups. So maybe
> we cannot remove CH_DISPLAY?
> 
> I personally would be ok with removing it, but then there are
> two options:
> 
> 1. replace CH_DISPLAY with CH_UNIX, i.e. always doing output in
>    the same charset as the one the file system is using.
> 
> 2. replace CH_DISPLAY with UTF8, i.e. always output in the
>    charset used internally and do no conversion ourselves.
> 
> While option #1 may seem the more natural choice, option #2
> offers the greatest flexibility (apart from the advantage
> of not losing information about the characters):
> 
> Following an idea that Metze told me, one could use some sort of
> display wrapper (maybe hooking onto fds 0,1,2 ...) that would
> do the desired conversion from UTF8 to whatever encoding
> _after_ the d_printf (or similar) output has happened.
> 
> 
> So my idea is this (and I would be glad to help implement it
> as much as my time permits):
> 
> 1. always use UTF8 internally, also for file names
> 2. hence add conversion from / to CH_UNIX before/after some syscalls
>    (if CH_UNIX != UTF8)
> 3. I don't yet see that we do really have a problem with
>    CH_DISPLAY, but if we decide to remove CH_DISPLAY then we need
>    to decide on one of the options above.
> 
> This would mean that we could continue using the gettext macros
> for internationalization, given all input was correctly in UTF8.
> Then use d_printf to convert the stuff to CH_DISPLAY (or CH_UNIX)
> or leave it at UTF8 depending on the decision above, but in any
> case obtaining correct results. For the syscalls, the filenames
> would have to be converted between CH_UNIX and UTF8 if these are
> different.
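
The conversion at the syscall boundary could look roughly like the
sketch below. It assumes CH_UNIX is ISO-8859-1 and hand-codes that one
conversion to stay self-contained; real code would go through the
existing charset conversion routines instead. latin1_to_utf8(),
utf8_to_latin1() and open_utf8_path() are hypothetical names used only
for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* ISO-8859-1 -> UTF-8. Returns bytes written (without NUL) or -1. */
long latin1_to_utf8(const char *in, char *out, size_t outlen)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t o = 0;

	assert(in != NULL && out != NULL);
	if (outlen == 0) return -1;
	for (; *p != '\0'; p++) {
		if (*p < 0x80) {
			if (o + 1 >= outlen) return -1;
			out[o++] = (char)*p;
		} else {
			/* U+0080..U+00FF always take two bytes in UTF-8 */
			if (o + 2 >= outlen) return -1;
			out[o++] = (char)(0xC0 | (*p >> 6));
			out[o++] = (char)(0x80 | (*p & 0x3F));
		}
	}
	out[o] = '\0';
	return (long)o;
}

/* UTF-8 -> ISO-8859-1. Returns -1 for anything above U+00FF. */
long utf8_to_latin1(const char *in, char *out, size_t outlen)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t o = 0;

	assert(in != NULL && out != NULL);
	if (outlen == 0) return -1;
	while (*p != '\0') {
		unsigned char c;

		if (p[0] < 0x80) {
			c = p[0];
			p += 1;
		} else if ((p[0] == 0xC2 || p[0] == 0xC3) &&
			   (p[1] & 0xC0) == 0x80) {
			c = (unsigned char)(((p[0] & 0x03) << 6) | (p[1] & 0x3F));
			p += 2;
		} else {
			return -1;	/* malformed or not representable */
		}
		if (o + 1 >= outlen) return -1;
		out[o++] = (char)c;
	}
	out[o] = '\0';
	return (long)o;
}

/* Wrapper: callers pass UTF-8, the file system sees CH_UNIX. */
int open_utf8_path(const char *utf8_path, int flags, mode_t mode)
{
	char unix_path[4096];

	if (utf8_to_latin1(utf8_path, unix_path, sizeof(unix_path)) < 0) {
		errno = EILSEQ;
		return -1;
	}
	return open(unix_path, flags, mode);
}
```

Names read back from the file system (readdir and friends) would take
the opposite direction through latin1_to_utf8() before they are stored
or printed. When CH_UNIX is already UTF8, both steps would be skipped.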
> 
> What have I missed in this analysis?
> 
> Cheers - Michael
> 
> > (1) We could say that the source charset is UTF8, in which case UNIX
> > charset filenames would be wrong. 
> > (2) We could say the source charset is UNIX, but then the gettext
> > message will be wrong.
> > (3) We could assume that the internationalised message format string is
> > DISPLAY, but the arguments are UNIX, but then we would have to 'ban'
> > using N_() to translate %s arguments.  Hopefully we never put a unix
> > (not C) string directly into d_printf() in this case. 
> > 
> > Perhaps someone with a longer background in this area might be able to
> > help me untangle this mess?
> > 
> > In my patches to use a common d_printf() I originally implemented (1),
> > but attach a patch to fix that up, and one to do (2).  I attempted (and
> > failed) to implement (3).
> > 
> > Or, is it simply not practical to actually have 'display
> > charset'/CH_DISPLAY not equal 'unix charset'/CH_UNIX, and we should
> > simply remove the parameter?
> > 
> > Thanks,
> > 
> > Andrew Bartlett
> > -- 
> > Andrew Bartlett                                http://samba.org/~abartlet/
> > Authentication Developer, Samba Team           http://samba.org

> From 9449234524662d4bdd6c89bf70d2c73c19410363 Mon Sep 17 00:00:00 2001
> From: Michael Adam <obnox at samba.org>
> Date: Wed, 22 Jun 2011 14:04:07 +0200
> Subject: [PATCH] lib/util/charset: initialize the display charset as LOCALE by default
> 
> When the display charset is initialized to ASCII as the default fallback,
> then wbinfo can for instance not print windows user names with non-ascii
> characters. This fixes it by using the locale to determine the standard
> output character set for messages.
> 
> Signed-off-by: Michael Adam <obnox at samba.org>
> ---
>  lib/util/charset/codepoints.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/lib/util/charset/codepoints.c b/lib/util/charset/codepoints.c
> index 71611bf..193af0e 100644
> --- a/lib/util/charset/codepoints.c
> +++ b/lib/util/charset/codepoints.c
> @@ -168,7 +168,7 @@ struct smb_iconv_handle *get_iconv_handle(void)
>  {
>  	if (global_iconv_handle == NULL)
>  		global_iconv_handle = smb_iconv_handle_reinit(talloc_autofree_context(),
> -									"ASCII", "UTF-8", "ASCII", true, NULL);
> +									"ASCII", "UTF-8", "LOCALE", true, NULL);
>  	return global_iconv_handle;
>  }
>  
> -- 
> 1.7.1
> 
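
For readers unfamiliar with the "LOCALE" setting: the sketch below
assumes that it ends up being resolved the way POSIX programs usually
determine their output charset, namely from the environment via
setlocale() and nl_langinfo(CODESET). The function name
locale_codeset() and the "ASCII" fallback are choices made for this
example, not a copy of the Samba code.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Return the charset name implied by LANG / LC_ALL / LC_CTYPE,
 * e.g. "UTF-8" or "ISO-8859-1". Falls back to "ASCII" when the
 * locale gives no usable answer.
 */
const char *locale_codeset(void)
{
	const char *cs;

	/* Adopt the locale of the environment; the "C" locale stays
	 * in effect if this fails. */
	setlocale(LC_CTYPE, "");

	cs = nl_langinfo(CODESET);
	if (cs == NULL || strlen(cs) == 0) {
		return "ASCII";
	}
	return cs;
}
```

With LANG set to a UTF-8 locale this typically reports "UTF-8", so a
tool that never reads smb.conf would still print non-ASCII names in
the terminal's charset.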