CH_DISPLAY and gettext
Michael Adam
obnox at samba.org
Wed Jun 22 15:42:28 MDT 2011
I have to correct myself on one point:
It is not true that the names obtained from the DC
(via LSA calls, say) are converted to UTF8;
they are really converted into CH_UNIX
(ndr_push_charset and ndr_pull_charset convert from
and to CH_UNIX, respectively).
But the main reasoning is still valid.
We blend strings from CH_UNIX and UTF8 into
the same string that should be processed
further for output. And so forth.
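
To make the blending problem concrete, here is a minimal sketch in
plain C. It uses POSIX iconv() directly rather than Samba's own
conversion functions, the helper names convert(), blend_raw() and
blend_utf8() are made up for illustration, and ISO-8859-1 merely
stands in for some non-UTF8 CH_UNIX. It shows that a buffer blended
from both encodings has no single valid source charset, while
converting each piece from its own charset first gives a clean
UTF8 string.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Illustration data (assumptions, not taken from Samba):
 * a UTF-8 message and a file name encoded in ISO-8859-1. */
static const char msg_utf8[]  = "Gr\xC3\xB6\xC3\x9F" "e von "; /* "Groesse von " with o-umlaut and sharp s */
static const char name_unix[] = "\xFC" "bung.txt";            /* u-umlaut as the single byte 0xFC */

/* Convert inlen bytes from charset `from` to charset `to`.
 * Returns the number of bytes written to out, or -1 if the charset is
 * unknown, the input is invalid, the output does not fit, or any
 * character had to be substituted. */
long convert(const char *from, const char *to,
             const char *in, size_t inlen, char *out, size_t outcap)
{
    assert(in != NULL && out != NULL);
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;
    char *src = (char *)in;          /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t inleft = inlen, outleft = outcap;
    size_t rc = iconv(cd, &src, &inleft, &dst, &outleft);
    iconv_close(cd);
    /* (size_t)-1 is an error; any other nonzero value counts
     * non-reversible substitutions. Both are treated as failure. */
    if (rc != 0)
        return -1;
    return (long)(outcap - outleft);
}

/* The buggy blend: UTF-8 message bytes followed by raw CH_UNIX bytes. */
long blend_raw(char *buf, size_t cap)
{
    size_t mlen = sizeof msg_utf8 - 1, nlen = sizeof name_unix - 1;
    if (cap < mlen + nlen)
        return -1;
    memcpy(buf, msg_utf8, mlen);
    memcpy(buf + mlen, name_unix, nlen);
    return (long)(mlen + nlen);
}

/* The consistent blend: bring the file name to UTF-8 first. */
long blend_utf8(char *buf, size_t cap)
{
    size_t mlen = sizeof msg_utf8 - 1;
    if (cap < mlen)
        return -1;
    memcpy(buf, msg_utf8, mlen);
    long nlen = convert("ISO-8859-1", "UTF-8",
                        name_unix, sizeof name_unix - 1,
                        buf + mlen, cap - mlen);
    return nlen < 0 ? -1 : (long)mlen + nlen;
}
```

Treating the blend_raw() result as UTF8 fails on the 0xFC byte.
Treating it as ISO-8859-1 "succeeds" but double-encodes the umlauts
of the message. Only the blend_utf8() result can be converted to
the output charset without damage.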
Cheers - Michael
Michael Adam wrote:
> Hi Andrew (and others),
>
> Andrew Bartlett wrote:
> > I've been looking closely at the implementation of internationalisation
> > in Samba, and I'm rather confused about how it is expected to work
> > except in a UTF8 locale.
> >
> > [...]
> >
> > 'net' and 'pam_winbind' are internationalised with libintl/gettext,
> > with .mo files being installed as part of make install (except in the
> > waf builds - a bug).
> >
> > SWAT [...]
> >
> > Finally, most of Samba uses d_printf(), which causes strings to be
> > converted from UTF8 (the source format) to CH_DISPLAY.
> >
> > My concern is about the combination of these two elements. When a
> > string is internationalised into (say) German, the messages are placed
> > in a .mo file as UTF8.
> >
> > When we read file-names to display from a remote server however, these
> > strings are in unix charset.
> >
> > Then, when we d_printf() these strings, we convert them into CH_DISPLAY,
> > based on the system locale or the LANG environment variable.
> >
> > The trouble is, what is the source charset, where CH_DISPLAY is not
> > CH_UNIX?
>
> I think the problem is not what happens when CH_DISPLAY != CH_UNIX
> since these are currently both on different sides of the
> conversion, but what happens when CH_UNIX != UTF8. Because then
> the input of the conversion call is a mix of messages
> potentially containing multibyte characters in UTF8 and file
> names in CH_UNIX (latin1, some Japanese encoding, ...).
>
> Note that these UTF8-characters in messages do not only occur
> with internationalized messages via gettext but also when for
> instance user names with umlauts are retrieved from a domain
> controller. These are converted by winbindd from the windows
> encoding to UTF8 internally.
>
> There is by the way a bug that I observed, triggered by your
> recent changes: the wbinfo tool does not call lp_load(), so it
> does not read the config and does not initialize iconv.
> So when a name with multibyte characters arrives in wbinfo
> (e.g. via wbinfo -u or wbinfo -s <SID>), then this UTF8-string
> is handed to convert_string_talloc(), which initializes the
> iconv handle (global_iconv_handle) with the current default
> values of dos-charset = ASCII, unix-charset = UTF-8 and
> display-charset = ASCII. So wbinfo's d_printf stumbles
> over illegal multibyte characters when trying to convert to
> ASCII. The attached patch changes the default initialization
> of the display charset to "LOCALE", fixing the bug for me.
> Maybe this can be applied while we are still looking for a
> better overall solution.
>
> Coming back to the separation between CH_UNIX and the internal
> UTF8 encoding: I think this is the actual problem we are facing.
> Looking from a very high perspective, what I think we should do
> is keep all strings as UTF8 internally, including filenames.
> We do this already for everything apart from filenames (if not, then
> I would consider this a bug). For filenames from
> non-UTF8-systems, this would of course mean that we would have
> to convert between CH_UNIX and UTF8 before/after syscalls
> that handle file names such as open and friends. For a "standard"
> system of today that uses UTF8, these conversions would be skipped.
>
> Alternatively, we could keep the file names in CH_UNIX and convert
> them to UTF8 before all d_printf and debug message output calls.
> I personally like the all-strings-are-utf8-internally approach
> much better. VFS modules that replace the basic syscalls will
> have to be very careful here to do the proper conversions.
>
> Anyhow, the current status with messages composed from a mix
> of CH_UNIX and UTF8 components is simply buggy for non-UTF8
> systems.
>
> Whether (and when) we can reach a state where we drop CH_UNIX and
> always do UTF8 is unclear to me. At least currently, it does
> not seem to be a real option.
>
> Coming back to the question of removing CH_DISPLAY:
> As I said, to my understanding this is secondary to the
> CH_UNIX vs. UTF8 problem. CH_DISPLAY adds a way to have the
> shell locale different from the charset used in the file system,
> which is rarely useful. It was said in another mail that this
> separation is still widely used in many Japanese setups. So maybe
> we cannot remove CH_DISPLAY?
>
> I personally would be ok with removing it, but then there are
> two options:
>
> 1. replace CH_DISPLAY with CH_UNIX, i.e. always do output in
> the same charset as the one the file system is using.
>
> 2. replace CH_DISPLAY with UTF8, i.e. always output in the
> charset used internally and do no conversion ourselves.
>
> While option #1 may seem the more natural choice, option #2
> offers the greatest flexibility (apart from the advantage
> of not losing information about the characters):
>
> Following an idea that Metze told me, one could use some sort of
> display wrapper (maybe hooking onto fds 0,1,2 ...) that would
> do the desired conversion from UTF8 to whatever encoding
> _after_ the d_printf (or similar) output has happened.
>
>
> So my idea is this (and I would be glad to help implement it
> as much as my time permits):
>
> 1. always use UTF8 internally, also for file names
> 2. hence add conversion from / to CH_UNIX before/after some syscalls
> (if CH_UNIX != UTF8)
> 3. I don't yet see that we really have a problem with
> CH_DISPLAY, but if we decide to remove CH_DISPLAY then we need
> to decide on one of the options above.
>
> This would mean that we could continue using the gettext macros
> for internationalization, given all input was correctly in UTF8.
> Then use d_printf to convert the stuff to CH_DISPLAY (or CH_UNIX)
> or leave it at UTF8 depending on the decision above, but in any
> case obtaining correct results. For the syscalls, the filenames
> would have to be converted between CH_UNIX and UTF8 if these are
> different.
>
> What have I missed in this analysis?
>
> Cheers - Michael
>
> > (1) We could say that the source charset is UTF8, in which case UNIX
> > charset filenames would be wrong.
> > (2) We could say the source charset is UNIX, but then the gettext
> > message will be wrong.
> > (3) We could assume that the internationalised message format string is
> > DISPLAY, but the arguments are UNIX, but then we would have to 'ban'
> > using N_() to translate %s arguments. Hopefully we never put a unix
> > (not C) string directly into d_printf() in this case.
> >
> > Perhaps someone with a longer background in this area might be able to
> > help me untangle this mess?
> >
> > In my patches to use a common d_printf() I originally implemented (1),
> > but attach a patch to fix that up, and one to do (2). I attempted (and
> > failed) to implement (3).
> >
> > Or, is it simply not practical to actually have 'display
> > charset'/CH_DISPLAY not equal 'unix charset'/CH_UNIX, and we should
> > simply remove the parameter?
> >
> > Thanks,
> >
> > Andrew Bartlett
> > --
> > Andrew Bartlett http://samba.org/~abartlet/
> > Authentication Developer, Samba Team http://samba.org
> From 9449234524662d4bdd6c89bf70d2c73c19410363 Mon Sep 17 00:00:00 2001
> From: Michael Adam <obnox at samba.org>
> Date: Wed, 22 Jun 2011 14:04:07 +0200
> Subject: [PATCH] lib/util/charset: initialize the display charset as LOCALE by default
>
> When the display charset is initialized to ASCII as the default fallback,
> then wbinfo can for instance not print windows user names with non-ascii
> characters. This fixes it by using the locale to determine the standard
> output character set for messages.
>
> Signed-off-by: Michael Adam <obnox at samba.org>
> ---
> lib/util/charset/codepoints.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/lib/util/charset/codepoints.c b/lib/util/charset/codepoints.c
> index 71611bf..193af0e 100644
> --- a/lib/util/charset/codepoints.c
> +++ b/lib/util/charset/codepoints.c
> @@ -168,7 +168,7 @@ struct smb_iconv_handle *get_iconv_handle(void)
>  {
>      if (global_iconv_handle == NULL)
>          global_iconv_handle = smb_iconv_handle_reinit(talloc_autofree_context(),
> -                                "ASCII", "UTF-8", "ASCII", true, NULL);
> +                                "ASCII", "UTF-8", "LOCALE", true, NULL);
>      return global_iconv_handle;
>  }
>
> --
> 1.7.1
>
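
The wbinfo failure that the patch above addresses can be reproduced
outside Samba. The following is only a sketch using POSIX iconv(),
setlocale() and nl_langinfo(). It assumes that "LOCALE" ends up
meaning the codeset of the current locale, which is an assumption
about Samba's handling and not something shown in the patch. The
helper names convert() and current_locale_codeset() are hypothetical.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Convert inlen bytes from charset `from` to charset `to`.
 * Returns the number of bytes written to out, or -1 if the charset is
 * unknown, the input is invalid, the output does not fit, or any
 * character had to be substituted. */
long convert(const char *from, const char *to,
             const char *in, size_t inlen, char *out, size_t outcap)
{
    assert(in != NULL && out != NULL);
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;
    char *src = (char *)in;          /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t inleft = inlen, outleft = outcap;
    size_t rc = iconv(cd, &src, &inleft, &dst, &outleft);
    iconv_close(cd);
    /* (size_t)-1 is an error; any other nonzero value counts
     * non-reversible substitutions. Both are treated as failure. */
    if (rc != 0)
        return -1;
    return (long)(outcap - outleft);
}

/* Name of the charset of the current locale (taken from LANG,
 * LC_ALL or LC_CTYPE), for example "UTF-8" or "ISO-8859-1". */
const char *current_locale_codeset(void)
{
    setlocale(LC_CTYPE, "");         /* adopt the environment's locale */
    return nl_langinfo(CODESET);
}
```

A user name such as "J\xC3\xBC" "rgen" (UTF8 bytes) cannot be
converted with "ASCII" as the target, which matches what wbinfo ran
into, whereas passing current_locale_codeset() as the target works
whenever the terminal's locale can represent the name.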