CH_DISPLAY and gettext

Thu Jun 23 11:16:01 MDT 2011

From: Michael Adam <obnox at samba.org>
Date: Thu, 23 Jun 2011 15:04:27 +0200

> I have some points of criticism with CH_UNIX used as charset to
> internally store strings (file names, user names, etc) in memory
> as well as in databases. I am sure that there have been very good
> reasons for introducing CH_UNIX as internal encoding in the past,
> but I am questioning this anyway:
> 
> 1) This loses information too early!
>    The mapping Unicode --> CH_UNIX is potentially lossy.
>    E.g. if I use ASCII or some latin/iso charset, then some characters
>    will not be displayable. Maybe even unmarshalling will fail
>    so users will not be available, depending on the value of CH_UNIX.
> 
> 2) Storing our internal databases (s3 eg: group mapping, passdb)
>    in CH_UNIX is a very bad thing: This encoding might be changed
>    by the administrators and the databases are not converted
>    automatically. Neither is the file system but there is convmv
>    for this. But for the internal DBs there is not even a
>    conversion tool. I have to check which other databases are
>    stored in which encoding, especially samba4.
> 
>    I have had to do quite cumbersome manual db repairs due to
>    this problem more than once already. This was really bad!
> 
> In order to fix #2, there are two options:
> 
> a) Change the dbs (individually) to convert from internal
>    representation to UTF8 (or UTF16 maybe), before storing.
> 
> b) Change samba to internally store everything in UTF8
>    and then write out the DBs unchanged.
>    For every target that needs a special encoding (like
>    the file system needing CH_UNIX), we'd then need to convert
>    before accessing the target (like I detailed in my
>    previous emails).
> 
> In either case we also need an encoding conversion tool for each
> such database, since afaik we can not reliably autodetect
> the encoding of the stored data.
> 
> In order to fix #1 though, option (b) is the only possible way.
> 
> So my wish would be to convert all of samba to use UTF8
> internally (I'd be ready to discuss a different unicode
> charset like UTF16), and convert to CH_UNIX for the necessary
> communication interfaces with the outside.
> 
> I hope this makes my argument a little clearer.
> 
> Cheers - Michael

That's what I (and my friends) argued for several years ago:
  http://lists.samba.org/archive/samba-technical/2004-March/034638.html
  http://lists.samba.org/archive/samba-technical/2004-March/034742.html
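
To make point 1 of the quoted mail concrete, here is a minimal sketch in
Python (not Samba code) of how converting Unicode to a narrow charset can
fail or lose information. The function name to_unix_charset and the sample
names are made up for illustration, and iso-8859-1 merely stands in for one
possible CH_UNIX setting.

```python
def to_unix_charset(name: str, unix_charset: str) -> bytes:
    """Hypothetical boundary conversion: Unicode text -> bytes in unix_charset.

    Raises UnicodeEncodeError when a character has no mapping, rather
    than silently replacing it.
    """
    return name.encode(unix_charset)  # "strict" error handling is the default


german = "J\u00fcrgen"       # "Jürgen": representable in ISO-8859-1
japanese = "\u5c71\u7530"    # "山田": has no ISO-8859-1 mapping

latin1_ok = to_unix_charset(german, "iso-8859-1")
utf8_ok = to_unix_charset(japanese, "utf-8")  # UTF-8 can encode any Unicode text

# Strict conversion to the narrow charset fails outright.
try:
    to_unix_charset(japanese, "iso-8859-1")
    lossy_failed = False
except UnicodeEncodeError:
    lossy_failed = True

# With a replacement policy the conversion "succeeds", but the original
# name cannot be recovered from the stored bytes.
replaced = japanese.encode("iso-8859-1", errors="replace")
round_trip = replaced.decode("iso-8859-1")
```

If strings are kept in a Unicode encoding internally, a conversion like this
only has to happen at the boundary that really needs CH_UNIX, and a failure
there does not damage the stored data.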

The internal charset should be fixed. UTF-8 is acceptable but UTF-16 may
be better because UTF-16 is more suitable for string manipulation than
UTF-8.
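
A small sketch of that trade-off, again in Python and purely illustrative
(the helper code_units is hypothetical): UTF-16 uses one 16-bit unit for
every character in the Basic Multilingual Plane, while UTF-8 needs one to
three bytes for the same characters. Characters outside the BMP take two
UTF-16 units (a surrogate pair), so UTF-16 is not strictly fixed-width
either.

```python
def code_units(text: str, encoding: str) -> int:
    """Hypothetical helper: how many code units text occupies in encoding."""
    unit_size = {"utf-8": 1, "utf-16-le": 2}[encoding]  # bytes per code unit
    return len(text.encode(encoding)) // unit_size


samples = {
    "ascii": "a",
    "latin": "\u00fc",         # ü
    "kanji": "\u5c71",         # 山
    "non_bmp": "\U0001f600",   # an emoji outside the BMP
}

# Maps each sample to (UTF-8 code units, UTF-16 code units).
counts = {
    label: (code_units(ch, "utf-8"), code_units(ch, "utf-16-le"))
    for label, ch in samples.items()
}
```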

---
TAKAHASHI Motonobu <monyo at monyo.com> / @damemonyo
  http://damedame.monyo.com/ / http://facebook.com/monyot