Managing DNs in libads only in utf8

Tue Feb 27 13:29:33 GMT 2007

On Mon, 2007-02-26 at 23:14 -0800, Jeremy Allison wrote:
> On Tue, Feb 27, 2007 at 01:50:23AM -0500, simo wrote:
> > Hello technical people,
> > 
> > after a report about a possible problem with how we manage DNs,
> > I discovered we currently may have some problems in case "unix charset"
> > is not set to UTF-8 and we are using security = ads. *
> > 
> > The problem is that we always convert everything coming out of ldap to
> > the local unix charset and then we convert** it back to utf8 before using
> > it (see ads_get_dn()).
> > 
> > The problem in doing this is that we convert some DN this way:
> > utf8 -> local -> utf8
> > 
> > If the local unix charset is not able to represent one of the characters
> > of the DN, we actually corrupt the DN by doing the double conversion.
> 
> Ok, what kind of things break here ?

It was part of the email: group membership may break with security = ads
if the DN of a user contains non-convertible characters, because we
convert a DN this way: utf8 -> unix charset -> utf8. At the end of the
process origDN != finalDN, and so using finalDN to query AD fails.
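
To make the failure concrete, here is a minimal sketch in Python (not
Samba code). The helper round_trip() and the example DN are hypothetical,
and replacing unconvertible characters with '?' is an assumption made for
illustration; what the real conversion routines do with such characters
may differ, but any lossy substitution has the same effect on the DN.

```python
def round_trip(dn: str, unix_charset: str) -> str:
    """Simulate utf8 -> unix charset -> utf8 for a DN (hypothetical helper)."""
    utf8_bytes = dn.encode("utf-8")  # the DN as it comes out of ldap
    # utf8 -> unix charset; characters the charset cannot represent are
    # replaced with '?' (an assumption for this sketch)
    local_bytes = utf8_bytes.decode("utf-8").encode(unix_charset, errors="replace")
    # unix charset -> utf8 again, as done before the DN is used in a query
    return local_bytes.decode(unix_charset).encode("utf-8").decode("utf-8")


orig_dn = "CN=山田太郎,CN=Users,DC=example,DC=com"  # made-up DN
final_dn = round_trip(orig_dn, "iso-8859-1")

print(final_dn)             # CN=????,CN=Users,DC=example,DC=com
print(orig_dn == final_dn)  # False
```

A DN made only of characters the unix charset can represent survives the
round trip unchanged, which is why the problem goes unnoticed on most
installations.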

> This is *exactly* the same problem that
> people have with filenames/usernames when
> using SJIS or EUC (Japanese character sets)
> as Samba unix charset when mixing with
> Windows clients that might send UTF16 names
> not compatible with SJIS or EUC names the
> server is using.

No, it is not; there is a big difference. The difference is that these
people can't accept utf8 filenames, while DNs are completely internal in
this case and the user never sees them.

> What do people who want these charsets
> do ? They live with it, as the advantage to
> them of having SJIS or EUC on the server outweighs
> the advantage of utf8. They just ensure the
> clients "don't do that".

It's a different situation: admins can have control of their files, but
they do not have control of the directory, and fixing _this_ problem is
possible. Not fixing it just out of fear is imho not correct.

> So before you go down this route I'd like
> a good example of what will unexpectedly
> fail vs. the complexity of internally "remembering"
> some strings are now natively utf8 internally
> rather than "unix charset". Remember you need
> to track this and convert across all boundaries.

I know. If you care to read the mail, you will see I did that: I
checked each and every usage, and made sure nothing leaks this way.

> Right now it's simple - internal -> onto wire means
> convert from unix -> utf8, wire -> internal means
> convert utf8 -> unix. If you blur that boundary
> I think you will break more than you fix.

First of all, for ldap it is NOT true that we convert on the wire
boundary. We convert on the ldap libraries boundary, and we already have
to remember to convert back and forth around these libraries. If you look
at some of the code you will find your own comments and code dealing with
this already (code my patch removes and simplifies, btw).
So given we already have an artificial boundary in the ldap case, I
think that shifting it carefully, and meanwhile improving the code, is
not bad. We have a bug we can easily squash, making users happy without
fundamentally changing the code. I hope you don't start to be as religious
as Linus :) with the kernel wrt POSIX (just yesterday I read a very
depressing 2004 thread with Tridge and Linus about case insensitivity).
At least POSIX is a well defined standard, our internal convention is
just a matter of sanity and we can keep it sane shifting the boundary
just a bit.
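
As a rough illustration of what shifting the boundary means, here is a
minimal Python sketch (not the actual patch). The class Utf8DN and its
methods are hypothetical names invented for this example: the DN stays in
utf8 exactly as the ldap library returned it, and a conversion to the unix
charset happens only when the DN has to be shown to someone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utf8DN:
    """A DN kept as raw utf8 bytes (hypothetical type for illustration)."""
    raw: bytes

    def for_query(self) -> bytes:
        # Used for further ldap queries: no conversion, so nothing is lost.
        return self.raw

    def for_display(self, unix_charset: str) -> str:
        # Converted only for display; lossy output here cannot break a
        # query. The '?' substitution is an assumption of this sketch.
        text = self.raw.decode("utf-8")
        return text.encode(unix_charset, errors="replace").decode(unix_charset)


dn = Utf8DN("CN=山田,DC=example,DC=com".encode("utf-8"))  # made-up DN
print(dn.for_query().decode("utf-8") == "CN=山田,DC=example,DC=com")  # True
print(dn.for_display("ascii"))  # CN=??,DC=example,DC=com
```

The cost is that every place a DN crosses into code expecting the unix
charset has to be checked by hand.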

Simo.

-- 
Simo Sorce
Samba Team GPL Compliance Officer
email: idra at samba.org
http://samba.org