the sorry saga of the talloc soname 'fix'

simo idra at samba.org
Tue Jul 7 03:45:32 GMT 2009


[long mail ahead, take your time]

On Mon, 2009-07-06 at 20:21 +1000, tridge at samba.org wrote: 
> Hi Simo,
> 
> I've now spent all day looking at libtalloc and how it interacts with
> what is currently in Ubuntu Jaunty. I have downloaded a Fedora image
> but haven't yet installed it to see if Fedora is as badly placed as
> Ubuntu is.
> 
> The result of my investigation is that libtalloc is a complete mess.

Nope, it's not libtalloc that is a mess; it's the bugs in the samba4
build dependencies that are a mess. Standalone libtalloc is fine.

For the record I found that the build problem is present also in the
F-11 samba4-libs package, I am working on fixing the problem in Fedora
too, and will push fixed libraries asap.

Also for the record, I pointed out these build problems no less than 2
SambaXPs ago when I found the horrible mess with the event library
globals. The mess was caused by the fact that libevent was statically
built multiple times within samba4 itself, so that 2 parts of the same
binary were actually using 2 different sets of libevent symbols. That
duplicated globals, which were then no longer global, and made debugging
almost impossible.

Then a year ago, when we started working on providing libraries for
openchange, I pointed out to Jelmer that libdcerpc & co. again had build
problems, as they were including static copies of talloc, tdb, tevent
and ldb (and in some cases not all symbols were statically built in;
also, they were not made static to the library, so the library was
exposing them as public symbols).

This should have been fixed but apparently there were still some build
issues with alpha7.

> It turns out that with current Ubuntu we cannot completely avoid
> having both the old talloc and the new talloc in the same process at
> the same time. However, if we bump the .so number then at least the
> developers will get a warning about the mess.
> 
> I've put together some files for you to look at to give you some idea
> of just how bad this whole mess is. See 
> 
>   http://samba.org/tridge/talloc_mess/
> 
> The files are:
> 
>  - a libtalloc 1.2.0 (dev and lib) matching what is currently in Ubuntu
>    Jaunty
> 
>  - a libtalloc 1.4.0 (dev and lib) matching what is produced if we
>    followed your suggested course of action
> 
>  - a libtalloc 2.0.0 (dev and lib) matching what is produced if we
>    follow my preferred choice of using a new .so number
> 
>  - source tar balls with debian build rules for all of the above
> 
>  - a sample 'testtalloc' package that demonstrates the problems (deb
>    plus source)
> 
> The testtalloc package produces two binaries. One is called test_ldb,
> and it creates a ldb then tries to free it with talloc_free() which is
> about as simple a ldb program as you can have. The other is called
> test_mapi which initialises the MAPI subsystem from openchange then
> uses talloc_report_full() to show the memory that has been used.
> 
> I chose these two binaries as they demonstrate different types of
> brokenness in the way that talloc/ldb/mapi/samba/openchange etc have
> all been packaged. For example:
> 
>   - The libldb-samba4-0 package provides a libldb.so.0 which has a
>     built in static copy of talloc.
> 
>   - the libmapi.so package links to a dynamic libtalloc.so, but also
>     links to libdcerpc.so
> 
>   - libdcerpc.so has a statically linked talloc built in
> 
>   - etc etc
> 
> The same type of brokenness is rife through all the various packages
> that use talloc currently.

I am sorry you wasted time building this testtalloc binary, as there is
no need to demonstrate that including talloc statically in another
library is broken. It is evidently broken; nobody needs proof of that.

The real problem I see is that you don't recognize that the soname
version is totally orthogonal to this problem.

> If we used the approach you are advocating, then all of these packages
> (ldb, openchange, mapi, samba etc) won't be marked as needing to be
> rebuilt. Yet they will all abort with no error message when you
> actually use them, because they will mix the two incompatible
> ABIs. Try the test_ldb and test_mapi binaries to see the abort.

Sorry Tridge,
but so far you have strongly advocated the soname bump because that way
you could install both versions of the library.

Having 2 versions of the library in the same process is exactly the same
kind of brokenness as having a static copy of talloc in a library that
lives in a process that also loads a dynamic version of the library.

> If we use the approach that I prefer, which is to change the .so
> number to 2, then at least the developers get a nice warning like
> this:
> 
>   /usr/bin/ld: warning: libtalloc.so.1, needed by /usr/lib/gcc/x86_64-linux-gnu/4.3.3/../../../../lib/libmapi.so, may conflict with libtalloc.so.2
> 
> So at least someone gets told that it won't work at build time, which
> gives some hope that it might get fixed.

This will happen only if talloc.so.1 is not available on the system. So
far you have advocated having both as a solution to avoid rebuilding all
packages that depend on talloc.

But if you remove talloc.so.1 you will have to either remove all
dependencies, or rebuild all dependencies against the new library.

> If we up the .so number to 2 then you can also see the brokenness by
> looking at the dependencies, because we are explicitly marking the ABI
> as having changed. It is easy to see the brokenness using ldd, or by
> using dpkg. 

No, for libraries that compiled talloc in statically, ldd will tell you
nothing. As for dpkg, you may be lucky if someone explicitly marked
libtalloc as a dependency. But then it depends on how it was done.

Normally manual dependencies are of the form: libtalloc >= 1.2.0

This will not trigger any check in dpkg if you want to install libtalloc
= 2.0.0, as 2.0.0 is > 1.2.0
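
To illustrate the point about lower-bound dependencies, here is a
minimal sketch in C. The struct and both functions are hypothetical and
far simpler than dpkg's real version comparison; they only show that a
">=" constraint cannot exclude a newer major version.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical three-part version; real dpkg versions are richer. */
struct version {
    int major;
    int minor;
    int patch;
};

/* Returns <0, 0 or >0, like strcmp(), comparing field by field. */
int version_cmp(struct version a, struct version b)
{
    /* Negative components make no sense in this toy model. */
    assert(a.major >= 0 && a.minor >= 0 && a.patch >= 0);
    assert(b.major >= 0 && b.minor >= 0 && b.patch >= 0);

    if (a.major != b.major) return a.major < b.major ? -1 : 1;
    if (a.minor != b.minor) return a.minor < b.minor ? -1 : 1;
    if (a.patch != b.patch) return a.patch < b.patch ? -1 : 1;
    return 0;
}

/* A "libtalloc >= min" style dependency only sets a lower bound. */
bool satisfies_at_least(struct version installed, struct version min)
{
    return version_cmp(installed, min) >= 0;
}
```

With min = 1.2.0, both 1.4.0 and 2.0.0 satisfy the check, so nothing in
such a dependency flags the major version change.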

> If we don't do this then we're saying "the ABI is the same" when it
> isn't. This is clearly shown by the abort in the test programs above,
> regardless of whether you install the 1.4.0 libtalloc or the 2.0.0
> libtalloc. 

No, the abort above doesn't say anything about the ABI. It just tells
you that building talloc statically into those libraries is completely
broken as it is. Whether the dynamic version of talloc is called 1.4.0
or 2.0.0 makes no difference for that kind of brokenness; it's an
orthogonal problem.

> So even with your attempts to make the ABIs more similar by putting
> backward compatibility code into talloc.c we get aborts because the
> internal structures are not compatible (which is nicely caught by
> Metze's patch).

This happens only for the already broken libraries. For sane
binaries/libraries there is no problem at all. Try it yourself: build
against libtalloc 1.2.0 and then install libtalloc 1.4.0. The
application will be just fine because the ABI *is* compatible.

>   Your attempts to make the ABIs compatible are not
> enough, and would pollute the code with a lot of cruft that serves no
> purpose, plus it will remove the warnings that developers that would
> otherwise get when things are going to go wrong with some of the
> libraries.

No, you are mixing ABI problems with broken-library problems.
If you mix these two things you can argue for any solution, but they
will all be equally wrong: those libraries are simply broken.

If you want to argue about what soname version to use you have to use a
non-broken system. In a non-broken system what you have is that all
those libraries depend on libtalloc.so.1 and they do not statically link
copies of talloc.


Please pay attention to the following specific example, because it
explains my perspective on why, on a system with non-broken libraries, a
soname bump requires rebuilding all packages and is more of a problem
for *developers* or people who build their own packages from sources
than for package users.

When you suggested the soname bump you said multiple times that you
wanted to do so because this way you could install both libtalloc.so.1
and libtalloc.so.2 at the same time. That is when I got extremely
worried about all this business and is part of the reason why I proposed
the patch to keep the ABI compatible and not bump the soname.
The reason is quite simple if you think about the following scenarios,
where you have a non-broken system.

---
Situation A)
Talloc with soname = 2.0.0:

Assume we fix a minor bug in ldb and release a new version. A developer
fetches the new ldb and finds out it now requires talloc 2.0.0. He
happily builds and installs talloc.so.2, a new tevent and the new ldb
with fixes.

All builds fine and libldb now depends on libtalloc.so.2

And here comes the problem. In a non-broken system libdcerpc will tell
the dynamic linker it needs libtalloc.so.1 and libldb.so.
When the loader loads both, you will end up with libtalloc.so.1 and
libtalloc.so.2 (via libldb.so) in the same process. Aborts will be
everywhere, and the developer will not have seen anything at build time
or at program start-up time, because all dependencies are fine.

---
Situation B)
Talloc with soname = 1.4.0:

Assume we fix a minor bug in ldb and release a new version. A developer
fetches the new ldb and finds out it now requires talloc 1.4.0 or the
build fails. He happily builds and installs talloc.so.1 (NOTE: this
overwrites the original talloc library), a new tevent and the new ldb
with fixes.

All builds fine and libldb still depends on libtalloc.so.1

In a non-broken system libdcerpc will tell the dynamic linker it needs
libtalloc.so.1 and libldb.so. When the loader loads both, you will end
up with only the new libtalloc.so.1 library. No problems, no aborts of
any sort, all just works as expected. The only issue you may have is
that the old libdcerpc may leak some memory when using the old
interfaces. This is no worse than before, it is actually exactly the
same behavior as before, and it will be fixed as soon as libdcerpc is
upgraded (as it will be built against the new talloc).
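
The two situations can be compared with a small sketch in C that walks
a toy dependency table and counts how many talloc sonames end up in one
process. Everything here is an assumption made for illustration: the
"app" binary, the ".so.0" suffixes on libdcerpc and libldb, and the
helper names. It models only the set of libraries loaded, not the real
loader's search order or symbol resolution.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NEEDED 4
#define MAX_LOADED 16

/* Toy stand-in for a shared object and its DT_NEEDED entries. */
struct lib {
    const char *soname;
    const char *needed[MAX_NEEDED];   /* unused slots stay NULL */
};

/* Situation A: the rebuilt libldb now asks for libtalloc.so.2. */
const struct lib situation_a[] = {
    { "app",            { "libdcerpc.so.0" } },
    { "libdcerpc.so.0", { "libtalloc.so.1", "libldb.so.0" } },
    { "libldb.so.0",    { "libtalloc.so.2" } },
};

/* Situation B: the rebuilt libldb still asks for libtalloc.so.1. */
const struct lib situation_b[] = {
    { "app",            { "libdcerpc.so.0" } },
    { "libdcerpc.so.0", { "libtalloc.so.1", "libldb.so.0" } },
    { "libldb.so.0",    { "libtalloc.so.1" } },
};

const struct lib *find_lib(const struct lib *libs, size_t nlibs,
                           const char *soname)
{
    for (size_t i = 0; i < nlibs; i++) {
        if (strcmp(libs[i].soname, soname) == 0)
            return &libs[i];
    }
    return NULL;   /* not in the table: treated as a leaf */
}

/* Record soname and everything it needs, each soname at most once. */
void load_closure(const struct lib *libs, size_t nlibs, const char *soname,
                  const char **loaded, size_t *nloaded)
{
    for (size_t i = 0; i < *nloaded; i++) {
        if (strcmp(loaded[i], soname) == 0)
            return;   /* already in the process */
    }
    assert(*nloaded < MAX_LOADED);
    loaded[(*nloaded)++] = soname;

    const struct lib *l = find_lib(libs, nlibs, soname);
    if (l == NULL)
        return;
    for (size_t i = 0; i < MAX_NEEDED && l->needed[i] != NULL; i++)
        load_closure(libs, nlibs, l->needed[i], loaded, nloaded);
}

/* How many loaded sonames start with prefix, e.g. "libtalloc.so."? */
size_t count_loaded(const struct lib *libs, size_t nlibs,
                    const char *root, const char *prefix)
{
    const char *loaded[MAX_LOADED];
    size_t nloaded = 0, hits = 0;

    load_closure(libs, nlibs, root, loaded, &nloaded);
    for (size_t i = 0; i < nloaded; i++) {
        if (strncmp(loaded[i], prefix, strlen(prefix)) == 0)
            hits++;
    }
    return hits;
}
```

Calling count_loaded(situation_a, 3, "app", "libtalloc.so.") finds two
talloc libraries in the process, while the same call on situation_b
finds one.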


----

I hope you see the striking difference between bumping and non-bumping
in a non-broken system.

What I want to know is if you understand what it means to have a library
in 2 versions when it is included in so many other libraries.

Do you think that situation A is better or worse than situation B?

Whether 1.4.0 or 2.0.0 is right or wrong ultimately depends on which of
the 2 situations above we agree is better or worse.

Obviously I think A is much worse than B, but we can discuss the 2
scenarios and come to an agreement.

If we keep the discussion on technical grounds and do not accuse people
of plotting, lying or whatever, and we avoid petty flames about who owns
some piece of code, I think we will use our time in a much more
productive way.


> So Simo, please look at the above examples, then please revert your
> commit. Also, in future, please don't revert a maintainers commits
> without checking with the maintainer.
> 
> Also, Metze, you were right, your abort() check on version really is
> needed, and really does happen with real examples. Thanks!

Yes, we should probably even think of automatically changing the magic
at every build, or at the very least at every release (it might include
the version number).
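
A rough sketch in C of what a version-derived magic could look like
follows. This is not talloc's actual code: MAGIC_BASE, the header
layout and all function names are made up for illustration, and the
check returns false where a real library would abort().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical base value; the low 12 bits are left for the version. */
#define MAGIC_BASE 0xA110C000u

/* Fold the release version into the magic so each release differs. */
uint32_t make_magic(unsigned major, unsigned minor)
{
    assert(major < 16 && minor < 256);
    return MAGIC_BASE | (major << 8) | minor;
}

/* Hypothetical per-allocation header placed in front of user memory. */
struct chunk_header {
    uint32_t magic;
    size_t size;
};

/* Allocate size bytes, stamped with the magic of the calling copy. */
void *chunk_alloc(uint32_t magic, size_t size)
{
    struct chunk_header *h = malloc(sizeof(*h) + size);
    if (h == NULL)
        return NULL;
    h->magic = magic;
    h->size = size;
    return h + 1;
}

/*
 * Free ptr only if it was stamped by the same copy of the library.
 * Returns false on a mismatch and leaves the chunk untouched.
 */
bool chunk_free(uint32_t magic, void *ptr)
{
    struct chunk_header *h = (struct chunk_header *)ptr - 1;
    if (h->magic != magic)
        return false;
    h->magic = 0;   /* poison so a double free is also caught */
    free(h);
    return true;
}
```

A chunk allocated with make_magic(1, 2) is refused by a copy that checks
against make_magic(2, 0), which is the kind of mix-up that shows up when
a static copy and a dynamic copy of the library share one process.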

> To prevent this happening in future we have to stop mixing statically
> linked libraries with shared versions of the same libs. That will mean
> a lot of changes to the way that lots of libs are produced by the
> Samba project and how they are linked into projects like openchange.

I have been advocating these changes for a long time, starting more
than 2 years ago, for exactly these reasons. Glad that someone else
finally realizes this, it only took 2 years ...

> I hope I don't have to spend another day like today tracing shared
> library problems. As I have said several times previous when proposals
> of Samba shared libs come up, getting shared libs right is really
> _really_ hard. We have come nowhere near to getting it right yet, and
> the work required to get it right is quite substantial. I'm not
> volunteering to do the work.

To be honest, no, getting shared libs right is not really hard. It only
requires a bit of care for the specific needs of shared libraries, and
clear code dependencies. The problem with samba is that we have quite a
bit of spaghetti dependencies, and the only way to make useful libraries
is to untangle some of this code. I've been working slowly with Andrew
to try to unravel some of that. Unfortunately I've been busy in the last
year or so, so my progress in this area has been very slow.

For 2 years I've been working with (mostly asking) Jelmer and Metze to
help me fix the samba4 build system so that code wouldn't be statically
linked into libraries, or even (horror) linked multiple times within the
samba4 binaries themselves. Unfortunately the build system is so
complicated that only Jelmer and Metze seem to understand how it works,
so any fix depends on them getting involved, and potentially on the
person who wrote and understands some deep-down code providing hooks or
breaking some internal dependency. It's long and tedious work, but not
conceptually hard.


The other issue, also not very hard but important, is that basic
libraries should try to avoid breaking the ABI as much as possible,
because a soname bump is not a solution if different inter-dependent
libraries can be rebuilt at different times. Any incompatible change of
a basic library requires rebuilding all packages linking to the previous
version, and an upgrade of the library. This is a real issue when the
number of libraries and packages dependent on a basic library starts
growing.

Talloc, being a memory allocation library, is one such basic library
(tdb, ldb and tevent are too, to a lesser extent). That is why I
strongly believe we should do all we can to keep the ABI stable.
If we are not willing to make the promise that we will do *all* we
possibly can to keep library ABIs stable, we are going to cause a lot
of problems for all users down the road.
Unfortunately we do not have much choice, we encouraged other people to
use these libraries. Openchange is one of the projects that totally
depends on these libraries, and any change in API/ABI is going to
negatively impact them. I am using them in sssd. Other people have said
or expressed the desire to use them.

So, where do we stand?
Can we take some responsibility not to break our users unless really
necessary ? 


Simo.

-- 
Simo Sorce
Samba Team GPL Compliance Officer <simo at samba.org>
Principal Software Engineer at Red Hat, Inc. <simo at redhat.com>


