the sorry saga of the talloc soname 'fix'

tridge at samba.org tridge at samba.org
Mon Jul 6 11:22:55 GMT 2009


Hi Volker,

 > The changes to talloc you made are:
 > 
 > a) Add talloc_reparent
 > 
 > b) Print error messages to stderr if talloc_[steal|free] is 
 >    used when references still exist.

It isn't just printing error messages. It also refuses to do ambiguous
operations. So if you do this:

  int *p1 = talloc(NULL, int);
  int *p2 = talloc(NULL, int)
  int *c1 = talloc(p1, int);
  int *r1 = talloc_reference(p2, c1);
  assert(r1 == c1);

then at the end, one piece of code might do:

  talloc_free(r1);

and another bit of code might do:

  talloc_free(c1);

the problem is that c1==r1, so talloc doesn't know which one you
mean. If you free via one of the parents (p1 or p2) then there is no
ambiguity. If you free via a child with references then there is no
way to know what the programmer intended. We can't look up the program
stack and say "ahh, they are in the module that did the reference,
they must mean to reduce the ref count". There is just no sane way to
do that in C.

At first I just added the warning, but kept the old behaviour, which
was to pick one of the two ambiguous choices. In the case of
talloc_free() that was to free the "most recent parent" (or close
enough, the details of "most recent" are complicated by a steal). In
the case of steal it was to "always steal from the 'real' parent". 

Once Andrew and I started to look at where these warnings cropped up
though, it became very clear that they were in fact real bugs. Some
were memory leaks. Some just assumed the opposite behaviour of what
was actually done. None of them were clearly 'correct'. So the only
sane choice was to change the API.

So we now refuse the talloc_free(), and print a warning giving both
the line of code where the talloc_free() happened, and all the places
that did talloc_reference() of the pointer. That makes it easy to find
the problem. The usual fix is to then use talloc_unlink() which allows
you to specify which of the two possibilities you wanted.

The same applies to talloc_steal().  I added talloc_reparent() as it
allows you to solve the problem in much the same way that
talloc_unlink() solves the problem for talloc_free(). A
talloc_reparent() allows you to do a "steal" like operation, but
specifying which parent you want to steal from so there is no
ambiguity.

If you try to do an ambiguous talloc_steal() then it will be refused,
and will return NULL. In all likelyhood existing code will crash, as
we often do:

  a->p = talloc_steal(x, y);

and then don't check the return. It isn't ideal that code like this
will now crash, but at least there will be a clear warning printed
first saying what the problem is.

We still need a better way than just using stderr for these
warnings. The patch adding #ifdef DEVELOPER is not the right fix, as
these bugs can occur at runtime. We need something like:

  talloc_set_log_fn(samba_talloc_log);

so the application offers a clear function to use for logging talloc
warnings.

All of this is both an API change (although I would claim only for
code that was already broken!) and an ABI change. Thus we need a new
.so number.

Cheers, Tridge


More information about the samba-technical mailing list