talloc quiz, and: dangling talloc references and inconsistencies in the talloc model

Sam Liddicott sam at liddicott.com
Thu Jan 15 13:45:30 GMT 2009


[tridge: we really need your eye on this; it's possible that talloc is
really badly broken. I've discussed this a lot on irc, and I don't seem
to have missed anything, the only debate is on (a) whether we like it
that way (b) whether samba is too dependant on the current behaviour. If
I do severely embarrass myself, please be kind.

I've also copied other people who I've discussed this with on irc]

talloc is probably the most useful non-gc memory allocator available.
The ability to more than just reference-count but to store the owner of
each reference is brilliant.

Here is a brief quiz on talloc. Try and match up the error conditions
with the talloc_usage and see how surprised you are. Then read on the
analsys of the problem, and the solution and its benefits.

Quiz:

Here are some run-time program errors listed A-C. Match these up with
the psuedo-code blocks listed 1-3 which cause them.

A. talloc_get_name_abort() seems to fail randomly. (or you get segfaults
if you aren't using talloc_get_name_abort)

B. memory leaks. You seem to using talloc properly, but some objects
surprisingly don't get freed even though you are explicitly freeing
them. As a fix, freeing them twice seems to work, but then you sometimes
get talloc_get_name_abort() or other errors.

C. talloc_double_free() happens, even though you got everything right

1.
thing=talloc_zero(owner, ...
save_for_later(thing);
talloc_free(thing);

2.
thing=talloc_zero(owner, ...
save_for_later(thing);
talloc_free(thing);
return;
# owner dies shortly after return

3.
# make thing ready to be stolen by finalizer
thing=talloc_zero(owner, ...
save_for_later(thing); # for monitoring
return;
# owner might die
# and then "later" somewhere else in the code
finalizer_steal_and_use(thing);
# finalizer dies
sneak_final_look(thing); # monitor
talloc_unlink(later, thing)

=== ANSWERS FOLLOW ===








Answers:
A. 2. You are probably thinking that save_for_later() should have taken
a talloc reference. It did. But a dangling reference will still occur if
(and only if) "owner" is freed before the reference is used. It would be
difficult to communicate this between two parts of the program,
especially if one was written as a loadable module long after the other
part was written.

B. 1. This looks a bit like A, but if the owner doesn't die then the
"thing" will remain for ever. Really. if save_for_later(thing) didn't
take a reference then thing would have been free'd by talloc_free() and
then it couldn't use it for later. So the choice is keep for ever or
keep for never. (But only if the original owner will be long lived).

C. 3. This could also cause talloc_get_name_abort() failures if the
memory was re-allocated in the meantime. It seems quite legitimate to
keep a reference (in save_for_later) so that you can take a peak at an
object after it would have been destroyed. And so it is. UNLESS
talloc_steal is used AFTER the original parent died. It results in the
reference becoming the new parent and the reference is then stolen from
and not kept when finalizer dies. (You can use talloc_steal BEFORE the
original parent dies. But how are you going to tell that?)

Note that none of the errors are caused by obvious mis-use of the
published API or model, but rather un-written forbidden combinations of
talloc calls; which actually require a run-time mismatch in the number
of ref/unref operations which are in fact exactly paired. It would be
much better to require there NOT to be a ref/unref mismatch, never mind
a runtime-dependant one!


Scoring:

3: You either wrote talloc or learned the hard way
2: You are a lucky guesser
1. So are you
0. You thought the API behaved how it looks. You've started on the path
of the hard way. Learn quickly young grasshopper, or despair will
prevent you from mastering talloc.

We see that talloc presents a nice clean api, but it sadly doesn't
behave as one would expect.  I think there are some slight but
significant problems with the way references have been implemented which
result in some serious inconsistencies in the talloc model.

I realise that talloc has grown with samba which may depend on some of
the inconsistencies. I hope my suggestions can be judged on their own
merits with respect to an ideal stand-alone talloc library as well as in
the context of how samba works, and also in the need for many users of
talloc to derive a set of "be safe" rules which sadly means avoiding
much of the power of talloc.

My suggestions also greatly simplify and extend the
talloc-free-loop-detection code which tries to avoid leaks due to
reference loops.


Analysis
========

1. Inconsistent talloc_free

The first inconsistency relates to the freeing of allocations that have
a reference.

Freeing the parent of a referenced allocation should have the same
effect on the allocation as freeing the referenced allocation directly

Currently, if a referenced allocation is freed because it's parent is
freed, then the first reference becomes the new owner.
Inconsistently, if a referenced allocation is freed explicitly, then the
first non-child reference is removed and the owner remains.

Consequences

1a Confusion

It seems confusing for a talloc user to be aware of these two different
ways in which something may get destroyed, having un-intuitive and 
different effects on the lifespan of the allocation.

1b dangling talloc references

If free is called explicitly then the allocation will only live as long
as the original parent despite the fact that another object took a
reference specifically to make it live longer. This causes dangling
references (if free is called implicitly because the parent died, then
the allocation will live as long as the object that took the reference
as would normally be expected). This problem may be worse than it seems
for those who use talloc_free repeatedly on the same allocation. (see
later on)

1b Memory leaks

When an explicit free occurs on a referenced allocation the owner
remains. If the owner is long lived then the allocation will hang
around, even when the reference is released, perhaps until all memory is
consumed.

Safe coding rule: Never talloc_reference something if it could be
explicitly freed.
Safe coding rule: Never use an explicit free on structs, but rather use
a temporary context which you can explicitly free
Safe coding rule: If you are not sure, don't use talloc_reference

These safe coding rules break the principle of encapsulation as it
requires you to know what the caller is going to do with the allocation.

Metze has propsed a patch for this behaviour so that reference promotion
to owner happens consistently.

This leads on to the main point

2 Inconsistent treatment of references

The first inconsistency was the inconsistent promotion of references to
owner; the second inconsistency is in the very idea of promoting a
reference to become owner, as happens when talloc_free is called on an
allocation that is referenced (tc->refs != NULL), or when talloc_unlink
does a similar thing.

Inconsistent treatment of references occurs because references which are
promoted to owner are no longer treated as references, the advantages of
reference go away without warning with far reaching side-effects,
including a mismatch between the number of referencing operations and
the number of freeing operations. (This is hidden mostly as
talloc_unlink against a non-owner non-reference ctx silently does
nothing, but has other harder to spot side effects like dangling references)

When a talloc_reference is taken, the referencer has no expectation of
being an owner; or to put it another way, all code that uses talloc
suddenly becomes a lot more complicated if reference takers have to also
be coded to cope with becoming owners without warning- should it they
keep testing: me == talloc_parent(thing) and then take another reference
and talloc_steal to NULL first?

This promotion is both unnecessary and IMHO wrong, as it leads to
difficult to predict or manage consequences.

Consequences

2a. dangling references

An reference could become an owner without notice, is then subject to
removal by talloc_steal and by implicit talloc_free if the parent is
freed (and also by an explicit talloc_free if talloc_free is improved as
metze suggests). This can then result in a reduced lifetime, resulting
in a dangling talloc_reference - difficult to detect and almost
impossible to track down as it depends on the precise
reference-free-steal order being duplicated, which may happen rarely in
a networked server like samba.

Safe coding rule: Magically learn to tell if the current owner was once
a reference and write more complicated code to cope with it.
Safe coding rule: Don't use talloc_steal on things whose original owner
might have been free'd.
Safe coding rule: Or don't use both talloc steal and talloc reference on
the same kinds of structs anywhere within the same application or
loadable modules. And keep a register of which structs you will
reference and which structs you will steal.

2b dangerous coding practices

It further results in a habit of using talloc_free to remove references
as each reference becomes the next owner. This practice is often
dangerous as it is impossible to predict which reference is the new
owner to be freed.
Which reference becomes the new owner is dependant on the order in which
references are taken - something which should not be depended upon in a
system with loadable modules, or dependant on network timing - or at all!

Safe coding rule: Please don't use talloc_free more than once on an
allocation, and only then if you haven't given it to be stolen

I hope there is no code in the wild that depends on talloc_free to
remove a reference. If there is it could break untraceably, merely by
someone else taking a reference at the wrong time. What a way to die!


Review
======

An implementation of a POSIX API may be cumbered by such caveats, but a
clean fresh interface like talloc should behave in a straight-forward
fashion without strange behind the scenes interactions between
operations that ought to be simple with zero unexpected side effects.

I think this is enough to demonstrate that the promotion of references
to owner is freaky, but some readers might want a sort of real life
example of mixing talloc_steal and talloc_reference:.

Example:
A queue takes references to objects which it later passes to an API
which uses talloc steal. The donor of the objects expected talloc_steal
to be called. The queue only takes references. If the donor dies, then
the queue becomes the owner and is evicted (without warning) at
talloc_steal, leaving the queue with a dangling pointer. The problem
with this of course is not that talloc_steal does in fact steal, but
that it becomes impossible to always predict who it is stealing from,
and it depends on actions that might happen AFTER the queue takes the
reference.


Solutions
=========

1 Don't do that (for a complex definition of "that")

For those who would really rather keep coding around this talloc
problem, consider that the example queue is a resource manager so that
too many requests don't get transmitted at once, and that the donor may
or may not be able to hang around until all requests are complete. It
could be coded around but to code around would be to embrace the flaw
instead of the beauty.

The code could, of course, be written a different way, and in fact some
developers have privately constructed a series of implicit rules that
are suitable to their own use of talloc to help avoid such problems;
rules that have been suggested are along the lines of:

* don't use talloc_reference, it's evil
* don't use talloc_steal unless you know where your memory has come from
* don't call things that use talloc_steal unless you want your memory to
be stolen
* don't use talloc_steal, use talloc_reference
* don't use talloc_reference if you use talloc_steal

Sadly these are not very useful rules, partly because they are a big
secret discovered individually through much pain, and also because the
more comprehensive safety rules are hard to verify against as they
involve tracking all possible talloc operations on all possible sources
of the structs being worked on, whether these operations may happen
before or after the allocation is processed.

The more comprehensive safety rules are also defined in terms of tallocs
own limitation and can better be reduced to "don't do things that cause
unexpected dangling talloc references or memory leaks" - true, but
unhelpful, and mistakes can't always be detected and even then, only
long after the mischief occurred.

The only workable rules are the more general rules which generally come
down to denying the richness of talloc by refusing to mix talloc
ownership with talloc references.

2. ...or fix talloc

Ironically the problem (resulting in the rule "don't mix ownership and
references") arises precisely because talloc itself mixes the idea of
ownership with references. If talloc did not promote a reference to
become an owner then it would be entirely safe for talloc_steal,
talloc_free, talloc_reference, talloc_unreference and talloc_unlink to
be mixed without danger.

The solution here is that references should NEVER be promoted to owner,
and that concepts of referencing and ownership should be kept separate,
and also therefore *predictable* and simple to understand and not in any
way dependant on the unpredictable order in which blocks of code in an
asynchronous event driven server are executed.

The only way to remove a reference should be with talloc_unreference (or
talloc_unlink which calls it), which of course happens automatically if
the referencer is freed.

Reference promotion offers absolutely zero public benefit (except to
people who dangerously want to use talloc_free to remove what they hope
is the right reference - and have no way of verifying not even for an
ASSERT).

Reference promotion has some slight side effect used by the partial loop
detection at the top of talloc_free, but the fix suggested makes loop
detection simpler and more effective.

Solution analysis
=================

New behaviour for talloc_free:
* If there are no references it behaves as normal.
* If there are references or if the destructor returns -1, then it
receives a new owner: no_owner_context instead of messing around to try
and find a random new owner
? If the current owner was no_owner_context maybe it should just raise
talloc_double_free ?

New behaviour for talloc_reference_destructor:
* If the last reference has been removed AND the current owner is
no_owner_context then free is called

New behaviour for talloc_steal
* none, really! You can talloc_steal something owned by no_owner_context
and then free it again, and it still hangs around if there are references.

And so we suddenly have a more sane version of talloc
* Predictable talloc_free behaviour - you finally know when and how
something will be destroyed
* No more calling talloc_free dozens of times in the hopes that you know
which reference you are destroying
* No more accidentally calling talloc_free too often and not getting an
error, but later wondering where your references went
* You can take a reference to an allocation and then pass it anywhere,
safe in the knowledge that it will still be there when you need it, no
matter what else has happened to it.
* No more assigning un-owned things to null_context, but instead to the
more explicit no_owner_context
* You don't have to worry if references have been taken before calling
talloc_free
* You can safely call talloc_steal if the original owner has been unlinked

In other words what's the benefit of keeping this weird behaviour that
solely makes traps for the unwary and are too complicated to properly
document, when a simple fix makes the API behave as expected and
improves loop detection?

Old loop detection

Removing a child reference when trying to free an object, when the child
itself is also referenced, will result in the child outliving the object
and having a dangling pointer due to the reference being dropped.
(Current behaviour, see test case test_dangling_loop).

idra also has concerns about loop detection.

Improved loop detection.

I have not written this improved loop detection.

Loop detection may easily be improved because the no_owner_context makes
it easy to detect things that are only kept alive because of references.

A node can be deleted if:
* it is descended from no_owner_context
* so are all of it's references
a test can be called from talloc_reference_destructor and talloc_free
(with tc->refs) to check for this.

Some thought may need to be given to the order of destruction, given
that some of the destructors might call talloc_unlink which might refuse.
Probably a node which has the longest ancestry to no_owner_context
should be deleted first; and of nodes with the same ancestry size, the
node with the least number of references.

Patches
=======

Patches to test for the new behaviour and fix the old tests to match the
new behaviour are sent to the samba technical mailing list

They serve mainly as a talking point for now, because no doubt samba
depends on calling double_free in some cases where it thinks it knows
what the reference is.

Sam


More information about the samba-technical mailing list