[PATCH] tevent and threads - infrastructure improvements - version #2

Tue Sep 15 16:32:55 UTC 2015

On Mon, Sep 14, 2015 at 07:56:02PM -0700, Jeremy Allison wrote:
> On Mon, Sep 14, 2015 at 05:57:28PM -0700, Jeremy Allison wrote:
> > 
> > OK. Nailed it. The following patchset consistently
> > passes valgrind with drd and helgrind with no error
> > messages. The previous patchset *sometimes* (1 time
> > in 10 or so :-) gave errors with helgrind.
> > 
> > The issue in the test is when the sub-thread deletes
> > it's local event context helgrind was detecting a
> > data race inside the pthread_mutex_unlock() inside
> > tevent_thread_proxy_schedule() called from
> > master_callback(). The clash was with the
> > pthread_mutex_destroy() inside the destructor
> > for struct tevent_thread_proxy.
> > 
> > Changing the test to return the child thread
> > event context back to the parent thread, and
> > have the parent thread do the talloc_free()
> > in it (which destroys the mutex) once the
> > child thread has exited and been joined
> > removes the race.
> > 
> > To be honest I can't see how the race
> > could actually happen in practice here,
> > as the data flow in the child thread is :
> > 
> > proxy_schedule on master -> wait for master to reply
> > 	-> release resources
> > 
> > and the data flow in master is
> > 
> > create child thread -> wait for child request ->
> > 	send reply to child -> wait for child exit and join.
> > 
> > so how the 'release resources' could overlap
> > with the 'send reply to child' beats me - the
> > child must have *gotten* the reply before
> > it releases resources, but helgrind occasionally
> > complained about it.

agreed.

> Thinking about it, I think I know why helgrind
> complained. helgrind doesn't see the write -> pipe,
> read <- pipe constructs as a barrier, even though
> because they're forming a request / response pair
> they *are* a barrier for the purposes of unlocking
> then deleting the mutex.

Afaict helgrind complains because destroying the mutex can't be
done under protection of the lock and as such is subject to race
conditions when other threads try to acquire the very same mutex.

> So my test code change works as a helgrind work-around,

Applying the helgrind patch from this thread also lets helgrind pass
without errors:

<http://valgrind.10908.n7.nabble.com/Helgrind-3-9-0-false-positive-with-pthread-mutex-destroy-td47757.html>

Afaict the patch changes helgrind behaviour to not report *possible*
races as actual errors. As the test passes helgrind with this patch,
we know our code is safe in practice as long as the user only doesn't
prematurely free the proxy object.

> but in the real world isn't strictly neccessary,
> which is why it's safe to leave the docs with
> the talloc_free(ev) construct at the end of the
> child thread.

I'm not fully convinced. :)

There's a race condition in tevent_thread_proxy_destructor() and
helgrind pointed at it: we can't guarantee that a thread calling
tevent_thread_proxy_schedule() with a reply request passing a proxy
object doesn't free that proxy object prematurely for whatever reason.

The called thread would call tevent_thread_proxy_schedule() to send
the reply, locking an already destroyed mutex which is undefined
behaviour. Can we add some documentation that freeing proxy objects
with in flight operations should be avoided or similar?

Fwiw, my patch moving the signaling out the lock protected code in
tevent_thread_proxy_schedule() indeed added a similar race that would
be triggered by prematurely freeing the proxy object. At least this
one can be fixed by keeping the signalling in the lock protected code.

-Ralph

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de,mailto:kontakt@sernet.de