[PATCH] Improve code in ctdb lock helper

Fri Dec 2 05:19:58 UTC 2016

On Fri, Dec 2, 2016 at 4:06 AM, Jeremy Allison <jra at samba.org> wrote:

> On Thu, Dec 01, 2016 at 10:47:06AM +0100, Volker Lendecke wrote:
> > On Thu, Dec 01, 2016 at 03:33:09PM +1100, Amitay Isaacs wrote:
> > > Hi,
> > >
> > > Here are two patches that do the following:
> > >
> > > 1. Drop the support for locking multiple databases since it's not
> required
> > > any more.
> > > 2. Add explicit unlocking of record/db when lock helper is exiting
> >
> > Are we sure that we only ever call async signal safe functions in the
> > tdb functions? For example I'm pretty certain that tdb_close calls
> > free(), which is not async signal safe. What happens for example if
> > the signal comes before we expect it, and we are in a malloc call?
> > This would block the child forever.
> >
> > Can we change this to not call tdb functions directly in the signal
> > handler but set a global variable, expecting EINTR at all relevant
> > points?
>
> Oh god yes. The list of things you can do inside a signal
> handler is vanishingly small, and tdb calls are certainly
> not on that list.
>

Yes, point well taken. I will rework the patch.

I have now managed to reproduce the bug which the second patch was trying
to address.

CTDB runs (historically) as real-time process and all the lock helper
processes also run as real-time processes. We would like to get rid of
real-time scheduling completely.

The first step was to see if we can run the lock helper processes with high
priority (nice) rather than running them as real-time processes.  Testing
with this change, Martin and I noticed that we get lots of stuck lock
helper processes.  The lock helper processes were waiting for the mutex (as
seen from the kernel stack in /proc/<pid>/stack).  Querying the mutex
revealed that no process was actually holding a lock on that mutex.

I have been trying to create a reproducer for this for some time.  Now I
have managed to create a reproducer using tdb and later using robust mutex
directly.

Attached is the version using robust mutex. It's based on the robust mutex
code in tdb and the way lock helpers are used in CTDB.

It does the following:

- Open a file, mmap it and create a mutex
- Lock the mutex
- Create <n> child processes
- Unlock the mutex
- Wait for child processes to terminate

Each child process does the following:

- Lock the mutex
- Kill itself with SIGKILL

Ideally, one would expect the program to terminate normally.  After each
child process is killed, the robust mutex should get "freed", so it can be
used by next child process.  Once all the child processes are killed, the
program will exit.

Volker, I would really appreciate if you can verify that I am doing testing
correctly.

Usually it takes few runs to reproduce the problem.  So I run this as
follows:

  # while true ; do ./bin/test_mutex_raw /tmp/foobar 10 0 ; done

This will create 10 child processes running with normal priority (0).  The
last option is for running the processes real-time (1) and with nice -20
(2).
After few iterations the program will get stuck.  At that point you can
verify that the child processes are all waiting for the mutex.

To inspect the owner of the mutex:

  # ./bin/test_mutex_raw /tmp/foobar debug

If any process is holding a lock on the mutex, then this would output the
pid of the process holding a lock.  If no process is holding a lock,
nothing will be printed.

If you perturb any of the stuck processes - either using strace -p or
gstack, it will release that process and sometimes all the stuck processes.

The real worry is that this FAILS on modern systems (with latest
kernel/glibc) with any priority setting.  This points to a bug in glibc,
kernel or both.
We haven't noticed this before since we are running on RHEL6 with real-time
scheduling and there it takes long time to reproduce this bug.

Any comments / suggestions are welcome!

Amitay.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_mutex_raw.c
Type: text/x-csrc
Size: 4729 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20161202/777e0617/test_mutex_raw.c>