Some thoughts on the external recovery lock helper

Sat Jun 18 03:20:55 UTC 2016

On Fri, 17 Jun 2016 09:53:50 -0700, Richard Sharpe
<realrichardsharpe at gmail.com> wrote:

> On Fri, Jun 17, 2016 at 9:39 AM, Ira Cooper <ira at wakeful.net> wrote:

> > I might want to do the same thing for etcd, or some other database.
> >
> > We may also want the "locking" and the "database" to be separate.  
> 
> Indeed. There are two locking issues as well:
> 
> 1. The recovery lock that ctdb needs, and
> 
> 2. The need to lock records in the tdb store as they are being messed with.
> 
> These things are logically separate although they might use the same
> underlying mechanisms.

Also worth mentioning what is planned to come next...

To separate out cluster management and database recovery functionality,
we'll introduce a cluster lock, which will use the same mechanism as
the recovery lock.  The recovery lock will then be released after each
recovery.

I had originally intended to take the cluster lock when an election
times out and a node thinks it has won, and release it when a node
loses an election.  However, when I tried this out I found nasty edge
cases.  Amitay reminded me that we've already dropped the connected-ness
aspect of elections because, if a cluster becomes partitioned, whoever
holds the lock wins even if they're a lone node.  All nodes in the
other partition will try to take the lock, fail and then ban themselves.

So, the simplest election mechanism involving a cluster lock is for all
nodes to try to take the lock.  Whoever takes it is master.  When the
cluster lock is enabled we therefore use an alternate election
mechanism that is just a race.
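
To make the race concrete, here is a rough sketch.  It is not ctdb
code: ClusterLock is a hypothetical in-memory stand-in for the real
lock, and run_election() simply models nodes attempting the lock in
arrival order.

```python
import threading


class ClusterLock:
    """Hypothetical in-memory stand-in for the real cluster lock."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.holder = None

    def try_take(self, node):
        # Non-blocking attempt: failure means some other node holds it.
        if self._mutex.acquire(blocking=False):
            self.holder = node
            return True
        return False

    def release(self, node):
        # Only the current holder may release the lock.
        if self.holder == node:
            self.holder = None
            self._mutex.release()


def run_election(lock, nodes):
    """Every node tries to take the lock; whoever takes it is master."""
    for node in nodes:  # list order models arrival order in the race
        if lock.try_take(node):
            return node
    # Nobody could take it, so the existing holder remains master.
    return lock.holder
```

With a fresh lock, run_election(lock, ["n1", "n0", "n2"]) makes n1
master, and every later attempt fails until n1 releases the lock.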

Two things to consider here:

* This could cause instability because the master could move around a
  lot.  Two ways of combating this:

  1. A node that is currently master and can continue as master doesn't
     actually need to release the lock during an election.  It just
     maintains the lock and stays master.

  2. We have fewer elections.  The master broadcasts its status.  When
     nodes join, instead of triggering an election they listen for a
     broadcast from the master.  A timeout triggers an election.

* I've seen quorum mentioned.  This is already supported by the most
  recent round of changes but it is a bit subtle.

  If you want a mutex helper that supports quorum then it should:

  1. Exit when the node loses quorum.  The cluster manager notices the
     helper is gone and triggers an election.

  2. Refuse to take the mutex when the node isn't a quorum member,
     logging this (to stderr) and claiming contention.
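
The two quorum rules above could be sketched roughly as follows.
This is an illustration only: the status values, the function names
and the has_quorum/try_take_mutex/release_mutex callbacks are all
hypothetical and are not the real helper protocol.

```python
import sys

# Hypothetical status values reported back to the cluster manager.
HOLDING = "holding"
CONTENDED = "contended"


def helper_attempt(has_quorum, try_take_mutex, log=sys.stderr):
    """One attempt by a quorum-aware helper to take the mutex."""
    if not has_quorum():
        # Rule 2: not a quorum member, so log it and claim contention.
        print("not a quorum member, refusing to take mutex", file=log)
        return CONTENDED
    return HOLDING if try_take_mutex() else CONTENDED


def hold_mutex(has_quorum, release_mutex, wait):
    """Hold the mutex while quorum lasts, then release it and return.

    Rule 1: returning stands in for the helper process exiting, which
    the cluster manager notices and answers with an election.
    """
    checks = 0
    while has_quorum():
        checks += 1
        wait()  # e.g. sleep between quorum checks
    release_mutex()
    return checks
```

Passing the quorum check in as a callback keeps the sketch independent
of whatever actually decides quorum membership.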

Once the semantics of database recovery are clearly defined with
respect to elections, we might not need the recovery lock.
However, that's not a decision that needs to be made too far in
advance.  There's enough going on...  :-)

peace & happiness,
martin