RAFT and CTDB

Thu Nov 20 16:41:20 MST 2014

On Thu, 20 Nov 2014 15:24:39 -0800, Richard Sharpe
<realrichardsharpe at gmail.com> wrote:

> Hmmm, so the essential abstraction here is that any node that is no
> longer a member of the cluster (because it can't get a lock on that
> file) cannot try to run recovery. Ie, in ctdb_recovery_lock we try to
> open the recovery lock file and then take out a lock on it.
> 
> The first should/will fail if we are no longer a member of the cluster
> and the second will fail if the cluster properly supports fcntl locks
> but another recovery daemon has already locked the file ...

No, only the recovery master can hold the recovery lock.  Other nodes
would not be able to take the lock but they are still cluster members.

Cluster membership is defined by being connected to the node that is
currently the recovery master.  That is, nodes that the recovery master
knows about (i.e. is connected to) and that are active (i.e. not
stopped or banned) will take part in recovery.
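
To make that rule concrete, here is a minimal sketch of the membership
test as seen from the recovery master.  The Node class, its field names
and recovery_participants() are made up for illustration; they are not
the data structures CTDB actually uses.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical per-node view held by the recovery master.
    pnn: int                 # node number (illustrative field name)
    connected: bool = True   # the recovery master can talk to this node
    stopped: bool = False    # administratively stopped
    banned: bool = False     # temporarily excluded from the cluster


def recovery_participants(nodes):
    """Return the numbers of the nodes that would take part in recovery.

    A node takes part only if it is connected to the recovery master
    and is active, i.e. neither stopped nor banned.
    """
    return [n.pnn for n in nodes
            if n.connected and not (n.stopped or n.banned)]
```

Note that holding the recovery lock does not appear anywhere in this
test: a node is a member because of its connection and state, not
because of the lock.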

If a node becomes disconnected then it will try to become the recovery
master of its own cluster.  If it can take the recovery lock then it is
allowed to do that.
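
The "can it take the recovery lock?" step can be sketched roughly as
below.  This is only an illustration of a non-blocking fcntl lock
attempt on a file in the cluster filesystem, not the code in
ctdb_recovery_lock; the function name and the choice to lock the whole
file are assumptions.

```python
import fcntl
import os


def try_take_recovery_lock(path):
    """Try to take an exclusive fcntl lock on the recovery lock file.

    Returns an open file descriptor that holds the lock, or None if the
    file cannot be opened or another process already holds the lock.
    The lock is released when the descriptor is closed.
    """
    try:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    except OSError:
        # Cannot reach or open the lock file at all.
        return None
    try:
        # Non-blocking: fail straight away instead of waiting if some
        # other process (another recovery daemon) holds the lock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd
```

A disconnected node that gets None back must not act as recovery
master.  One thing to keep in mind with fcntl locks is that they are
owned by the process, so closing any descriptor for the lock file in
that process drops the lock.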

So the recovery lock simply helps to stop a split brain, where multiple
clusters operate independently of each other.  Each would have a
different cluster database, so they would have inconsistent ideas of,
for example, the contents of locking.tdb... and this can obviously lead
to file data corruption.

peace & happiness,
martin