Running ctdb with a designated master

Martin Schwenke martin at meltin.net
Tue Jun 21 10:33:38 UTC 2016


On Tue, 21 Jun 2016 03:37:12 -0500, Steve French <smfrench at gmail.com>
wrote:

> On Tue, Jun 21, 2016 at 3:31 AM, Steve French <smfrench at gmail.com> wrote:
> > On Tue, Jun 21, 2016 at 3:16 AM, Steve French <smfrench at gmail.com> wrote:  
> >> I tried some experiments with forcing only one node (e.g. a cluster
> >> file system's metadata server) to be ctdb master by putting the line
> >>
> >> CTDB_CAPABILITY_RECMASTER=no
> >>
> >> in the ctdb config file of all other ctdb nodes, and removing the line
> >>
> >> CTDB_RECOVERY_LOCK
> >>
> >> from those systems as well and specifying
> >>
> >> CTDB_RECOVERY_LOCK=/var/lock/.recoverylock
> >>
> >> only on the master.
> >>
> >> I was wondering if it is safe to remove the CTDB_RECOVERY_LOCK line
> >> from the config of the non-master nodes - it seemed to work, but there
> >> are various warnings about never running without a CTDB_RECOVERY_LOCK
> >> line.
> >
> > This is looking strange - I rebooted a machine as an experiment (not
> > the master), and the node went unhealthy when it started back up.
> > Looking in the logs, it tried to grab the recovery lock
> > (/var/lock/.recoverylock) even though I don't have that line
> > configured on that node.  It is apparently getting the location of the
> > recovery lock from the master - and then won't start because it thinks
> > the lock should be held (it is, but by the master - this node
> > shouldn't be using it at all, since it is a local dummy file).
> >
> > If I am forcing one node to be the master by setting
> > CTDB_CAPABILITY_RECMASTER=no on the other nodes, should I remove
> > CTDB_RECOVERY_LOCK from ALL nodes' ctdb configuration?
> 
> By the way - that did seem to work ... but seems a little strange

There is a sanity check at the end of each recovery.  The recovery
master holds the recovery lock in its recovery daemon.  As part of
changing the recovery mode back to normal on each node, the main daemon
tries to take the recovery lock.  If it succeeds then something is
wrong: the node doesn't come out of recovery, the recovery fails, and
the offending node is eventually banned.
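
To make that concrete, roughly (a sketch of the logic only, not the
actual ctdbd source; the names below are made up):

  # on every node, when setting the recovery mode back to NORMAL
  if try_to_take_lock "$CTDB_RECOVERY_LOCK"; then
      # Only the recovery master's recovery daemon should be holding
      # this lock.  Succeeding here means the lock isn't protecting
      # anything, so the recovery is failed; a node that keeps failing
      # recoveries like this eventually gets banned.
      abort_recovery
  fi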

In CTDB <= 4.4.x, the recovery lock can be updated at run-time.  The
recovery master pushes its recovery lock setting to the other nodes,
and that is what you're seeing.
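
If you want to see what a node has ended up with at run time, the ctdb
tool should show it (from memory of the 4.4.x tool, so please check
ctdb(1) on your version):

  # show the recovery lock file this node is currently using
  ctdb getreclock

  # and on <= 4.4.x the setting can also be changed on the fly
  ctdb setreclock /some/shared/path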

In master, we decided there is no safe way to recover from a failure
to update the recovery lock, so we simplified life by removing the
ability to update it (though you can still be creative with helpers).
However, to ensure that the recovery lock is useful, we added a
consistency check.  If a node's recovery lock setting is inconsistent
with the master's at the start of recovery then ctdbd will shut down.
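
So with a designated recovery master, if you want a recovery lock at
all, the configuration needs to look the same on every node, along the
lines of (the path is just an example, and it must be on storage that
all nodes can see):

  # identical on every node
  CTDB_RECOVERY_LOCK=/clusterfs/.ctdb/reclock

or alternatively no CTDB_RECOVERY_LOCK line on any node.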

It doesn't make sense for just 1 node to have the recovery lock.  In
that case it isn't locking against anything and can be de-configured.

In master you can always add a call-out that checks for some condition
on the node and only takes the lock if that condition is satisfied.
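
For example, something along these lines (only a sketch: the marker
file, the script path and the helper's installed location are made up
for illustration, and the exact mutex helper interface is documented
under ctdb/doc, so check it for your version).  In the ctdb config on
every node, a leading "!" means the rest is run as a mutex helper
command instead of locking a file directly:

  CTDB_RECOVERY_LOCK="!/usr/local/bin/reclock-callout /clusterfs/.ctdb/reclock"

and /usr/local/bin/reclock-callout could be something like:

  #!/bin/sh
  # Only attempt the recovery lock on the designated metadata server;
  # the marker file is a made-up way of expressing that property.
  [ -f /etc/ctdb/node-is-mds ] || exit 1
  # Otherwise hand over to the stock fcntl-based helper shipped with
  # ctdb, which takes the lock on the given file and reports back to
  # ctdbd (the installed path varies between builds).
  exec /usr/libexec/ctdb/ctdb_mutex_fcntl_helper "$1"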

peace & happiness,
martin


