RAFT and CTDB

Martin Schwenke martin at meltin.net
Thu Nov 20 17:04:32 MST 2014


On Thu, 20 Nov 2014 15:55:39 -0800, Richard Sharpe
<realrichardsharpe at gmail.com> wrote:

> On Thu, Nov 20, 2014 at 3:41 PM, Martin Schwenke <martin at meltin.net> wrote:
> > On Thu, 20 Nov 2014 15:24:39 -0800, Richard Sharpe
> > <realrichardsharpe at gmail.com> wrote:
> >
> >> Hmmm, so the essential abstraction here is that any node that is no
> >> longer a member of the cluster (because it can't get a lock on that
> >> file) cannot try to run recovery. I.e., in ctdb_recovery_lock we try to
> >> open the recovery lock file and then take out a lock on it.
> >>
> >> The first should/will fail if we are no longer a member of the cluster
> >> and the second will fail if the cluster properly supports fcntl locks
> >> but another recovery daemon has already locked the file ...
> >
> > No, only the recovery master can hold the recovery lock.  Other nodes
> > would not be able to take the lock but they are still cluster members.
> 
> Isn't that what I said? When I said cluster above I was referring to a
> GPFS cluster.

CTDB has its own independent notion of cluster membership and I thought
you were referring to that.  I didn't notice you mentioning GPFS.  :-)

> > Cluster membership is defined by being connected to the node that is
> > currently the recovery master.  That is, nodes that the recovery master
> > knows about (i.e. connected) and are active (i.e. not stopped or
> > banned) will take part in recovery.
> 
> OK, that is a wrinkle I had not thought of. What if they have lost
> connection to the GPFS cluster but are still talking to the recovery
> master?

Then you would hope that they can't take the recovery lock.  ;-)
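
For illustration only, here is a rough Python sketch of the
open-then-lock sequence Richard describes above.  It is not the CTDB
code: the function name try_recovery_lock and the status strings are
made up for this example, and it assumes the cluster filesystem gives
coherent fcntl locks across nodes.

```python
import errno
import fcntl
import os


def try_recovery_lock(path):
    """Try to open and lock the recovery lock file (hypothetical helper).

    Returns (status, fd).  fd is an open descriptor only when status is
    "locked"; the lock is held for as long as that descriptor stays open.
    """
    # Step 1: open the file.  On a node expelled from the cluster
    # filesystem you would hope this already fails.
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        return "open-failed", None

    # Step 2: non-blocking exclusive fcntl lock.  This fails if another
    # process (e.g. the real recovery master) already holds the lock.
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return "held-elsewhere", None
        return "lock-failed", None

    return "locked", fd
```

Note that fcntl locks are per-process, so contention only shows up
between different processes (or nodes), not between two descriptors in
the same process.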

If a node in a break-away cluster (i.e. a group of nodes, perhaps just
1, that has lost its CTDB connection with the main cluster) wins an
election then it will try to become recovery master.  When it tries to
take the recovery lock and fails, it will ban itself.  Rinse and repeat
for the other nodes in the break-away cluster.

So, provided nodes in a break-away cluster can't take the recovery
lock, they will all get banned and can do no harm.
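
A toy model of that elect/try-lock/ban loop, in case it helps.  This is
a sketch under assumptions, not how CTDB implements it: run_breakaway is
a made-up name, "lowest node number wins" stands in for the real
election, and the lock is just a variable naming its current holder.

```python
def run_breakaway(nodes, lock_holder):
    """Simulate elections in a break-away cluster (hypothetical model).

    nodes       -- node numbers in the break-away cluster
    lock_holder -- node currently holding the recovery lock, or None

    Returns (recovery_master, banned), where recovery_master is None if
    every node ended up banned.
    """
    active = sorted(nodes)
    banned = []
    while active:
        winner = active[0]  # election stand-in: lowest node number wins
        if lock_holder is None or lock_holder == winner:
            # The lock was free (or already ours): winner becomes
            # recovery master.
            return winner, banned
        # Could not take the recovery lock: the winner bans itself and
        # the remaining nodes hold another election.
        active.remove(winner)
        banned.append(winner)
    return None, banned


# Nodes 3 and 4 split off while node 0 in the main cluster holds the lock.
print(run_breakaway([3, 4], lock_holder=0))
```

With the lock held by a node outside the break-away cluster, every
break-away node ends up banned.  With lock_holder=None (the case where
expelled nodes can still take the lock), node 3 would become a second
recovery master, which is the situation the shutdown callback is meant
to prevent.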

If such nodes can still take the recovery lock after being expelled
from the GPFS cluster then you should probably have the appropriate GPFS
callback shut down CTDB.  Depending on the CTDB configuration, this will
probably take down Samba and other services, preventing any issues.

peace & happiness,
martin


More information about the samba-technical mailing list