CTDB: Split brain and banning

Thu Nov 1 03:11:08 UTC 2018

Hi David,

On Wed, 31 Oct 2018 16:37:14 +0100, David Disseldorp <ddiss at samba.org>
wrote:

> On Wed, 31 Oct 2018 10:39:41 +0100, Michel Buijsman via samba-technical wrote:
> 
> > > As of ce289e89e5c469cf2c5626dc7f2666b945dba3bd, which is carried in
> > > Samba 4.9.1 as a fix for bso#13540, the recovery master's reclock should
> > > time out after 10 seconds, allowing one of the remaining nodes to
> > > successfully take over. How long after the recovery master outage do you
> > > see the ban occur? Full logs of this would be helpful.
> > 
> > I've run a test with RecoveryBanPeriod=10 to keep the ban time somewhat
> > manageable. It takes a few seconds for the whole cluster to get banned,
> > certainly less than 10. I've attached the relevant logs from two nodes 
> > after I'd killed the third. Lock contention, looks like.  
> 
> It appears that ctdbd doesn't gracefully handle cases where the recovery
> master goes down holding the reclock and standby nodes can't immediately
> obtain the reclock following election. Your reclock helper lock_duration
> setting of "30" means that the standby nodes may need to wait up to 30
> seconds before obtaining the recovery lock.
> If you specify a lock_duration of "5" and set RecoveryBanPeriod=5, does
> your cluster return to OK ~5 seconds after master outage?
> 
> @Amitay/Martin: should I change the recovery lock helper to block while
> retrying multiple times to obtain the recovery lock? Such a change
> should avoid the immediate ban that occurs when we report contention.
> I'm curious to hear how other clustered FSes / lock helpers handle
> releasing the recovery lock once the holder dies - does GPFS do this
> immediately?

[Somewhat repeating myself here for clarity]

That's actually a very good idea... but then the hard-coded 5-second
limit for taking the reclock would be hit.  We could think about making
that timeout configurable.
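
To illustrate the retry idea, here is a minimal Python sketch of a
helper that keeps retrying a non-blocking lock attempt until a deadline
passes.  This is not the actual helper code: acquire_with_retry,
try_lock and the parameter names are hypothetical, and the 5 second
default only mirrors the limit mentioned above.

```python
import time


def acquire_with_retry(try_lock, timeout_s=5.0, retry_interval_s=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Retry a non-blocking lock attempt until it succeeds or times out.

    try_lock is a hypothetical callable that returns True when the lock
    was taken and False on contention.  clock and sleep are injectable
    so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if try_lock():
            return True          # lock obtained
        remaining = deadline - clock()
        if remaining <= 0:
            return False         # still contended at the deadline
        # Never sleep past the deadline.
        sleep(min(retry_interval_s, remaining))
```

With a lock_duration of 30, a 5 second deadline like this would usually
expire before the stale lock does, which is why the limit would need to
be configurable for the retry approach to help.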

However, I think Michel's setting for KeepaliveInterval is too
ambitious, given that it is less than the lock duration.  I think that
is the core of the problem.

Having thought about this over lunch, much of this is really a question
about how quickly you can get back a usable cluster.  There's no use
setting KeepaliveInterval so low that you can't get through recovery
because of other limitations.  GPFS (and any other distributed mechanism
used for locking) will take a finite time to recover.
Setting ElectionTimeout to defer recovery after a node goes away is
just a workaround - you know the node is gone but the cluster is still
in the gutter.
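
As a rough illustration of the timing argument, the sketch below
compares how quickly a dead node is noticed with how long a stale lock
can remain held.  The function names and the example numbers are made
up for illustration, and it assumes the worst case where the holder
renewed the lock just before it died.

```python
def stale_lock_wait(detect_s, lock_duration_s):
    """Worst-case seconds the lock stays held after the failure is noticed.

    detect_s: seconds until the other nodes notice the holder is gone.
    lock_duration_s: seconds until an unrenewed lock expires.
    """
    return max(0.0, float(lock_duration_s) - float(detect_s))


def can_take_lock(detect_s, lock_duration_s, take_timeout_s=5.0):
    """True if the stale lock expires within the time allowed to take it."""
    return stale_lock_wait(detect_s, lock_duration_s) <= take_timeout_s


# Hypothetical numbers: failure noticed after 3 seconds.
print(can_take_lock(3, 30))  # lock_duration 30: stale lock outlives the limit
print(can_take_lock(3, 5))   # lock_duration 5: lock is free in time
```

Detecting the failure faster only increases the wait in the first
function; it does not make the lock available any sooner.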

So, the question becomes: do you care if a node has gone away if you
can't do anything (reliable) about it?  ;-)

All that said, I'm definitely not trying to say that there are no bugs
or logical flaws... I really do want to track down any bugs that are
there.

peace & happiness,
martin