CTDB: Split brain and banning

Martin Schwenke martin at meltin.net
Thu Nov 1 01:07:36 UTC 2018

Hi Michel,

On Tue, 30 Oct 2018 15:10:18 +0100, Michel Buijsman via samba-technical
<samba-technical at lists.samba.org> wrote:

> I'm building a 3 node cluster of storage gateways using CTDB to connect 
> various NFS and ISCSI clients to CEPH storage. I'm using a rados object as 
> reclock using ctdb_mutex_ceph_rados_helper.
> I'm having two problems:
> 1. Node banning: Unless I disable bans, the whole cluster tends to ban 
>    itself when something goes wrong. As in: Node #1 (recovery master) dies, 
>    then nodes #2 and #3 will both try to get the reclock, fail, and ban 
>    themselves.
>    I've "fixed" this for now with EnableBans=0.

This *should* be better in 4.9.1 due to:


When I introduced the recovery lock helpers I didn't think about races
between elections and attempts to take the reclock.  The bug fix means
that when an attempt to take the reclock fails then the recovery daemon
checks if the recovery master has changed instead of banning the node.

Please let me know if this isn't helping the above case.
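
To make the change in behaviour concrete, here is a minimal sketch in
Python.  It is not the ctdbd code: the function name, the Action values
and the "fixed" flag are all made up for illustration, and it assumes
that a node which turns out to still be master is banned as before.

```python
from enum import Enum


class Action(Enum):
    BAN_SELF = "ban self"
    STAND_DOWN = "stand down"  # another node won the election


def on_reclock_failure(my_pnn, current_master_pnn, fixed=True):
    """Decide what a node does after failing to take the reclock.

    Hypothetical illustration only.  With fixed=False the node is banned
    unconditionally (the old behaviour described above).  With
    fixed=True the node first checks whether the recovery master has
    changed, and only bans itself if it is still master (an assumption).
    """
    if fixed and current_master_pnn != my_pnn:
        return Action.STAND_DOWN
    return Action.BAN_SELF


# Node 1 loses the race for the lock; meanwhile node 2 became master.
print(on_reclock_failure(my_pnn=1, current_master_pnn=2))
print(on_reclock_failure(my_pnn=1, current_master_pnn=2, fixed=False))
```

In the 3 node example from the report, the node that loses the race for
the lock would take the first path rather than banning itself.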

> 2. Split brain: If the current recovery master drops off the network for 
>    whatever reason but keeps running, it will ignore the fact that it can't 
>    get the reclock: "Time out getting recovery lock, allowing recmode set 
>    anyway". It will remain at status "OK" and start to claim every virtual
>    IP in the cluster.

Yeah, that's a tough one to get right... but please see my comments
below about the keep-alive configuration.

> The split brain is obviously a problem as soon as the node gets back online:
> Having IPs up on multiple nodes, having that node try to (re)claim resources 
> that have timed out and failed over to other nodes, etc.
> That node doesn't seem to recover either after getting back on the network:
> It still thinks it's the recovery master and will keep trying for a reclock,
> getting lock contention, without resetting itself.

I don't understand this.  :-(

There are 2 distinct cases that I can see:

* The node that was recovery master reboots or, at least, ctdbd restarts

  In this case it doesn't know which node is master so should call an
  election, so this shouldn't happen.

* The node dropped off the network and reconnected

  In this case the node has missed the election, so still thinks it is
  master. However, in this case the run-state should not be "first
  recovery", so the node should be banned if it can't take the lock.

Hmmm... is the real problem that the node still thinks it is master
*and* it still thinks it has the reclock?

If you're seeing something different, can you please explain it again,
compared to the above 2 cases?

> I ran into this using CTDB 4.7.6 on Ubuntu 18.04 Bionic. Since upgraded to 
> 4.9.1, which still shows the same behaviour. Other than the event handlers
> this is a fairly standard CTDB config, I've just configured the reclock to 
> use the ctdb_mutex_ceph_rados_helper and played with a few tunables:
> [...]
>     KeepaliveInterval=1
>     KeepaliveLimit=5

I think this is part of the problem.  The above are used to detect a
node that hasn't been heard from "recently", so is considered "dead".
The above settings will cause ctdbd to detect this in only 5 seconds.
However, in David's reply he says that ctdb_mutex_ceph_rados_helper
will time out after 10 seconds.  So, the node failure is being
processed before the reclock has timed out.
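
The arithmetic can be sketched in Python.  This is a rough model only:
it assumes the detection time is simply KeepaliveInterval multiplied by
KeepaliveLimit, and it takes the 10 second helper timeout from David's
reply rather than from the helper's source.  The function names are
invented for this example.

```python
def dead_node_detection_secs(keepalive_interval, keepalive_limit):
    """Approximate seconds before a silent node is considered dead."""
    return keepalive_interval * keepalive_limit


def detection_waits_for_lock(keepalive_interval, keepalive_limit,
                             lock_timeout_secs):
    """True if a dead node is only noticed after the lock has timed out."""
    detect = dead_node_detection_secs(keepalive_interval, keepalive_limit)
    return detect > lock_timeout_secs


HELPER_TIMEOUT_SECS = 10  # assumption, taken from David's reply

detect = dead_node_detection_secs(1, 5)
print("detection after %d s, lock times out after %d s"
      % (detect, HELPER_TIMEOUT_SECS))
print("safe ordering:", detection_waits_for_lock(1, 5, HELPER_TIMEOUT_SECS))
```

With the settings above the check fails, since the node failure is
handled 5 seconds before the lock can time out.  Any combination whose
product is larger than the lock timeout would pass this simple check.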

Do you really need to notice dead nodes that quickly?  :-)

> Grepping the source, ignoring the reclock when it times out seems to be a 
> conscious decision. This strikes me as odd since it directly leads to split 
> brain in this case. I would expect it to fail hard on not getting a lock. 
> Would it be possible to make this behaviour configurable with a tunable?

I obviously need to think about that.  :-)

I wonder if we need a configurable number of timeouts during which the
lock is retried... and then the node is finally banned.  This relates to David's
question about whether the helper should block and retry internally -
that seems like a better solution.  However, then we hit the fixed 5
second time-out that is allowed for taking the recovery lock.  Perhaps
that needs to be configurable.  Then David's suggestion could work.
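
A minimal sketch of the retry idea in Python, to show the shape of it
rather than any real implementation.  Nothing here exists in CTDB:
try_lock stands in for one attempt to take the recovery lock (returning
False on timeout or contention) and max_attempts is the hypothetical
configurable limit.

```python
def take_reclock_with_retries(try_lock, max_attempts):
    """Retry a lock attempt up to max_attempts times.

    try_lock is a caller-supplied callable (hypothetical) that returns
    True if the lock was taken and False if the attempt timed out.
    Returns ("locked", attempts_used) on success, or
    ("banned", max_attempts) once every attempt has failed.
    """
    for attempt in range(1, max_attempts + 1):
        if try_lock():
            return ("locked", attempt)
    return ("banned", max_attempts)


# Simulated lock that times out twice and then succeeds.
outcomes = iter([False, False, True])
print(take_reclock_with_retries(lambda: next(outcomes), max_attempts=3))

# Simulated lock that never becomes available.
print(take_reclock_with_retries(lambda: False, max_attempts=3))
```

If the helper blocked and retried internally instead, the loop would
live in the helper and the per-attempt time-out would be the value that
needs to be configurable.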

> Or am I doing something wrong? :)

Not "wrong", but I think KeepaliveInterval=1 is very "ambitious".  :-)

More replies coming for other messages...

peace & happiness,
