[Samba] CTDB: test network oscillations on the leader node, resulting in brain splitting

Thu Jun 15 01:22:00 UTC 2023

Hi,

On Wed, 14 Jun 2023 09:15:58 +0800, "tu.qiuping via samba"
<samba at lists.samba.org> wrote:

> My ctdb version is 4.17.7

> My ctdb cluster configuration is correct and the cluster is healthy before operation.

> My cluster has three nodes, namely 192.168.40.131（node 0）,
> 192.168.40.132（node 1）, and 192.168.40.133（node 2）. And the node
> 192.168.40.133 is the leader.

> I conducted network oscillation testing on node 192.168.40.133, and
> after a period of time, the lock update of this node failed, and at
> this time, the lock was taken away by node 0. Amazingly, after node 0
> received the lock, it sent a message with leader=0 to node 1, but did
> not send it to node 2, even though the network of node 2 was healthy
> at this time. And when I restored the network of node 2, node 1 and
> node 2 kept trying to acquire the lock and reported an error: Unable
> to take cluster lock - contention.

It would be excellent if you could describe your network oscillation
testing in some more details.  I assume you are taking down a network
connection for some time and then bringing it back up.  The down and up
timings would be useful for debugging.

By the way, you should never be amazed by clustering.  It is all based
on timeout-based heuristics, so you will almost certainly see unexpected
things.  :-)

I strongly suggest that you merge the logs by sorting them when trying
to debug these types of issues.

The main events are...

Nodes 0 and 1 are worried about their leader:

  2023-06-12T19:39:32.603600+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Leader broadcast timeout
  2023-06-12T19:39:32.604185+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout

The nodes notice that node 2 has become disconnected from the rest of
the cluster:

  2023-06-12T19:39:34.005643+08:00 host-192-168-40-132 ctdbd[659192]: dead count reached for node 2
  2023-06-12T19:39:34.343927+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 0
  2023-06-12T19:39:34.344332+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 1
  2023-06-12T19:39:35.414699+08:00 host-192-168-40-131 ctdbd[2853172]: dead count reached for node 2

During this time, node 2 completes recovery with itself:

  2023-06-12T19:39:34.361318+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to ACTIVE
  ...
  2023-06-12T19:39:34.445242+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to NORMAL

Node 2 starts to have problems because it can now take the cluster
lock during recovery:

  2023-06-12T19:39:37.680876+08:00 host-192-168-40-133 ctdbd[2520003]: ERROR: Daemon able to take recovery lock on "!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4" during recovery

Node 0 successfully takes the cluster lock:

  2023-06-12T19:39:39.229085+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Took cluster lock, leader=0

Node 0 runs a recovery for nodes 0 and 1:

  2023-06-12T19:39:39.420005+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery
  ...
  2023-06-12T19:39:39.432822+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery has started
  2023-06-12T19:39:39.432969+08:00 host-192-168-40-132 ctdbd[659192]: Recovery has started
  ...
  2023-06-12T19:39:39.549895+08:00 host-192-168-40-131 ctdb-recovery[3065583]: 1 of 1 databases recovered
  2023-06-12T19:39:39.549985+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery mode set to NORMAL
  2023-06-12T19:39:39.551083+08:00 host-192-168-40-132 ctdbd[659192]: Recovery mode set to NORMAL

Node 2 still tries to recover because it thinks it is leader:

  2023-06-12T19:39:39.684162+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery

Node 1 receives a leader broadcast (note that node 2 is still disconnected):

  2023-06-12T19:39:39.685181+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=0

Nodes 0 & 1 notice the return of node 2:

  2023-06-12T19:39:41.420521+08:00 host-192-168-40-131 ctdbd[2853172]: 192.136.40.131:4379: connected to 192.136.40.133:4379 - 2 connected
  2023-06-12T19:39:42.013100+08:00 host-192-168-40-132 ctdbd[659192]: 192.136.40.132:4379: connected to 192.136.40.133:4379 - 2 connected

Node 2 things it is still leader, so sends a leader broadcast,
received by node 0:

  2023-06-12T19:39:42.617032+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=2

Note that there aren't enough logs from node 2 to know if it also
received a leader broadcast from node 0.

Now node 2 is gone again:

  2023-06-12T19:39:51.420623+08:00 host-192-168-40-131 ctdbd[2853172]: dead count reached for node 2

This is only 10s after it was rejoined, so KeepaliveInterval and/or
KeepaliveLimit must be greatly reduced from the defaults.  It is hard
to understand what is happening without knowing which configuration
values have been changed from the defaults.

I can see 1 main problem here: you have an unresolved split brain.
Node 2 does not realise it has lost the cluster lock.  It still thinks
it is leader while node 0 is able to take the lock.  It seems like you
need to reduce timeouts on the cluster lock.  For the standard fcntl
helper, we added a configurable re-check time.

The general situation could be helped by introducing a stronger notion
of cluster membership (where membership is granted by the leader).  At
the moment nodes join the cluster when they are connected and this is
too weak.  Fixing this requires a major redesign, of the type we have
spoken about at SambaXP for years but have not yet had time to do.

By the way, putting "CTDB" in the subject is good because I don't read
anything on the general samba list that doesn't have CTDB in the
subject.  Also, if you post a merge request for CTDB, please add me as
a reviewer so I see it.  I don't get time to check all merge requests.

Thanks...

peace & happiness,
martin