[Samba] Nodes in CTDB Cluster don't release recovery lock

Martin Schwenke martin at meltin.net
Thu May 7 07:05:20 UTC 2020

Hi Christian,

On Wed, 6 May 2020 13:16:36 -0700, Christian Kuntz via samba
<samba at lists.samba.org> wrote:

> First of all, apologies if this isn't the right location for this question,
> I couldn't find a CTDB specific mailing list or IRC so I figured the
> general one would be appropriate. Please let me know if this question is
> better placed elsewhere.

This is the right place.  Thanks for asking!  :-)

> I'm trying to test clustered samba and have a two node CTDB setup
> (Following the guide here:
> https://wiki.samba.org/index.php/CTDB_and_Clustered_Samba) and I can't seem
> to get the current setup in a totally healthy state. If there is no
> recovery lock enabled, then the two nodes think they are healthy and the
> other is dead, and if there is a recovery lock enabled, one node seizes it
> and never releases it.
> I'm running Debian Buster, with Lustre+ZFS as the clustered file system for
> the recovery lock, and using upstream's Samba and CTDB. The lustre clients
> are mounted with flock, and I've confirmed the lock is being held with
> `lslock`.
> Here is the debug information from the "healthy node" (the unhealthy one
> just says it can't take the lock as it is under contention, so I thought it
> would be of little use):
> https://drive.google.com/drive/folders/1yzFdwGKfDAHiVbsj7pcOe-jX4KHpFO2U?usp=sharing

Your main problem seems to be that the nodes are not connecting to each
other:

  $ ls ctdb*.log
  ctdb-full.log  ctdb-locked-out-node.log  ctdb.log
  $ grep "connected to" ctdb*.log

You should see lines similar to this:

  2020/05/07 16:58:06.186548 node.2 ctdbd[1351213]: connected to - 2 connected
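
If you want a quick per-file summary rather than reading the grep
output, something like the following rough sketch might help.  The
helper name count_connected is made up for illustration; it is not part
of CTDB.  It only counts lines containing "connected to" in each log
file it is given, so a count of 0 suggests that node never logged a
connection to a peer.

```shell
# Hypothetical helper (not shipped with CTDB): print, for each log file
# given, how many lines contain the string "connected to".
count_connected() {
    for f in "$@"; do
        # grep -c prints 0 (and exits non-zero) when nothing matches;
        # only the printed count is used here.
        printf '%s: %s\n' "$f" "$(grep -c 'connected to' "$f")"
    done
}
```

You would run it as "count_connected ctdb*.log" in the directory that
holds the logs.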

Lack of connectivity would also explain the 2nd node trying to take the
recovery lock. It will elect itself leader of its own cluster and try
to take the lock.  Luckily, the nodes are communicating at the
filesystem level, so the lock can't be taken and the 2nd node cannot
proceed with database recovery... so split brain is avoided.

Is there a firewall blocking TCP port 4379?
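
One way to test this from each node is sketched below.  It assumes bash
and the coreutils "timeout" command are available, and it relies on
bash's /dev/tcp redirection rather than any CTDB tool.  The function
name port_open is made up for illustration, and the port defaults to
4379 as mentioned above.

```shell
# Hypothetical helper: succeed if a TCP connection to HOST (and
# optionally PORT, default 4379) can be opened within 3 seconds.
port_open() {
    # Host and port are passed as arguments to the inner bash so they
    # are not interpolated into the command string.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "${2:-4379}" 2>/dev/null
}
```

On each node you could run "port_open ADDRESS && echo reachable", where
ADDRESS is the other node's private address as listed in your nodes
file.  A failure in either direction would point at a firewall or
routing problem, or at ctdbd not listening on that address.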

peace & happiness,

