[Samba] Nodes in CTDB Cluster don't release recovery lock

Christian Kuntz c.kuntz at opendrives.com
Thu May 7 18:57:35 UTC 2020


Hello all,

Thanks for the input. I opened up the firewall ports and tested
connectivity with ctdb's ping, to no avail.
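
For reference, the connectivity test I ran was along these lines (the
peer address is a placeholder; ctdb must be running on both nodes):

  # Ask the local ctdb daemon to ping every configured node
  ctdb ping -n all
  # Check the CTDB transport port is reachable from this node
  nc -zv <other-node-ip> 4379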

I did, however, fix the problem. I must have missed the section of the
guide outlining the importance of the nodes file: the issue was that
machine A's nodes file listed the nodes in reverse order compared to
B's. After making the two files identical, the cluster came up healthy
and the problem is resolved.
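
For anyone hitting the same thing: the nodes file (/etc/ctdb/nodes by
default) must list the private addresses of all nodes, one per line, in
the same order on every machine. A sketch with placeholder addresses:

  # /etc/ctdb/nodes -- must be identical on machine A and machine B
  192.168.1.1
  192.168.1.2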

Thanks everyone for your time!
Christian

On Thu, May 7, 2020 at 12:06 AM Martin Schwenke <martin at meltin.net> wrote:

> Hi Christian,
>
> On Wed, 6 May 2020 13:16:36 -0700, Christian Kuntz via samba
> <samba at lists.samba.org> wrote:
>
> > First of all, apologies if this isn't the right location for this
> question,
> > I couldn't find a CTDB specific mailing list or IRC so I figured the
> > general one would be appropriate. Please let me know if this question is
> > better placed elsewhere.
>
> This is the right place.  Thanks for asking!  :-)
>
> > I'm trying to test clustered samba and have a two node CTDB setup
> > (Following the guide here:
> > https://wiki.samba.org/index.php/CTDB_and_Clustered_Samba) and I can't
> seem
> > to get the current setup in a totally healthy state. If there is no
> > recovery lock enabled, then the two nodes think they are healthy and the
> > other is dead, and if there is a recovery lock enabled, one node seizes
> it
> > and never releases it.
> >
> > I'm running Debian Buster, with Lustre+ZFS as the clustered file system
> for
> > the recovery lock, and using upstream's Samba and CTDB. The lustre
> clients
> > are mounted with flock, and I've confirmed the lock is being held with
> > `lslocks`.
> >
> > Here is the debug information from the "healthy node" (the unhealthy one
> > just says it can't take the lock as it is under contention, so I thought
> it
> > would be of little use):
> >
> https://drive.google.com/drive/folders/1yzFdwGKfDAHiVbsj7pcOe-jX4KHpFO2U?usp=sharing
>
> Your main problem seems to be that the nodes are not connecting to each
> other:
>
>   $ ls ctdb*.log
>   ctdb-full.log  ctdb-locked-out-node.log  ctdb.log
>   $ grep "connected to" ctdb*.log
>   $
>
> You should see lines similar to this:
>
>   2020/05/07 16:58:06.186548 node.2 ctdbd[1351213]: 127.0.0.3:4379:
> connected to 127.0.0.2:4379 - 2 connected
>
> Lack of connectivity would also explain the 2nd node trying to take the
> recovery lock. It will elect itself leader of its own cluster and try
> to take the lock.  Luckily, the nodes are communicating at the
> filesystem level, so the lock can't be taken and the 2nd node cannot
> proceed with database recovery... so split brain is avoided.
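>
> For reference, with recent CTDB (Samba 4.9+) the recovery lock is set
> in ctdb.conf; a minimal sketch (the path is a placeholder -- it only
> needs to be on the clustered filesystem, visible to all nodes):
>
>   [cluster]
>       recovery lock = /mnt/lustre/.ctdb/reclock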
>
> Is there a firewall blocking TCP port 4379?
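>
> A quick way to check, assuming an iptables-based setup (adjust for
> nftables or firewalld as needed):
>
>   # Is anything filtering the CTDB port?
>   iptables -L -n | grep 4379
>   # Allow CTDB traffic between the nodes
>   iptables -A INPUT -p tcp --dport 4379 -j ACCEPT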
>
> peace & happiness,
> martin
>
