ctdb cluster not healthy: "Unable to take recovery lock - contention"

Martin Schwenke martin at meltin.net
Sat May 8 11:46:57 UTC 2021


On Sat, 8 May 2021 17:44:43 +0800 (CST), 风无名 via samba-technical
<samba-technical at lists.samba.org> wrote:

> Sorry, my attachments are too large.
> My CTDB version is 4.8.5.

> At 2021-05-08 16:34:57, "风无名" <wuming_81 at 163.com> wrote:
> 
> Hello, everyone.
> Many minutes after I started my CTDB cluster, it is still not healthy.
> The logs are in the attachment.
> My cluster consists of three nodes. /etc/hosts file:
> 192.168.200.10 node1
> 192.168.200.20 node2
> 192.168.200.30 node3
> 
> 
> public address config file:
> 192.168.210.10/24 ens15f1
> 192.168.210.30/24 ens15f1
> 192.168.210.20/24 ens15f1
> 
> 
> nodes config file:
> 192.168.200.10
> 192.168.200.30
> 192.168.200.20
> 
> 
> The CTDB lock file is /opt/ctdb/ctdb.lock.
> /opt/ctdb/ is a mount point of a GlusterFS cluster.
> The GlusterFS volume status:
> [root at node1 ctdb]# gluster v  status clusters_volume_ctdb
> Status of volume: clusters_volume_ctdb
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.200.10:/data/ctdb/192.168.200.10  49153     0          Y       6215
> Brick 192.168.200.30:/data/ctdb/192.168.200.30  49152     0          Y       17858
> Brick 192.168.200.20:/data/ctdb/192.168.200.20  49152     0          Y       9134
> 
> 
> I have examined the logs of the Gluster mount point and the Gluster server nodes and found no anomalies.
> 
> 
> ctdb status of the node1:
> [root at node1 ctdb]# ctdb status
> Number of nodes:3
> pnn:0 192.168.200.10   UNHEALTHY (THIS NODE)
> pnn:1 192.168.200.30   DISCONNECTED|UNHEALTHY|INACTIVE
> pnn:2 192.168.200.20   DISCONNECTED|UNHEALTHY|INACTIVE
> Generation:INVALID
> Size:3
> hash:0 lmaster:0
> hash:1 lmaster:1
> hash:2 lmaster:2
> Recovery mode:RECOVERY (1)
> Recovery master:0
> 
> 
> ctdb status of the node2:
> [root at node2 ctdb]# ctdb status
> Number of nodes:3
> pnn:0 192.168.200.10   DISCONNECTED|UNHEALTHY|INACTIVE
> pnn:1 192.168.200.30   DISCONNECTED|UNHEALTHY|INACTIVE
> pnn:2 192.168.200.20   OK (THIS NODE)
> Generation:1475941203
> Size:1
> hash:0 lmaster:2
> Recovery mode:NORMAL (0)
> Recovery master:2
> 
> 
> ctdb status of node3:
> [root at node3 ~]# ctdb status
> Number of nodes:3
> pnn:0 192.168.200.10   DISCONNECTED|UNHEALTHY|INACTIVE
> pnn:1 192.168.200.30   UNHEALTHY (THIS NODE)
> pnn:2 192.168.200.20   DISCONNECTED|UNHEALTHY|INACTIVE
> Generation:INVALID
> Size:3
> hash:0 lmaster:0
> hash:1 lmaster:1
> hash:2 lmaster:2
> Recovery mode:RECOVERY (1)
> Recovery master:1

The above "ctdb status" output tells you that the CTDB nodes are not
connecting to each other. The logs also do not show the nodes
connecting.  I would look here:

  https://wiki.samba.org/index.php/Basic_CTDB_configuration#Troubleshooting

Is there a firewall blocking connections to TCP port 4379?
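
For example (assuming firewalld is in use; adapt if you manage iptables
or nftables directly), checks like these on each node might help confirm
whether the port is open and ctdbd is listening:

  # Active firewall configuration (firewalld assumed):
  firewall-cmd --list-all

  # Confirm ctdbd is listening on TCP port 4379 on this node:
  ss -ltn | grep 4379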

> The ping_pong test results (with the cluster running):
> [root at node1 ~]# ping_pong -l  /opt/ctdb/ctdb.lock 
> file already locked, calling check_lock to tell us who has it locked:
> check_lock failed: lock held: pid='0', type='1', start='0', len='1'
> Working POSIX byte range locks
> 
> 
> [root at node2 ~]#  ping_pong -l  /opt/ctdb/ctdb.lock
> file already locked, calling check_lock to tell us who has it locked:
> check_lock failed: lock held: pid='19142', type='1', start='0', len='1'
> Working POSIX byte range locks
> 
> 
> [root at node3 ~]#  ping_pong -l  /opt/ctdb/ctdb.lock
> file already locked, calling check_lock to tell us who has it locked:
> check_lock failed: lock held: pid='0', type='1', start='0', len='1'
> Working POSIX byte range locks
> 
> 
> I have searched many pages for a long time but have not been able to solve this problem.
> Thanks for any advice.

I'm not sure there is actually a locking issue.  The logs show
contention for the recovery lock, which indicates the lock is already
held elsewhere, so POSIX locking on the cluster filesystem appears to
be working.

I suggest checking why the nodes can't connect to each other via TCP.
As mentioned above, this may be due to a firewall.
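
For instance, from node1 something like this (nc is only one option;
any TCP connectivity test will do) would show whether ctdbd on the
other nodes is reachable:

  # Can node1 reach ctdbd on node2 (192.168.200.20) and node3 (192.168.200.30)?
  nc -vz 192.168.200.20 4379
  nc -vz 192.168.200.30 4379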

By the way, this question really belongs on the "samba" mailing list,
rather than on "samba-technical"...  ;-)

peace & happiness,
martin


