ctdb 4.11.2 version failed to recover

Fri Nov 29 07:20:59 UTC 2019

Hi,
       I am using ctdb 4.11.2, including the latest patch (https://bugzilla.samba.org/show_bug.cgi?id=14175). However, when I test a NIC failure, I hit a problem: the ctdb cluster cannot
recover to the normal state.
       The test steps are as follows:

1. The ctdb cluster has two nodes, nodeA and nodeB, and the cluster status is OK.

2. Run ifdown on the NICs of nodeA and nodeB that are configured with the private IP addresses.

3. After 25 seconds, nodeA and nodeB each detect the other as dead and then call ctdb_tcp_restart->ctdb_tcp_node_connect,
but bind fails, and the following is logged:

node 10.240.226.211:4379 is dead: 0 connected

Tearing down connection to dead node :1

Failed to bind socket (Cannot assign requested address)

ctdb_control error: 'node is disconnected'

ctdb-recoverd[17926]: ../../server/ctdb_client.c:1071 ctdb_control for getnodes failed ret:-1 res:-1

4. Finally, run ifup on the NICs of nodeA and nodeB. I waited a long time, but the ctdb cluster did not recover. The output of netstat -anp | grep ctdb shows no connection between the two nodes.
When the NIC comes up, the log shows:
ctdb-recoverd[17926]: Interface bond0.120:2 changed state: 0 => 1
Trigger takeoverrun
Takeover run starting
No nodes available to host public IPs yet
Monitoring event was cancelled
Takeover run completed successfully
Lock contention during renew: -16
/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper: Failed to drop lock on RADOS object 'lockctdb' - (No such file or directory)
Recovery lock helper terminated, triggering an election
Recovery mode set to ACTIVE
ctdb-recoverd[17926]: Election period ended
Node:0 was in recovery mode. Start recovery process
ctdb-recoverd[17926]: ../../server/ctdb_recoverd.c:1347 Starting do_recovery
                     Attempting to take recovery lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin cephfs_data lockctdb)
ctdb-recoverd[17926]: Unable to take recovery lock - contention
ctdb-recoverd[17926]: Unable to take recovery lock
ctdb-recoverd[17926]: Abort recovery, ban this node for 300 seconds
ctdb-recoverd[17926]: Banning node 0 for 300 seconds

Solution:
       When bind fails, nothing re-establishes the connection, even after the NIC is back up. I think that when bind fails we should use a timer to retry. The patch follows; I have tested it and it works well.

--- a/ctdb/tcp/tcp_connect.c
+++ b/ctdb/tcp/tcp_connect.c
@@ -236,6 +236,11 @@ void ctdb_tcp_node_connect(struct tevent_context *ev, struct tevent_timer *te,
                DBG_ERR("Failed to bind socket (%s)\n", strerror(errno));
                close(tnode->out_fd);
                tnode->out_fd = -1;
+               tnode->connect_te = tevent_add_timer(ctdb->ev,
+                                                       tnode,
+                                                       timeval_current_ofs(5, 0),
+                                                       ctdb_tcp_node_connect,
+                                                       node);
                return;
        }
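
For illustration only, here is a minimal standalone sketch of the retry idea behind the patch. It does not use tevent or any ctdb code: the names retry_state, connect_attempt, fake_nic, fake_bind and run_until_connected are hypothetical, the bind call is simulated, and the wait is only counted, not slept. The assumption is that bind keeps failing with EADDRNOTAVAIL while the private IP is missing and succeeds once the NIC is up again.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success, -1 with errno set on failure. */
typedef int (*bind_fn)(void *ctx);

struct retry_state {
	unsigned attempts;   /* bind attempts made so far */
	unsigned delay_secs; /* delay before the next attempt, 0 if none is scheduled */
};

/*
 * One connect attempt. On bind failure a retry is scheduled instead of
 * giving up, which mirrors the intent of the patch. Returns true once bound.
 */
static bool connect_attempt(struct retry_state *st, bind_fn try_bind, void *ctx)
{
	assert(st != NULL && try_bind != NULL);

	st->attempts++;
	if (try_bind(ctx) != 0) {
		st->delay_secs = 5; /* same 5 second delay as in the patch */
		return false;
	}
	st->delay_secs = 0;
	return true;
}

/* Simulated NIC: bind fails for the first up_after calls, then succeeds. */
struct fake_nic {
	unsigned up_after;
	unsigned calls;
};

static int fake_bind(void *ctx)
{
	struct fake_nic *nic = ctx;

	nic->calls++;
	if (nic->calls <= nic->up_after) {
		errno = EADDRNOTAVAIL; /* "Cannot assign requested address" */
		return -1;
	}
	return 0;
}

/*
 * Keep attempting until bound or max_attempts is reached.
 * Returns the total simulated wait in seconds.
 */
static unsigned run_until_connected(struct fake_nic *nic,
				    unsigned max_attempts,
				    bool *connected)
{
	struct retry_state st = { 0, 0 };
	unsigned waited = 0;

	*connected = false;
	while (st.attempts < max_attempts) {
		if (connect_attempt(&st, fake_bind, nic)) {
			*connected = true;
			break;
		}
		waited += st.delay_secs;
	}
	return waited;
}
```

In this sketch a NIC that comes back after three failed attempts is reached on the fourth attempt, whereas without the rescheduling step the first failure would be final, which matches the behaviour described in step 4.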

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ctdb.4.11.2.patch
Type: application/octet-stream
Size: 521 bytes
Desc: ctdb.4.11.2.patch
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20191129/80c0f366/ctdb.4.11.2.obj>