ctdb 4.11.2 version failed to recover
耿纪超
gengjichao at jd.com
Fri Nov 29 07:20:59 UTC 2019
Hi,
I am using CTDB 4.11.2, including the latest patch (https://bugzilla.samba.org/show_bug.cgi?id=14175). However, when I test a NIC failure, I hit a problem: the CTDB cluster cannot recover to a normal state.
The test steps are as follows:
1. The CTDB cluster has two nodes, nodeA and nodeB, and the cluster status is OK.
2. Run ifdown on the NICs of nodeA and nodeB that are configured with the private IPs.
3. After 25 seconds, nodeA and nodeB each detect the other as dead and then call ctdb_tcp_restart -> ctdb_tcp_node_connect, but bind fails and the following is logged:
node 10.240.226.211:4379 is dead: 0 connected
Tearing down connection to dead node :1
Failed to bind socket (Cannot assign requested address)
ctdb_control error: 'node is disconnected'
ctdb-recoverd[17926]: ../../server/ctdb_client.c:1071 ctdb_control for getnodes failed ret:-1 res:-1
4. Finally, run ifup on the NICs of nodeA and nodeB. I waited a long time, but the CTDB cluster did not recover. Running netstat -anp | grep ctdb shows no connections between the two nodes.
After the NICs come up, the log shows:
ctdb-recoverd[17926]: Interface bond0.120:2 changed state: 0 => 1
Trigger takeoverrun
Takeover run starting
No nodes available to host public IPs yet
Monitoring event was cancelled
Takeover run completed successfully
Lock contention during renew: -16
/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper: Failed to drop lock on RADOS object 'lockctdb' - (No such file or directory)
Recovery lock helper terminated, triggering an election
Recovery mode set to ACTIVE
ctdb-recoverd[17926]: Election period ended
Node:0 was in recovery mode. Start recovery process
ctdb-recoverd[17926]: ../../server/ctdb_recoverd.c:1347 Starting do_recovery
Attempting to take recovery lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin cephfs_data lockctdb)
ctdb-recoverd[17926]: Unable to take recovery lock - contention
ctdb-recoverd[17926]: Unable to take recovery lock
ctdb-recoverd[17926]: Abort recovery, ban this node for 300 seconds
ctdb-recoverd[17926]: Banning node 0 for 300 seconds
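For reference, "Cannot assign requested address" in step 3 is the strerror text for EADDRNOTAVAIL, which bind returns when the requested local address is not configured on any interface. That is presumably the situation while the NIC holding the private IP is down. Below is a minimal sketch to reproduce the error outside CTDB; try_bind_ipv4 is a hypothetical helper written for illustration, not a CTDB function.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of CTDB): try to bind a TCP socket to
 * the given IPv4 address with an ephemeral port.
 * Returns 0 on success, otherwise the errno value of the failing step
 * (EINVAL if the string is not a valid IPv4 address).
 */
static int try_bind_ipv4(const char *ip)
{
    struct sockaddr_in addr;
    int fd;
    int err = 0;

    assert(ip != NULL);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(0); /* let the kernel pick a port */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        return EINVAL;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) {
        return errno;
    }

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        err = errno; /* save before close() can overwrite errno */
    }
    close(fd);
    return err;
}
```

Calling try_bind_ipv4 with the node's private IP while the NIC is down should return EADDRNOTAVAIL on a typical Linux system, assuming ifdown removed the address and net.ipv4.ip_nonlocal_bind is not enabled. With the NIC up it should return 0.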
Solution:
When bind fails, nothing re-establishes the connection, even after the NIC is up again. I think that when bind fails, we should use a timer to retry. The patch is below; I have tested it and it works well.
--- a/ctdb/tcp/tcp_connect.c
+++ b/ctdb/tcp/tcp_connect.c
@@ -236,6 +236,11 @@ void ctdb_tcp_node_connect(struct tevent_context *ev, struct tevent_timer *te,
 		DBG_ERR("Failed to bind socket (%s)\n", strerror(errno));
 		close(tnode->out_fd);
 		tnode->out_fd = -1;
+		tnode->connect_te = tevent_add_timer(ctdb->ev,
+						     tnode,
+						     timeval_current_ofs(5, 0),
+						     ctdb_tcp_node_connect,
+						     node);
 		return;
 	}
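
The retry idea can also be sketched without tevent. The following is only an illustration under stated assumptions, not the CTDB implementation: it blocks and sleeps between attempts, whereas the patch above schedules a tevent timer and returns to the event loop. The names retry_on_addr_unavailable, attempt_fn and fake_bind are hypothetical; fake_bind just simulates a bind that fails with EADDRNOTAVAIL a given number of times.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for nanosleep() */
#endif

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns 0 on success, -1 with errno set on failure. */
typedef int (*attempt_fn)(void *ctx);

/*
 * Call attempt() until it succeeds, sleeping delay_ms between tries.
 * Only EADDRNOTAVAIL is treated as retryable.
 * Returns the number of attempts used on success, or -1 if max_attempts
 * was exhausted or a different error occurred.
 */
static int retry_on_addr_unavailable(attempt_fn attempt, void *ctx,
                                     int max_attempts, long delay_ms)
{
    assert(max_attempts > 0);

    for (int i = 1; i <= max_attempts; i++) {
        if (attempt(ctx) == 0) {
            return i;
        }
        if (errno != EADDRNOTAVAIL) {
            return -1; /* not the "address missing" case, give up */
        }
        if (i < max_attempts) {
            struct timespec ts;
            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);
        }
    }
    return -1;
}

/*
 * Simulated bind for illustration: fails with EADDRNOTAVAIL while
 * the counter behind ctx is positive, then succeeds.
 */
static int fake_bind(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        errno = EADDRNOTAVAIL;
        return -1;
    }
    return 0;
}
```

For example, with a counter of 2, retry_on_addr_unavailable(fake_bind, &counter, 5, 1) succeeds on the third attempt. Inside ctdbd a blocking loop like this would stall the event loop, which is why the patch uses tevent_add_timer with a 5 second offset instead.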
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ctdb.4.11.2.patch
Type: application/octet-stream
Size: 521 bytes
Desc: ctdb.4.11.2.patch
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20191129/80c0f366/ctdb.4.11.2.obj>