[Samba] CTDB: some problems about disconnecting the private network of ctdb leader nodes

tu.qiuping tu.qiuping at qq.com
Sun Nov 26 13:13:21 UTC 2023


My ctdb version is 4.17.7


Hello, everyone.
My ctdb cluster configuration is correct, and the cluster was healthy before the test.

My cluster has three nodes: host-192-168-34-164, host-192-168-34-165, and host-192-168-34-166. The node host-192-168-34-164 was the leader before the test.


I ran a network oscillation test on node host-192-168-34-164: I brought down the ctdb private network interface at 19:18:54.091439. The node then started recovery. What puzzles me is that at 19:18:59.822903 this node timed out while trying to obtain a lock, and the log shows “Time out getting recovery lock, allowing recovery mode set any way”. After that, host-192-168-34-164 took over all the virtual IPs.


I checked the ctdb source code and found the following comment at lines 578 to 582 of samba/ctdb/server/ctdb_recover.c: "Timeout. Consider this a success, not a failure, as we failed to set the recovery lock which is what we wanted. This can be caused by the cluster filesystem being very slow to arbitrate locks immediately after a node failure."
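
As a rough illustration of the behaviour that comment describes, here is a minimal Python sketch. It is not the ctdb code: the names LockAttempt and recmode_set_allowed are hypothetical, and only the timeout branch comes from the quoted comment. The handling of the other two outcomes is an assumption about the surrounding logic.

```python
from enum import Enum


class LockAttempt(Enum):
    """Possible results of an attempt to take the recovery lock (hypothetical names)."""
    TAKEN = "taken"          # this node obtained the lock itself
    CONTENDED = "contended"  # another holder already has the lock
    TIMEOUT = "timeout"      # no answer within the time limit


def recmode_set_allowed(result: LockAttempt) -> bool:
    """Return True if the recovery mode change is allowed to proceed."""
    if result is LockAttempt.TAKEN:
        # Assumption: obtaining the lock here means nobody else held it,
        # which is treated as an error.
        return False
    if result is LockAttempt.CONTENDED:
        # Assumption: the lock is held elsewhere, which is the expected case.
        return True
    # TIMEOUT: per the quoted comment this counts as a success, because
    # the lock was not obtained.
    return True


for result in LockAttempt:
    print(result.value, recmode_set_allowed(result))
```

In this sketch a timeout and a contended lock lead to the same decision, so the caller cannot tell a slow cluster filesystem apart from any other reason the attempt did not complete.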


I am puzzled as to why a timeout while getting the recovery lock (reclock) is considered a success. A slow cluster filesystem may cause such a timeout, but disconnecting the private network of the leader node can cause it as well. In that case the disconnected node takes over all the virtual IPs, which then conflict with the virtual IPs hosted by the other, healthy nodes. So, is it inappropriate to treat a reclock timeout as a success in this situation?
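
To make the concern concrete, here is a toy Python model of the scenario. It is only a sketch under stated assumptions, not a description of how ctdb actually assigns addresses: the function names are hypothetical, the virtual IPs are placeholder documentation addresses, the connected nodes are assumed to share the addresses round-robin, and the isolated node is assumed to claim every address once its recovery is allowed to finish.

```python
NODES = ["host-192-168-34-164", "host-192-168-34-165", "host-192-168-34-166"]
# Placeholder addresses for illustration only, not the real cluster's VIPs.
VIPS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]


def claimants(nodes, isolated, vips, timeout_is_success):
    """Map each virtual IP to the list of nodes that believe they host it."""
    claims = {vip: [] for vip in vips}
    connected = [node for node in nodes if node != isolated]
    # Assumption: the still-connected nodes share the addresses round-robin.
    for index, vip in enumerate(vips):
        claims[vip].append(connected[index % len(connected)])
    # Assumption: if the lock timeout counts as success, the isolated node
    # completes recovery on its own and claims every address as well.
    if timeout_is_success:
        for vip in vips:
            claims[vip].append(isolated)
    return claims


def conflicts(claims):
    """Return the virtual IPs claimed by more than one node."""
    return sorted(vip for vip, owners in claims.items() if len(owners) > 1)


lenient = claimants(NODES, NODES[0], VIPS, timeout_is_success=True)
strict = claimants(NODES, NODES[0], VIPS, timeout_is_success=False)
print("timeout counted as success:", conflicts(lenient))
print("timeout counted as failure:", conflicts(strict))
```

Under these assumptions every address ends up with two claimants when the timeout is accepted, and none when it is rejected. The strict variant would have its own cost, since the quoted comment says a slow cluster filesystem can also produce the timeout.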


The logs of the three nodes are attached.

