[Samba] Re: CTDB: some problems about disconnecting the private network of ctdb leader nodes

tu.qiuping tu.qiuping at qq.com
Tue Nov 28 15:18:14 UTC 2023


Thank you for your reply, Martin.


As you said, "If the leader becomes disconnected then the cluster lock will need to be released (e.g. by the underlying filesystem) before another node can be leader.  2 nodes should never be leader at the same time.", so what I want to ask is, how does the leader node know it has a problem when the network is abnormal. From the log, it thinks that other nodes are experiencing network anomalies and have taken over all virtual IPs. Or is it true that cluster management is incomplete in this situation.


I have another question: why is there no log about failing to renew the reclock for such a long time after the leader node's network is disconnected? The lock duration I set is 2 seconds.
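

For reference, the cluster lock on my nodes is configured roughly as below. I have reconstructed this from the helper command that appears in the logs further down; the [cluster] "recovery lock" key and my reading of the helper arguments (with the stock ctdb_mutex_ceph_rados_helper, the first number after the RADOS object name should be the lock duration in seconds) are assumptions on my part, so the 4 here may not map directly to the 2 seconds I mentioned above:

    [cluster]
        # A leading "!" makes ctdbd run an external mutex helper instead of
        # taking an fcntl lock on a cluster filesystem.
        # Helper arguments: <ceph cluster> <ceph user> <RADOS pool> <RADOS object>,
        # then (assuming the stock helper) the lock duration in seconds.
        recovery lock = !/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096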


What I think is that, once the leader node's network becomes abnormal, it should fail to renew the reclock and keep retrying, instead of taking over all the virtual IPs first.


Here are the logs from the three nodes:
host-192-168-34-164:
2023-11-23T19:18:54.091287+08:00 host-192-168-34-164 ctdbd[6708]: dead count reached for node 1
2023-11-23T19:18:54.091331+08:00 host-192-168-34-164 ctdbd[6708]: Tearing down connection to dead node :1
2023-11-23T19:18:54.091378+08:00 host-192-168-34-164 ctdbd[6708]: 100.100.34.164:4379: node 100.100.34.165:4379 is dead: 1 connected
2023-11-23T19:18:54.091411+08:00 host-192-168-34-164 ctdbd[6708]: dead count reached for node 2
2023-11-23T19:18:54.091416+08:00 host-192-168-34-164 ctdbd[6708]: Tearing down connection to dead node :2
2023-11-23T19:18:54.091439+08:00 host-192-168-34-164 ctdbd[6708]: 100.100.34.164:4379: node 100.100.34.166:4379 is dead: 0 connected
2023-11-23T19:18:54.091431+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091460+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091475+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.091491+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091498+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091504+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.091523+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Async wait failed - fail_count=2
2023-11-23T19:18:54.091532+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-11-23T19:18:54.091538+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities.
2023-11-23T19:18:54.091543+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:2640 Unable to update node capabilities.
2023-11-23T19:18:54.093043+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:2782 The vnnmap count is different from the number of active lmaster nodes: 3 vs 1
2023-11-23T19:18:54.093065+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1314 Starting do_recovery
2023-11-23T19:18:54.093070+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1358 Recovery initiated due to problem with node 0
2023-11-23T19:18:54.093108+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1388 Recovery - updated flags
2023-11-23T19:18:54.127260+08:00 host-192-168-34-164 ctdbd[6708]: Recovery mode set to ACTIVE
2023-11-23T19:18:54.127318+08:00 host-192-168-34-164 ctdb-recovery[3208978]: Set recovery mode to ACTIVE
2023-11-23T19:18:54.127390+08:00 host-192-168-34-164 ctdbd[6708]: Recovery has started
2023-11-23T19:18:54.175221+08:00 host-192-168-34-164 ctdb-recovery[3208978]: start_recovery event finished
2023-11-23T19:18:54.175289+08:00 host-192-168-34-164 ctdb-recovery[3208978]: updated VNNMAP
2023-11-23T19:18:54.175302+08:00 host-192-168-34-164 ctdb-recovery[3208978]: recover database 0x6645c6c4
2023-11-23T19:18:54.175369+08:00 host-192-168-34-164 ctdbd[6708]: Freeze db: ctdb.tdb
2023-11-23T19:18:54.820651+08:00 host-192-168-34-164 ctdbd[6708]: Thaw db: ctdb.tdb generation 1832634741
2023-11-23T19:18:54.820686+08:00 host-192-168-34-164 ctdbd[6708]: Release freeze handle for db ctdb.tdb
2023-11-23T19:18:54.820705+08:00 host-192-168-34-164 ctdbd[6708]: Resent calls for database=ctdb.tdb, generation=1832634741, count=0
2023-11-23T19:18:54.820908+08:00 host-192-168-34-164 ctdb-recovery[3208978]: 1 of 1 databases recovered
2023-11-23T19:18:54.820952+08:00 host-192-168-34-164 ctdbd[6708]: Recovery mode set to NORMAL
2023-11-23T19:18:59.822903+08:00 host-192-168-34-164 ctdbd[6708]: ../../ctdb/server/ctdb_recover.c:585Time out getting recovery lock, allowing recmode set anyway
2023-11-23T19:18:59.823217+08:00 host-192-168-34-164 ctdb-recovery[3208978]: Set recovery mode to NORMAL
2023-11-23T19:18:59.823441+08:00 host-192-168-34-164 ctdbd[6708]: Recovery has finished
2023-11-23T19:18:59.863137+08:00 host-192-168-34-164 ctdb-recovery[3208978]: recovered event finished
2023-11-23T19:18:59.863235+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run starting
2023-11-23T19:18:59.876831+08:00 host-192-168-34-164 ctdbd[6708]: Takeover of IP 101.101.34.172/16 on interface bond0.909
2023-11-23T19:18:59.877065+08:00 host-192-168-34-164 ctdbd[6708]: Takeover of IP 101.101.34.171/16 on interface bond0.909
2023-11-23T19:18:59.960718+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.172 before take ip
2023-11-23T19:18:59.960756+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.172 before take ip success
2023-11-23T19:18:59.960764+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ganesha-101.101.34.172
2023-11-23T19:19:00.030334+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip start **********
2023-11-23T19:19:00.030369+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: container ganesha-101.101.34.172 is exist, restart it
2023-11-23T19:19:00.030375+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip end ************
2023-11-23T19:19:00.137205+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.171 before take ip
2023-11-23T19:19:00.137245+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.171 before take ip success
2023-11-23T19:19:00.137254+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ganesha-101.101.34.171
2023-11-23T19:19:00.172884+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip start **********
2023-11-23T19:19:00.172915+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: container ganesha-101.101.34.171 is exist, restart it
2023-11-23T19:19:00.172921+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip end ************
2023-11-23T19:19:00.231998+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: Reconfiguring service "nfs"...
2023-11-23T19:19:00.237079+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run completed successfully
2023-11-23T19:19:00.237119+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1406 Recovery complete
2023-11-23T19:19:00.237126+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Resetting ban count to 0 for all nodes
2023-11-23T19:19:00.237132+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Just finished a recovery. New recoveries will now be suppressed for the rerecovery timeout (10 seconds)
2023-11-23T19:19:00.237138+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Disabling recoveries for 10 seconds
2023-11-23T19:19:01.319639+08:00 host-192-168-34-164 ctdbd[6708]: iface[bond0.909] has changed it's link status up => down
2023-11-23T19:19:01.320721+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ERROR: public network interface bond0 is down
2023-11-23T19:19:01.320756+08:00 host-192-168-34-164 ctdb-eventd[6746]: monitor event failed
2023-11-23T19:19:01.320794+08:00 host-192-168-34-164 ctdbd[6708]: monitor event failed - disabling node
2023-11-23T19:19:01.321211+08:00 host-192-168-34-164 ctdbd[6708]: Node became UNHEALTHY. Ask recovery master to reallocate IPs
2023-11-23T19:19:02.240905+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Interface bond0.909 changed state: 1 => 0
2023-11-23T19:19:02.241879+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Trigger takeoverrun
2023-11-23T19:19:02.241945+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run starting
2023-11-23T19:19:02.247568+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Failed to find node to cover ip 101.101.34.172
2023-11-23T19:19:02.247599+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Failed to find node to cover ip 101.101.34.171
2023-11-23T19:19:02.247604+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Failed to find node to cover ip 101.101.34.170
2023-11-23T19:19:02.247977+08:00 host-192-168-34-164 ctdbd[6708]: Release of IP 101.101.34.172/16 on interface bond0.909  node:-1
2023-11-23T19:19:02.248325+08:00 host-192-168-34-164 ctdbd[6708]: Release of IP 101.101.34.171/16 on interface bond0.909  node:-1
2023-11-23T19:19:02.248632+08:00 host-192-168-34-164 ctdbd[6708]: Release of IP 101.101.34.170/16 on interface bond0.909  node:-1
2023-11-23T19:19:02.264822+08:00 host-192-168-34-164 ctdbd[6708]: Starting traverse on DB ctdb.tdb (id 4014103)
2023-11-23T19:19:02.265666+08:00 host-192-168-34-164 ctdbd[6708]: Ending traverse on DB ctdb.tdb (id 4014103), records 0
2023-11-23T19:19:06.321754+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:10.238115+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Reenabling recoveries after timeout
2023-11-23T19:19:11.322500+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:12.576955+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip start **********
2023-11-23T19:19:12.576987+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ganesha-101.101.34.172
2023-11-23T19:19:12.576993+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: stop ganesha-101.101.34.172 success
2023-11-23T19:19:12.576997+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip end ************
2023-11-23T19:19:16.323713+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:21.324677+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:22.863158+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip start **********
2023-11-23T19:19:22.863196+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ganesha-101.101.34.171
2023-11-23T19:19:22.863202+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: stop ganesha-101.101.34.171 success
2023-11-23T19:19:22.863206+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip end ************
2023-11-23T19:19:26.325402+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:29.254184+08:00 host-192-168-34-164 ctdb-takeover[3213011]: RELEASE_IP 101.101.34.170 failed on node 0, ret=110
2023-11-23T19:19:29.254226+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Assigning banning credits to node 0
2023-11-23T19:19:29.254270+08:00 host-192-168-34-164 ctdb-takeover[3213011]: takeover run failed, ret=110
2023-11-23T19:19:29.254429+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_takeover_run() failed
2023-11-23T19:19:29.254461+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run unsuccessful
2023-11-23T19:19:29.255705+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Assigned IP 101.101.34.170 not on an interface
2023-11-23T19:19:29.255725+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Trigger takeoverrun
2023-11-23T19:19:29.255937+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run starting
2023-11-23T19:19:29.262165+08:00 host-192-168-34-164 ctdb-takeover[3221874]: Failed to find node to cover ip 101.101.34.172
2023-11-23T19:19:29.262213+08:00 host-192-168-34-164 ctdb-takeover[3221874]: Failed to find node to cover ip 101.101.34.171
2023-11-23T19:19:29.262220+08:00 host-192-168-34-164 ctdb-takeover[3221874]: Failed to find node to cover ip 101.101.34.170
2023-11-23T19:19:31.326780+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:32.249734+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip start **********
2023-11-23T19:19:32.250519+08:00 host-192-168-34-164 ctdb-eventd[6746]: releaseip event timed out
2023-11-23T19:19:32.251108+08:00 host-192-168-34-164 ctdbd[6708]: Ignoring hung script for releaseip event
2023-11-23T19:19:32.251664+08:00 host-192-168-34-164 ctdbd[6708]: pnn 0 Invalid reqid 4014083 in ctdb_reply_control
2023-11-23T19:19:32.283931+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: Reconfiguring service "nfs"...
2023-11-23T19:19:32.288809+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run completed successfully
2023-11-23T19:19:34.316783+08:00 host-192-168-34-164 ctdbd[6708]: Starting traverse on DB ctdb.tdb (id 4014184)
2023-11-23T19:19:34.317661+08:00 host-192-168-34-164 ctdbd[6708]: Ending traverse on DB ctdb.tdb (id 4014184), records 0
2023-11-23T19:19:36.348451+08:00 host-192-168-34-164 ctdb-eventd[6746]: 05.system: WARNING: System memory utilization 85% >= threshold 80%
2023-11-23T19:19:36.378082+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ERROR: public network interface bond0 is down
2023-11-23T19:19:36.378118+08:00 host-192-168-34-164 ctdb-eventd[6746]: monitor event failed

......




host-192-168-34-165:
2023-11-23T16:53:01.979285+08:00 host-192-168-34-165 ctdb-eventd[6610]: 05.system: NOTICE: System memory utilization 79% < threshold 80%
2023-11-23T16:53:12.671558+08:00 host-192-168-34-165 ctdb-eventd[6610]: 05.system: WARNING: System memory utilization 80% >= threshold 80%
2023-11-23T19:18:54.090491+08:00 host-192-168-34-165 ctdbd[6607]: dead count reached for node 0
2023-11-23T19:18:54.090547+08:00 host-192-168-34-165 ctdbd[6607]: Tearing down connection to dead node :0
2023-11-23T19:18:54.090598+08:00 host-192-168-34-165 ctdbd[6607]: 100.100.34.165:4379: node 100.100.34.164:4379 is dead: 1 connected
2023-11-23T19:18:54.090658+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090696+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090706+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.090714+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Async wait failed - fail_count=1
2023-11-23T19:18:54.090720+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-11-23T19:18:54.090727+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-11-23T19:18:54.090733+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:2640 Unable to update node capabilities.
2023-11-23T19:18:56.311737+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Leader broadcast timeout
2023-11-23T19:18:56.311782+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Start election
2023-11-23T19:18:56.311874+08:00 host-192-168-34-165 ctdbd[6607]: Recovery mode set to ACTIVE
2023-11-23T19:18:56.312221+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096)
2023-11-23T19:18:58.313719+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Unable to take cluster lock - timeout
2023-11-23T19:19:01.313924+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Leader broadcast timeout
2023-11-23T19:19:01.313970+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Start election
2023-11-23T19:19:01.314313+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096)
2023-11-23T19:19:01.425277+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Cluster lock taken successfully
2023-11-23T19:19:01.425323+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Took cluster lock, leader=1
2023-11-23T19:19:01.426544+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Node:1 was in recovery mode. Start recovery process
2023-11-23T19:19:01.426591+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Node:2 was in recovery mode. Start recovery process
2023-11-23T19:19:01.426600+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1314 Starting do_recovery
2023-11-23T19:19:01.426606+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1358 Recovery initiated due to problem with node 0
2023-11-23T19:19:01.426968+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1388 Recovery - updated flags
2023-11-23T19:19:01.427006+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Set recovery_helper to "/usr/libexec/ctdb/ctdb_recovery_helper"
2023-11-23T19:19:01.436697+08:00 host-192-168-34-165 ctdb-recovery[1322606]: Set recovery mode to ACTIVE
2023-11-23T19:19:01.436760+08:00 host-192-168-34-165 ctdbd[6607]: Recovery has started
2023-11-23T19:19:01.491485+08:00 host-192-168-34-165 ctdb-recovery[1322606]: start_recovery event finished
2023-11-23T19:19:01.491851+08:00 host-192-168-34-165 ctdb-recovery[1322606]: updated VNNMAP
2023-11-23T19:19:01.491888+08:00 host-192-168-34-165 ctdb-recovery[1322606]: recover database 0x6645c6c4
2023-11-23T19:19:01.492199+08:00 host-192-168-34-165 ctdbd[6607]: Freeze db: ctdb.tdb
2023-11-23T19:19:01.506697+08:00 host-192-168-34-165 ctdbd[6607]: Thaw db: ctdb.tdb generation 996306493
2023-11-23T19:19:01.506719+08:00 host-192-168-34-165 ctdbd[6607]: Release freeze handle for db ctdb.tdb
2023-11-23T19:19:01.506736+08:00 host-192-168-34-165 ctdbd[6607]: Resent calls for database=ctdb.tdb, generation=996306493, count=0
2023-11-23T19:19:01.506985+08:00 host-192-168-34-165 ctdb-recovery[1322606]: 1 of 1 databases recovered
2023-11-23T19:19:01.507034+08:00 host-192-168-34-165 ctdbd[6607]: Recovery mode set to NORMAL
2023-11-23T19:19:01.615271+08:00 host-192-168-34-165 ctdb-recovery[1322606]: Set recovery mode to NORMAL
2023-11-23T19:19:01.615383+08:00 host-192-168-34-165 ctdbd[6607]: Recovery has finished
2023-11-23T19:19:01.668818+08:00 host-192-168-34-165 ctdb-recovery[1322606]: recovered event finished
2023-11-23T19:19:01.668895+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Takeover run starting
2023-11-23T19:19:01.668963+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Set takeover_helper to "/usr/libexec/ctdb/ctdb_takeover_helper"
2023-11-23T19:19:01.854276+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Takeover run completed successfully
2023-11-23T19:19:01.854317+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1406 Recovery complete
2023-11-23T19:19:01.854325+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Resetting ban count to 0 for all nodes
2023-11-23T19:19:01.854331+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Just finished a recovery. New recoveries will now be suppressed for the rerecovery timeout (10 seconds)
2023-11-23T19:19:01.854341+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Disabling recoveries for 10 seconds
2023-11-23T19:19:11.855338+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Reenabling recoveries after timeout



host-192-168-34-166:
2023-11-23T19:18:54.090703+08:00 host-192-168-34-166 ctdbd[6611]: dead count reached for node 0
2023-11-23T19:18:54.090747+08:00 host-192-168-34-166 ctdbd[6611]: Tearing down connection to dead node :0
2023-11-23T19:18:54.090789+08:00 host-192-168-34-166 ctdbd[6611]: 100.100.34.166:4379: node 100.100.34.164:4379 is dead: 1 connected
2023-11-23T19:18:54.090935+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090962+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090970+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.090975+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Async wait failed - fail_count=1
2023-11-23T19:18:54.090980+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-11-23T19:18:54.090985+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-11-23T19:18:54.090989+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ../../ctdb/server/ctdb_recoverd.c:2640 Unable to update node capabilities.
2023-11-23T19:18:56.312040+08:00 host-192-168-34-166 ctdbd[6611]: Recovery mode set to ACTIVE
2023-11-23T19:18:56.312591+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096)
2023-11-23T19:18:58.314499+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Unable to take cluster lock - timeout
2023-11-23T19:19:01.436922+08:00 host-192-168-34-166 ctdbd[6611]: Recovery has started
2023-11-23T19:19:01.492286+08:00 host-192-168-34-166 ctdbd[6611]: Freeze db: ctdb.tdb
2023-11-23T19:19:01.506836+08:00 host-192-168-34-166 ctdbd[6611]: Thaw db: ctdb.tdb generation 996306493
2023-11-23T19:19:01.506859+08:00 host-192-168-34-166 ctdbd[6611]: Release freeze handle for db ctdb.tdb
2023-11-23T19:19:01.506876+08:00 host-192-168-34-166 ctdbd[6611]: Resent calls for database=ctdb.tdb, generation=996306493, count=0
2023-11-23T19:19:01.507116+08:00 host-192-168-34-166 ctdbd[6611]: Recovery mode set to NORMAL
2023-11-23T19:19:01.615475+08:00 host-192-168-34-166 ctdbd[6611]: Recovery has finished
2023-11-23T19:19:01.669217+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Disabling takeover runs for 60 seconds
2023-11-23T19:19:01.681095+08:00 host-192-168-34-166 ctdbd[6611]: Takeover of IP 101.101.34.170/16 on interface bond0.909
2023-11-23T19:19:01.749681+08:00 host-192-168-34-166 ctdb-eventd[6614]: 10.interface: stop ganesha-101.101.34.170 before take ip
2023-11-23T19:19:01.749709+08:00 host-192-168-34-166 ctdb-eventd[6614]: 10.interface: stop ganesha-101.101.34.170 before take ip success
2023-11-23T19:19:01.749716+08:00 host-192-168-34-166 ctdb-eventd[6614]: 10.interface: ganesha-101.101.34.170
2023-11-23T19:19:01.788451+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: ********** this is nfs_takeip start **********
2023-11-23T19:19:01.788480+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: container ganesha-101.101.34.170 is exist, restart it
2023-11-23T19:19:01.788485+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: ********** this is nfs_takeip end ************
2023-11-23T19:19:01.839082+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Received leader broadcast, leader=1
2023-11-23T19:19:01.848160+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: Reconfiguring service "nfs"...
2023-11-23T19:19:01.854402+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Reenabling takeover runs
2023-11-23T19:19:03.882907+08:00 host-192-168-34-166 ctdbd[6611]: Starting traverse on DB ctdb.tdb (id 324774)
2023-11-23T19:19:03.883833+08:00 host-192-168-34-166 ctdbd[6611]: Ending traverse on DB ctdb.tdb (id 324774), records 0



------------------ Original Message ------------------
From: "Martin Schwenke" <martin at meltin.net>;
Date: Tuesday, 28 November 2023, 10:22 AM
To: "samba" <samba at lists.samba.org>;
Cc: "tu.qiuping" <tu.qiuping at qq.com>;
Subject: Re: [Samba] CTDB: some problems about disconnecting the private network of ctdb leader nodes



Hi,

On Sun, 26 Nov 2023 21:13:21 +0800, "tu.qiuping via samba"
<samba at lists.samba.org> wrote:

> My ctdb version is 4.17.7
> 
> 
> My ctdb cluster configuration is correct and the cluster is healthy before operation.
> 
> My cluster has three nodes, namely host-192-168-34-164,
> host-192-168-34-165, and host-192-168-34-166. And the node
> host-192-168-34-164 is the leader before operation.
> 
> 
> I conducted network oscillation testing on node
> host-192-168-34-164: I brought down the CTDB private network
> interface at 19:18:54.091439, and then this node started to do
> recovery. What puzzles me is that at 19:18:59.822903 this node timed
> out obtaining the lock (the log shows “Time out getting recovery
> lock, allowing recmode set anyway”), and then host-192-168-34-164
> took over all the virtual IPs.
> 
> 
> I checked the source code of ctdb and found that lines 578 to 582 of
> the file samba/ctdb/server/ctdb_recover.c state: Timeout.
> Consider this a success, not a failure, as we failed to set the
> recovery lock which is what we wanted.  This can be caused by
> the cluster filesystem being very slow to arbitrate locks immediately
> after a node failure.
> 
> 
> I am puzzled why a timeout when getting the reclock is considered
> successful. Although a slow cluster filesystem may cause such a
> timeout, disconnecting the private network of the leader node can
> also cause this situation. The disconnected node will then take over
> all virtual IPs, which conflict with the virtual IPs held by the
> other, healthy nodes. So, is it inappropriate to treat a reclock
> timeout as success in this situation?

It is considered successful because it wasn't able to take the lock in
a conflicting way.

This part of the code doesn't have anything to do with selecting the
leader node.  It is a sanity check at the end of recovery to ensure
that cluster lock can't be taken by nodes/processes that should not be
able to take it.  Given the way this lock is now used to determine the
leader, this check is definitely in the wrong place, since this is a
cluster management issue, not a recovery issue.

If the leader becomes disconnected then the cluster lock will need to
be released (e.g. by the underlying filesystem) before another node can
be leader.  2 nodes should never be leader at the same time.

> The logs of the three nodes are attached.

I think mailman might have stripped the logs...  :-(

I think I need to see the logs before I can say anything very useful...

peace & happiness,
martin

