[Samba] ctdb split brain: nodes don't see each other

cabowabo axel.weber at cbc.de
Thu Jul 3 09:31:54 MDT 2014


Hi,

I’ve set up a simple two-node ctdb cluster. I copied the config file from an existing system.

This is what happens (ctdb status output):

Node 1, alone
Number of nodes:2
pnn:0 10.0.0.1         OK (THIS NODE)
pnn:1 10.0.0.2         DISCONNECTED|UNHEALTHY|INACTIVE
Generation:1369816268
Size:1
hash:0 lmaster:0
Recovery mode:NORMAL (0)
Recovery master:0


Node 1, after starting ctdb on Node 2
Number of nodes:2
pnn:0 10.0.0.1         OK (THIS NODE)
pnn:1 10.0.0.2         UNHEALTHY
Generation:1369816268
Size:1
hash:0 lmaster:0
Recovery mode:NORMAL (0)
Recovery master:0

Node 1, 1 minute later
Number of nodes:2
pnn:0 10.0.0.1         OK (THIS NODE)
pnn:1 10.0.0.2         DISCONNECTED|UNHEALTHY|INACTIVE
Generation:1369816268
Size:1
hash:0 lmaster:0
Recovery mode:NORMAL (0)
Recovery master:0


Node 2
Number of nodes:2
pnn:0 10.0.0.1         DISCONNECTED|UNHEALTHY|INACTIVE
pnn:1 10.0.0.2         OK (THIS NODE)
Generation:2125944281
Size:1
hash:0 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:1

-> This results in split brain: each node is its own recovery master, and both nodes hold the public IP.
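
The two views above can also be compared mechanically. The following is only a rough sketch, assuming the status text has exactly the layout pasted above; the function names `parse_status`, `sees_peer_disconnected` and `looks_split` are made up for this illustration and are not part of ctdb.

```python
import re

# Status text as pasted above (Node 1 one minute later, and Node 2).
NODE1_STATUS = """\
Number of nodes:2
pnn:0 10.0.0.1         OK (THIS NODE)
pnn:1 10.0.0.2         DISCONNECTED|UNHEALTHY|INACTIVE
Generation:1369816268
Size:1
hash:0 lmaster:0
Recovery mode:NORMAL (0)
Recovery master:0
"""

NODE2_STATUS = """\
Number of nodes:2
pnn:0 10.0.0.1         DISCONNECTED|UNHEALTHY|INACTIVE
pnn:1 10.0.0.2         OK (THIS NODE)
Generation:2125944281
Size:1
hash:0 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:1
"""


def parse_status(text):
    """Extract node flags, generation and recovery master from status text."""
    info = {"nodes": {}, "this_pnn": None, "generation": None, "recmaster": None}
    for raw in text.splitlines():
        line = raw.strip()
        node = re.match(r"pnn:(\d+)\s+(\S+)\s+(.+)$", line)
        gen = re.match(r"Generation:(\d+)$", line)
        rec = re.match(r"Recovery master:(\d+)$", line)
        if node:
            pnn, addr, flags = int(node.group(1)), node.group(2), node.group(3)
            if flags.endswith("(THIS NODE)"):
                info["this_pnn"] = pnn
                flags = flags[: -len("(THIS NODE)")].strip()
            info["nodes"][pnn] = {"addr": addr, "flags": flags.split("|")}
        elif gen:
            info["generation"] = int(gen.group(1))
        elif rec:
            info["recmaster"] = int(rec.group(1))
    return info


def sees_peer_disconnected(info):
    """True if any node other than the local one is flagged DISCONNECTED."""
    return any(
        "DISCONNECTED" in node["flags"]
        for pnn, node in info["nodes"].items()
        if pnn != info["this_pnn"]
    )


def looks_split(a, b):
    """True when the two views disagree on recovery master and generation."""
    return a["recmaster"] != b["recmaster"] and a["generation"] != b["generation"]


node1 = parse_status(NODE1_STATUS)
node2 = parse_status(NODE2_STATUS)
print("node 1 sees peer disconnected:", sees_peer_disconnected(node1))
print("node 2 sees peer disconnected:", sees_peer_disconnected(node2))
print("views look split:", looks_split(node1, node2))
```

Differing generations alone can be transient during a recovery; it is the combination with two different recovery masters that matches the situation described here.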

Node 1 Log
2014/07/03 16:07:59.033170 [33243]: Starting CTDBD as pid : 33243
2014/07/03 16:07:59.036903 [33243]: Vacuuming is disabled for persistent database group_mapping.tdb
2014/07/03 16:07:59.040167 [33243]: Vacuuming is disabled for persistent database account_policy.tdb
2014/07/03 16:07:59.043457 [33243]: Vacuuming is disabled for persistent database share_info.tdb
2014/07/03 16:07:59.046547 [33243]: Vacuuming is disabled for persistent database secrets.tdb
2014/07/03 16:07:59.049848 [33243]: Vacuuming is disabled for persistent database registry.tdb
2014/07/03 16:07:59.052966 [33243]: Vacuuming is disabled for persistent database passdb.tdb
2014/07/03 16:07:59.053005 [33243]: Freeze priority 1
2014/07/03 16:07:59.053378 [33243]: Freeze priority 2
2014/07/03 16:07:59.053602 [33243]: Freeze priority 3
2014/07/03 16:07:59.229670 [33243]: Freeze priority 1
2014/07/03 16:07:59.229780 [33243]: Freeze priority 2
2014/07/03 16:07:59.229863 [33243]: Freeze priority 3
2014/07/03 16:07:59.247015 [33243]: Set DeterministicIPs to 0
2014/07/03 16:07:59.253600 [33243]: Set NoIpFailback to 1
2014/07/03 16:08:03.235484 [33287]: Taking out recovery lock from recovery daemon
2014/07/03 16:08:03.235584 [33287]: Take the recovery lock
2014/07/03 16:08:03.236070 [33287]: Recovery lock taken successfully
2014/07/03 16:08:03.236198 [33287]: Recovery lock taken successfully by recovery daemon
2014/07/03 16:08:03.237080 [33243]: Freeze priority 1
2014/07/03 16:08:03.237189 [33243]: Freeze priority 2
2014/07/03 16:08:03.237274 [33243]: Freeze priority 3
2014/07/03 16:08:03.424076 [33243]: Thawing priority 1
2014/07/03 16:08:03.424117 [33243]: Release freeze handler for prio 1
2014/07/03 16:08:03.424147 [33243]: Thawing priority 2
2014/07/03 16:08:03.424160 [33243]: Release freeze handler for prio 2
2014/07/03 16:08:03.424184 [33243]: Thawing priority 3
2014/07/03 16:08:03.424195 [33243]: Release freeze handler for prio 3
2014/07/03 16:08:03.748739 [33287]: Resetting ban count to 0 for all nodes
2014/07/03 16:08:14.760888 [33287]: Trigger takeoverrun
2014/07/03 16:08:18.574646 [33243]: Starting SMB services: [  OK  ]
2014/07/03 16:08:18.575198 [33243]: Register srvid 18302628885633695744 for client 65746
2014/07/03 16:08:18.575789 [33243]: Deregister srvid 18302628885633695744 for client 65746
2014/07/03 16:08:18.588310 [33243]: Register srvid 18302628885633695744 for client 65746
2014/07/03 16:08:18.591688 [33243]: Deregister srvid 18302628885633695744 for client 65746
2014/07/03 16:08:18.936008 [33287]: Trigger takeoverrun
2014/07/03 16:08:20.288537 [33287]: Trigger takeoverrun
2014/07/03 16:08:23.891691 [33243]: Node became HEALTHY. Ask recovery master 0 to perform ip reallocation
2014/07/03 16:10:39.962127 [33287]: client/ctdb_client.c:759 control timed out. reqid:67831 opcode:80 dstnode:1
2014/07/03 16:10:39.962203 [33287]: client/ctdb_client.c:870 ctdb_control_recv failed
2014/07/03 16:10:39.962219 [33287]: Async operation failed with state 3, opcode:80
2014/07/03 16:10:39.962235 [33287]: Async wait failed - fail_count=1
2014/07/03 16:10:39.962251 [33287]: server/ctdb_recoverd.c:251 Failed to read node capabilities.
2014/07/03 16:10:39.962264 [33287]: server/ctdb_recoverd.c:3041 Unable to update node capabilities.
2014/07/03 16:11:00.984133 [33287]: client/ctdb_client.c:759 control timed out. reqid:67841 opcode:80 dstnode:1
2014/07/03 16:11:00.984201 [33287]: client/ctdb_client.c:870 ctdb_control_recv failed
2014/07/03 16:11:00.984217 [33287]: Async operation failed with state 3, opcode:80
2014/07/03 16:11:00.984234 [33287]: Async wait failed - fail_count=1
2014/07/03 16:11:00.984285 [33287]: server/ctdb_recoverd.c:251 Failed to read node capabilities.
2014/07/03 16:11:00.984301 [33287]: server/ctdb_recoverd.c:3041 Unable to update node capabilities.
2014/07/03 16:11:04.261771 [33287]: ctdb_control error: 'node is disconnected'
2014/07/03 16:11:04.261821 [33287]: ctdb_control error: 'node is disconnected'
2014/07/03 16:11:04.261841 [33287]: Async operation failed with ret=-1 res=-1 opcode=80
2014/07/03 16:11:04.261854 [33287]: Async wait failed - fail_count=1
2014/07/03 16:11:04.261884 [33287]: server/ctdb_recoverd.c:251 Failed to read node capabilities.
2014/07/03 16:11:04.261896 [33287]: server/ctdb_recoverd.c:3041 Unable to update node capabilities.
2014/07/03 16:11:04.261920 [33287]: client/ctdb_client.c:706 reqid 67841 not found
2014/07/03 16:11:04.261947 [33287]: client/ctdb_client.c:706 reqid 67831 not found
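
To pull the timed-out controls out of a longer log, something like the sketch below could be used. It assumes the log line layout shown above (timestamp, pid in brackets, message); `control_timeouts` is a hypothetical helper name, and the sample lines are copied from the Node 1 log.

```python
import re

# Matches lines such as:
# 2014/07/03 16:10:39.962127 [33287]: client/ctdb_client.c:759 control timed out. reqid:67831 opcode:80 dstnode:1
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) \[\s*\d+\]: "
    r".*control timed out\. reqid:(?P<reqid>\d+) "
    r"opcode:(?P<opcode>\d+) dstnode:(?P<dst>\d+)"
)

SAMPLE_LOG = """\
2014/07/03 16:10:39.962127 [33287]: client/ctdb_client.c:759 control timed out. reqid:67831 opcode:80 dstnode:1
2014/07/03 16:10:39.962203 [33287]: client/ctdb_client.c:870 ctdb_control_recv failed
2014/07/03 16:11:00.984133 [33287]: client/ctdb_client.c:759 control timed out. reqid:67841 opcode:80 dstnode:1
"""


def control_timeouts(log_text):
    """Return (timestamp, reqid, opcode, dstnode) for each timed-out control."""
    hits = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.match(line.strip())
        if m:
            hits.append(
                (m.group("ts"), int(m.group("reqid")),
                 int(m.group("opcode")), int(m.group("dst")))
            )
    return hits


for ts, reqid, opcode, dst in control_timeouts(SAMPLE_LOG):
    print(f"{ts}: opcode {opcode} to node {dst} timed out (reqid {reqid})")
```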


Node 2 Log
2014/07/03 16:10:15.590428 [17182]: Starting CTDBD as pid : 17182
2014/07/03 16:10:15.594254 [17182]: Vacuuming is disabled for persistent database account_policy.tdb
2014/07/03 16:10:15.597875 [17182]: Vacuuming is disabled for persistent database registry.tdb
2014/07/03 16:10:15.601015 [17182]: Vacuuming is disabled for persistent database secrets.tdb
2014/07/03 16:10:15.604113 [17182]: Vacuuming is disabled for persistent database share_info.tdb
2014/07/03 16:10:15.607215 [17182]: Vacuuming is disabled for persistent database passdb.tdb
2014/07/03 16:10:15.610304 [17182]: Vacuuming is disabled for persistent database group_mapping.tdb
2014/07/03 16:10:15.610342 [17182]: Freeze priority 1
2014/07/03 16:10:15.610689 [17182]: Freeze priority 2
2014/07/03 16:10:15.610959 [17182]: Freeze priority 3
2014/07/03 16:10:15.787984 [17182]: Freeze priority 1
2014/07/03 16:10:15.788078 [17182]: Freeze priority 2
2014/07/03 16:10:15.788162 [17182]: Freeze priority 3


System Details:
Red Hat 6.5

Nodes:
10.0.0.1
10.0.0.2
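
Since the nodes lose sight of each other, one basic thing to rule out is TCP reachability between the private addresses. A minimal sketch, assuming ctdb listens on its default TCP port 4379 on these addresses; `can_connect` is a hypothetical helper, not a ctdb tool.

```python
import socket

CTDB_PORT = 4379  # ctdb's default TCP port; an assumption if the port was changed


def can_connect(host, port=CTDB_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run on Node 1 as `can_connect("10.0.0.2")` and on Node 2 as `can_connect("10.0.0.1")` while ctdb is running on both. A successful connect only shows that the port is reachable at that moment, not that the nodes stay connected.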

public_addresses:
10.98.81.2/24 bond0

Ctdb:
CTDB_RECOVERY_LOCK=/mnt/media23/.ctdb_lock/lock.file
CTDB_DEBUGLEVEL=ERR
CTDB_MANAGES_SAMBA=yes
CTDB_PUBLIC_INTERFACE=bond0
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_SET_NoIpFailback=1
CTDB_SET_DeterministicIPs=0


The recovery lock file is on a StorNext file system.
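
The recovery lock only prevents split brain if the shared file system provides coherent fcntl byte-range locks across nodes. The sketch below is one rough way to check that, assuming Python is available on both nodes; `try_exclusive_lock` is a hypothetical helper, and it should be pointed at a scratch file on the same file system rather than at the real lock file.

```python
import fcntl
import os


def try_exclusive_lock(path):
    """Try a non-blocking exclusive fcntl lock on path.

    Returns the open file descriptor if the lock was obtained (the lock is
    held until the descriptor is closed or the process exits), or None if
    another process already holds a conflicting lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd
```

Call it on Node 1 and keep that process alive, then call it with the same path on Node 2. If both calls return a descriptor, the file system is not arbitrating the lock between the nodes, which would allow both of them to become recovery master.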


Any help would be appreciated.

Cheers


Axel





