[Samba] CTDB Debug Help

John P Janosik jpjanosi at us.ibm.com
Wed Feb 26 10:25:52 MST 2014


Hello,

I've got a two-node CTDB/Samba cluster and I'm having trouble re-adding a 
node after an OS reload on it.  The servers run CTDB 2.5.1 and Samba 4.1.4 
on AIX 7.1 TL2.  The CTDB databases and the Samba service work fine on the 
node that was not reloaded, but the rebuilt node fails to rejoin the 
cluster.  I'm looking for pointers on what to look for in the ctdb log, or 
which log levels I should raise to find the cause.
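For reference, my understanding from the ctdb(1) man page is that verbosity 
can be raised on the running daemon with setdebug; I have not yet tried 
this on the cluster, so treat it as a sketch:

```shell
# Sketch: raise CTDB log verbosity at runtime.  Guarded so it is harmless
# on a host without the ctdb client installed.
if command -v ctdb >/dev/null 2>&1; then
    ctdb getdebug               # show the current level on this node
    ctdb setdebug INFO          # more detail; DEBUG is the most verbose
    ctdb -n 1 setdebug DEBUG    # target the rebuilt node (pnn 1)
else
    echo "ctdb client not in PATH on this host"
fi
```

At startup the level can apparently also be set with CTDB_DEBUGLEVEL in 
the ctdb sysconfig file.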

On the working node I can start CTDB and have it go into OK status:

> ctdb status
Number of nodes:2
pnn:0 192.168.20.1     OK (THIS NODE)
pnn:1 192.168.20.2     DISCONNECTED|UNHEALTHY|INACTIVE
Generation:349517425
Size:1
hash:0 lmaster:0
Recovery mode:NORMAL (0)
Recovery master:0

The ctdb.log has:

2014/02/26 11:58:55.637024 [12452056]: Recovery lock file set to "". 
Disabling recovery lock checking
2014/02/26 11:58:55.648102 [6553670]: Starting CTDBD (Version 2.5.1) as 
PID: 6553670
2014/02/26 11:58:55.768838 [6553670]: Vacuuming is disabled for persistent 
database account_policy.tdb
2014/02/26 11:58:55.817376 [6553670]: Vacuuming is disabled for persistent 
database group_mapping.tdb
2014/02/26 11:58:55.892066 [6553670]: Vacuuming is disabled for persistent 
database passdb.tdb
2014/02/26 11:58:55.939510 [6553670]: Vacuuming is disabled for persistent 
database registry.tdb
2014/02/26 11:58:55.986739 [6553670]: Vacuuming is disabled for persistent 
database secrets.tdb
2014/02/26 11:58:56.034569 [6553670]: Vacuuming is disabled for persistent 
database share_info.tdb
2014/02/26 11:58:56.034715 [6553670]: Freeze priority 1
2014/02/26 11:58:56.040888 [6553670]: Freeze priority 2
2014/02/26 11:58:56.043987 [6553670]: Freeze priority 3
2014/02/26 11:58:56.437970 [6553670]: Freeze priority 1
2014/02/26 11:58:56.438152 [6553670]: Freeze priority 2
2014/02/26 11:58:56.438292 [6553670]: Freeze priority 3
2014/02/26 11:58:59.440570 [recoverd:12320780]: Trigger takeoverrun
2014/02/26 11:58:59.442438 [6553670]: Freeze priority 1
2014/02/26 11:58:59.442598 [6553670]: Freeze priority 2
2014/02/26 11:58:59.442742 [6553670]: Freeze priority 3
2014/02/26 11:59:00.186508 [6553670]: Thawing priority 1
2014/02/26 11:59:00.186610 [6553670]: Release freeze handler for prio 1
2014/02/26 11:59:00.186691 [6553670]: Thawing priority 2
2014/02/26 11:59:00.186737 [6553670]: Release freeze handler for prio 2
2014/02/26 11:59:00.186793 [6553670]: Thawing priority 3
2014/02/26 11:59:00.186838 [6553670]: Release freeze handler for prio 3
2014/02/26 11:59:00.554819 [recoverd:12320780]: Resetting ban count to 0 
for all nodes
2014/02/26 11:59:16.834495 [6553670]: Node became HEALTHY. Ask recovery 
master 0 to perform ip reallocation


When I start ctdbd on the rebuilt node I see the following at the 
beginning of the ctdb log:

2014/02/26 12:10:28.018036 [11141228]: Recovery lock file set to "". 
Disabling recovery lock checking
2014/02/26 12:10:28.030003 [10027232]: Starting CTDBD (Version 2.5.1) as 
PID: 10027232
2014/02/26 12:10:28.152649 [10027232]: Vacuuming is disabled for 
persistent database account_policy.tdb
2014/02/26 12:10:28.204170 [10027232]: Vacuuming is disabled for 
persistent database group_mapping.tdb
2014/02/26 12:10:28.255586 [10027232]: Vacuuming is disabled for 
persistent database passdb.tdb
2014/02/26 12:10:28.307094 [10027232]: Vacuuming is disabled for 
persistent database registry.tdb
2014/02/26 12:10:28.358609 [10027232]: Vacuuming is disabled for 
persistent database secrets.tdb
2014/02/26 12:10:28.410227 [10027232]: Vacuuming is disabled for 
persistent database share_info.tdb
2014/02/26 12:10:28.410334 [10027232]: Freeze priority 1
2014/02/26 12:10:28.414451 [10027232]: Freeze priority 2
2014/02/26 12:10:28.417383 [10027232]: Freeze priority 3
2014/02/26 12:10:30.173294 [10027232]: Freeze priority 1
2014/02/26 12:10:30.173525 [10027232]: Freeze priority 2
2014/02/26 12:10:30.173710 [10027232]: Freeze priority 3
2014/02/26 12:10:33.500359 [10027232]: Freeze priority 1
2014/02/26 12:10:33.639712 [10027232]: Freeze priority 2
2014/02/26 12:10:33.677081 [recoverd:11272374]: 
server/ctdb_recoverd.c:3699 Current recmaster node 0 does not have 
CAP_RECMASTER, but we (node 1) have - force an election
2014/02/26 12:10:33.677186 [10027232]: Freeze priority 1
2014/02/26 12:10:33.677336 [10027232]: Freeze priority 2
2014/02/26 12:10:33.677491 [10027232]: Freeze priority 3
2014/02/26 12:10:33.789645 [10027232]: Freeze priority 3
2014/02/26 12:10:35.132212 [10027232]: common/ctdb_ltdb.c:211 Failed to 
store dynamic data
2014/02/26 12:10:35.301494 [10027232]: 
/usr/smb_cluster/var/lib/ctdb/persistent/passdb.tdb.1:tdb_wipe_all: failed 
to write hash 0
2014/02/26 12:10:35.301555 [10027232]: server/ctdb_freeze.c:470 Failed to 
wipe database for db 'passdb.tdb'
2014/02/26 12:10:37.310573 [10027232]: Freeze priority 1
2014/02/26 12:10:37.311101 [10027232]: Freeze priority 2
2014/02/26 12:10:37.311576 [10027232]: Freeze priority 3
2014/02/26 12:10:37.452567 [10027232]: Failed to open event script 
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:10:37.452633 [10027232]: server/ctdb_recover.c:1006 Failed 
to start recovery
2014/02/26 12:10:37.452703 [10027232]: pnn 1 Invalid reqid 0 in 
ctdb_reply_control
2014/02/26 12:10:40.177172 [10027232]: read error alloc failed for 64
2014/02/26 12:11:37.459277 [10027232]: Freeze priority 1
2014/02/26 12:11:37.459742 [10027232]: Freeze priority 2
2014/02/26 12:11:37.460226 [10027232]: Freeze priority 3
2014/02/26 12:11:37.583899 [10027232]: Failed to open event script 
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:11:37.583952 [10027232]: server/ctdb_recover.c:1006 Failed 
to start recovery
2014/02/26 12:11:37.584015 [10027232]: pnn 1 Invalid reqid 0 in 
ctdb_reply_control
2014/02/26 12:12:37.592541 [10027232]: Freeze priority 1
2014/02/26 12:12:37.593051 [10027232]: Freeze priority 2
2014/02/26 12:12:37.593522 [10027232]: Freeze priority 3
2014/02/26 12:12:37.728160 [10027232]: Failed to open event script 
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:12:37.728215 [10027232]: server/ctdb_recover.c:1006 Failed 
to start recovery
2014/02/26 12:12:37.728280 [10027232]: pnn 1 Invalid reqid 0 in 
ctdb_reply_control
2014/02/26 12:13:37.735576 [10027232]: Freeze priority 1
2014/02/26 12:13:37.736043 [10027232]: Freeze priority 2
2014/02/26 12:13:37.736500 [10027232]: Freeze priority 3
2014/02/26 12:13:37.856533 [10027232]: Failed to open event script 
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:13:37.856639 [10027232]: server/ctdb_recover.c:1006 Failed 
to start recovery
2014/02/26 12:13:37.856705 [10027232]: pnn 1 Invalid reqid 0 in 
ctdb_reply_control
2014/02/26 12:14:37.858397 [10027232]: Banning this node for 300 seconds
2014/02/26 12:14:38.033523 [10027232]: server/ctdb_monitor.c:485 Node 0 
became healthy - force recovery for startup
2014/02/26 12:14:38.581971 [recoverd:11272374]: Refusing to disable 
takeover runs on inactive node
2014/02/26 12:14:38.702472 [10027232]: Failed to open event script 
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:14:38.702524 [10027232]: Failed to run "ipreallocated" event 

2014/02/26 12:15:37.737758 [10027232]: server/ctdb_recover.c:562 Been in 
recovery mode for too long. Dropping all IPS

Back on the first node I see the following appended to the log:

2014/02/26 12:10:33.364973 [6553670]: Freeze priority 1
2014/02/26 12:10:33.515860 [6553670]: Freeze priority 2
2014/02/26 12:10:33.657076 [6553670]: Freeze priority 3
2014/02/26 12:10:35.301189 [recoverd:12320780]: Async operation failed 
with ret=0 res=-1 opcode=67
2014/02/26 12:10:35.301275 [recoverd:12320780]: Async wait failed - 
fail_count=1
2014/02/26 12:10:35.301327 [recoverd:12320780]: 
server/ctdb_recoverd.c:1468 Unable to wipe database. Recovery failed.
2014/02/26 12:10:35.301968 [recoverd:12320780]: 
server/ctdb_recoverd.c:1915 Failed to recover database 0x3ef19640

I'm not sure which of these messages is important to look at.  The event 
script directory does exist and contains only 00.ctdb.  I'm not using CTDB 
to manage IP addresses, as I have an IP load balancer (similar to LVS) in 
front of the servers.  Should I be focusing on the error handling the 
ipreallocated event, or on the error regarding passdb.tdb earlier in the 
log?  I have run a few clusters with this setup for a few years, but this 
is the first cluster at CTDB 2.5.1 and Samba 4.1.x.  I don't think I've 
ever had to rebuild a node before, so I can't say whether my older 
clusters have this issue or not.
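To narrow it down I plan to sanity-check the two things the log complains 
about: whether the ctdbd user can actually open the event script 
directory, and whether the persistent passdb copy is corrupt.  A rough 
sketch, with the paths taken from the log above (tdbtool being the 
standard TDB inspection tool; I have not run this yet):

```shell
# Sketch: check the two failures from the log.  EVENTS_DIR and TDB are the
# paths reported above; both checks degrade gracefully if run elsewhere.
EVENTS_DIR="/usr/smb_cluster/etc/ctdb/events.d"
TDB="/usr/smb_cluster/var/lib/ctdb/persistent/passdb.tdb.1"

# 1. Can this uid open and search the event script directory?
if [ -d "$EVENTS_DIR" ] && [ -r "$EVENTS_DIR" ] && [ -x "$EVENTS_DIR" ]; then
    echo "events.d accessible:"; ls -l "$EVENTS_DIR"
else
    echo "events.d NOT accessible from uid $(id -u): $EVENTS_DIR"
fi

# 2. Does the persistent passdb copy pass an integrity check?
if command -v tdbtool >/dev/null 2>&1 && [ -f "$TDB" ]; then
    tdbtool "$TDB" check
else
    echo "skipping tdb check (tdbtool or $TDB not present)"
fi
```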


John Janosik
jpjanosi at us.ibm.com 