[Samba] smbstatus hang with CTDB 2.5.4 and Samba 4.1.13
John P Janosik
jpjanosi at us.ibm.com
Wed Oct 29 18:32:21 MDT 2014
Amitay Isaacs <amitay at gmail.com> wrote on 10/29/2014 06:16:27 PM:
>
> On Thu, Oct 30, 2014 at 4:48 AM, John P Janosik <jpjanosi at us.ibm.com> wrote:
> Can anyone help with some pointers to debug a problem with Samba and CTDB with smbstatus traversing the connections tdb? I've got a new two-node cluster with Samba and CTDB on AIX. If I run smbstatus when the server has much user activity, it hangs and the node it was run on gets banned. I see the following in the ctdb log:
>
> 2014/10/29 11:12:45.374580 [3932342]: server/ctdb_traverse.c:644 Traverse cancelled by client disconnect for database:0x6b06a26d
> 2014/10/29 11:12:45.429592 [3932342]: Recovery daemon ping timeout. Count : 0
> 2014/10/29 11:12:45.430655 [3932342]: Handling event took 195 seconds!
> 2014/10/29 11:12:45.452636 [3932342]: pnn 1 Invalid reqid 220668 in ctdb_reply_control
> 2014/10/29 11:12:48.462334 [recoverd:6488266]: server/ctdb_recoverd.c:3990 Remote node:0 has different flags for node 1. It has 0x02 vs our 0x00
> 2014/10/29 11:12:48.462448 [recoverd:6488266]: Use flags 0x00 from local recmaster node for cluster update of node 1 flags
> 2014/10/29 11:12:48.483362 [3932342]: Freeze priority 1
> 2014/10/29 11:12:58.574548 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:13:08.569593 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:13:18.604124 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:13:28.638279 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:13:38.653236 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:13:48.454994 [3932342]: Recovery daemon ping timeout. Count : 0
> 2014/10/29 11:13:48.484659 [recoverd:6488266]: ctdb_control error: 'ctdb_control timed out'
> 2014/10/29 11:13:48.484723 [recoverd:6488266]: ctdb_control error: 'ctdb_control timed out'
> 2014/10/29 11:13:48.484782 [recoverd:6488266]: Async operation failed with ret=-1 res=-1 opcode=33
> 2014/10/29 11:13:48.484837 [recoverd:6488266]: Failed to freeze node 1 during recovery. Set it as ban culprit for 2 credits
> 2014/10/29 11:13:48.484890 [recoverd:6488266]: Async wait failed - fail_count=1
> 2014/10/29 11:13:48.485047 [recoverd:6488266]: server/ctdb_recoverd.c:410 Unable to freeze nodes. Recovery failed.
> 2014/10/29 11:13:48.485095 [recoverd:6488266]: server/ctdb_recoverd.c:1881 Unable to set recovery mode to active on cluster
> 2014/10/29 11:13:48.503273 [3932342]: Freeze priority 1
> 2014/10/29 11:13:48.688292 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:13:58.704454 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:14:08.739136 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:14:18.753675 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:14:28.788835 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:14:38.824317 [3932342]: /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock: not found
> 2014/10/29 11:14:48.486128 [3932342]: Recovery daemon ping timeout. Count : 0
> 2014/10/29 11:14:48.504242 [recoverd:6488266]: ctdb_control error: 'ctdb_control timed out'
> 2014/10/29 11:14:48.504298 [recoverd:6488266]: ctdb_control error: 'ctdb_control timed out'
> 2014/10/29 11:14:48.504357 [recoverd:6488266]: Async operation failed with ret=-1 res=-1 opcode=33
> 2014/10/29 11:14:48.504411 [recoverd:6488266]: Failed to freeze node 1 during recovery. Set it as ban culprit for 2 credits
> 2014/10/29 11:14:48.504463 [recoverd:6488266]: Async wait failed - fail_count=1
> 2014/10/29 11:14:48.504507 [recoverd:6488266]: server/ctdb_recoverd.c:410 Unable to freeze nodes. Recovery failed.
> 2014/10/29 11:14:48.504552 [recoverd:6488266]: server/ctdb_recoverd.c:1881 Unable to set recovery mode to active on cluster
> 2014/10/29 11:14:48.506682 [3932342]: Banning this node for 300 seconds
>
> The tdb mentioned in the first log message is:
>
> ctdb getdbmap | grep 0x6b06a26d
> dbid:0x6b06a26d name:smbXsrv_session_global.tdb path:/usr/smb_cluster/var/lib/ctdb/smbXsrv_session_global.tdb.1
>
> I was going to start by updating /usr/smb_cluster/etc/ctdb/debug_locks.sh, since I don't think flock is available on AIX. I'm hoping the ctdb log will then contain more information to help debug the problem. Any other ideas on how to debug this?
>
> These logs are incomplete. They do not include the traverse messages (Starting/Ending traverse) or why the node became unhealthy. Without complete logs it's difficult to figure out what's happening. However, CTDB is definitely busy. If you have lots of processes, maybe processes are getting swapped to disk. AIX boxes are usually configured with a small amount of memory.
>
> 2014/10/29 11:12:45.430655 [3932342]: Handling event took 195 seconds!
>
> The node is getting banned because it could not be frozen. You cannot use the shipped debug_locks.sh script on AIX. That script uses /proc/locks to extract information about current locks, and there is no /proc/locks on AIX. I do not know if there is any other way of finding information about processes holding fcntl locks.
>
> Amitay.
Hi Amitay -
The AIX instances have 32GB of memory, with 4GB dedicated to an AFS filesystem cache. At peak times there are about 100 smbd processes on each node, but the clients aren't that active. I have noticed that a subset of the clients make successful NetBIOS session requests on port 139 and then do nothing more; these clients do this every minute. Perhaps this is driving load on the tdb that smbstatus is trying to traverse. There doesn't appear to be a paging problem according to the nmon data collected on the servers, but I can give them more memory as a test. I didn't see anything else in the logs at the time smbstatus was run; here is the complete log from startup until ctdb was killed.
I'm going to try duplicating the problem with an increased ctdb log level to see whether the starting/ending traverse messages show up.
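For the reproduction run, something along these lines should raise the level without a restart (hedged: ctdb setdebug/getdebug exist in this CTDB era, but check ctdb help for the level names your build accepts; the echo prefix is a placeholder so the sketch runs off-cluster):

```shell
# Sketch: bump CTDB's debug level while reproducing the smbstatus hang.
# "echo" keeps this runnable anywhere; drop it on the actual cluster node.
echo ctdb getdebug          # record the current level so it can be restored
echo ctdb setdebug INFO     # raise verbosity (DEBUG is noisier still)
# ... reproduce the hang with smbstatus, collect the log ...
echo ctdb setdebug NOTICE   # restore the recorded level (NOTICE as an example)
```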
Thanks,
John