[Samba] Intermittent Event Script Timeouts on CTDB Cluster Nodes

Fri Dec 12 12:39:05 MST 2014

Hi All,

I've got a CTDB cluster, managing NFSv3 and Samba, sitting in front of a GPFS storage cluster.  The NFSv3 piece is carrying some pretty heavy traffic at peak load.  About once every three to four days, CTDB has been exhibiting behaviors that result in IP-failover between two nodes for reasons that are currently unknown.  The exact chain of events has been a little different each time this has happened, so a comprehensive summary is difficult.  However, I will attempt to present the highlights below:

1)  CTDB begins to flap on an affected cluster node.  Clients connected to that node see the NFS server not responding.

2)  After the fail-over, the CTDB log on the affected node is full of complaints about event scripts timing out.  The first script to time out *seems* always to be `61.nfstickle` or rpcinfo itself (perhaps rpcinfo running under the authority of the CTDB event scripts), followed by timing out of event scripts related to releasing IPs for takeover.  Additionally, around the event, we see the IP-receiving peer logging a lot of errors about CTDB control traffic timing out.

3)  Fail-back is attended by similar difficulties.  During the last fail-back procedure (12/09), clients experienced instability (mounts not responding, bizarre permissions errors) while the CTDB hosts continued to complain in their logs about timed-out event scripts (related to IP take or release).  Finally, CTDB seems to get fed up and restarts statd and nfsd, and all goes back to normal.

4)  During two of these failure events, CTDB on one of the nodes has actually *died* and had to be restarted for fail-back to occur.

So, in broad strokes, those are the kinds of events that I've been seeing in this cluster.  My theory about the cause had *previously* centered on load-induced conditions.  While this is still a possibility, digging in the logs and config files has led me to develop another theory: namely, that a misconfiguration of statd is causing monitored clients not to appear in shared storage, which in turn causes fatal confusion during some failover events.  This theory would postulate that the problematic failovers follow the reboot of some client on the network, while the successful failovers happen after the connections of all rebooted clients have been reset.  The specific configuration option that makes me think statd is misconfigured is this one from /etc/sysconfig/nfs:

"""
STATD_HOSTNAME="$NFS_HOSTNAME -H /etc/ctdb/statd-callout -p 97"
"""

...I notice that the -P parameter is missing from this string, which is described in `man rpc.statd` as follows:

"""
       -P, --state-directory-path pathname
              Specifies the pathname of the parent directory where NSM state information resides.  If this option is not specified, rpc.statd uses /var/lib/nfs/statd by default.
"""

...Also, I know that this parameter string is getting passed to the actual statd invocation, along with an extraneous port specifier, because these messages also appear in /var/log/log.ctdb:

"""
ERROR: STATD is not responding. Trying to restart it. [rpc.statd  -n myservice.tld -H /etc/ctdb/statd-callout -p 97 -p 595 -o 596]
"""
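
To make the point about the option string concrete, here is a minimal Python sketch that pulls the rpc.statd argument list out of the log message quoted above and reports which short options are repeated and whether a state-directory option is present.  The helper names (`statd_args`, `summarize`) are made up for illustration, and the sketch only inspects the quoted text; it says nothing about which of the repeated options rpc.statd actually honors.

```python
import re
import shlex
from collections import Counter

# Log message copied verbatim from /var/log/log.ctdb as quoted above.
LOG_LINE = (
    "ERROR: STATD is not responding. Trying to restart it. "
    "[rpc.statd  -n myservice.tld -H /etc/ctdb/statd-callout -p 97 -p 595 -o 596]"
)


def statd_args(log_line):
    """Return the arguments of the rpc.statd command quoted in square brackets."""
    match = re.search(r"\[(rpc\.statd[^\]]*)\]", log_line)
    if match is None:
        return []
    # shlex.split collapses the repeated spaces; [1:] drops "rpc.statd" itself.
    return shlex.split(match.group(1))[1:]


def summarize(args):
    """Report repeated short options and whether -P / --state-directory-path is set."""
    flags = [a for a in args if re.fullmatch(r"-[A-Za-z]", a)]
    counts = Counter(flags)
    repeated = sorted(flag for flag, n in counts.items() if n > 1)
    has_state_dir = "-P" in counts or "--state-directory-path" in args
    return {"repeated": repeated, "has_state_dir": has_state_dir}


args = statd_args(LOG_LINE)
report = summarize(args)
print(args)
print(report)
```

For the message above, this flags `-p` as given twice (97 and 595) and finds no `-P` option, which matches what I see by eye.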

...Looking in /var/lib/nfs/statd, I do see some clients listed in that directory.  However, /etc/sysconfig/nfs also has the following variable definition:

"""
STATD_SHARED_DIRECTORY=/gs/var/nfs/rfs_shared
"""

So, I'm now wondering if statd is looking in different places at different times for clients to monitor and, in some cases, IP-receiving peers are not able to update their lists of monitored nodes.
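
One way to check this suspicion would be to compare the entries under the two locations on each node.  The following Python sketch does only that: it lists two directories and reports which entry names appear in one but not the other.  The function names are hypothetical, and whether statd keeps its monitor list directly in these directories or in a subdirectory beneath them is an assumption that would need to be verified on the actual nodes before trusting the output.

```python
import os


def monitored_clients(directory):
    """Return the set of entry names in a directory, or an empty set if unreadable."""
    try:
        return set(os.listdir(directory))
    except OSError:
        # Missing or unreadable directory: treat as "no clients recorded here".
        return set()


def compare_monitor_lists(local_dir, shared_dir):
    """Compare entry names found in the local and shared state directories."""
    local = monitored_clients(local_dir)
    shared = monitored_clients(shared_dir)
    return {
        "only_local": sorted(local - shared),
        "only_shared": sorted(shared - local),
        "both": sorted(local & shared),
    }


if __name__ == "__main__":
    # Paths taken from this message; adjust them (for example to a
    # subdirectory) if the monitor list lives elsewhere on your system.
    result = compare_monitor_lists("/var/lib/nfs/statd", "/gs/var/nfs/rfs_shared")
    for key, names in result.items():
        print(key, names)
```

If my theory holds, I would expect clients to show up under `only_local` on the node that was serving them, since nothing would have copied them to shared storage for the IP-receiving peer to see.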

Additionally, I'm wondering if anyone on this list has had a similar experience with CTDB.  Also, I'm wondering what the list makes of my current theory regarding the cause of these problems, or if anyone would like to advance an alternate theory if my own is not sound.

Thank you so much for all of your help!

Stewart Howard