CTDB fail-over fails, NFS only, glusterfs backend, for XenServer VMs

Martin Gombac martin at isg.si
Wed Feb 1 06:50:38 MST 2012


Hi All,

I have a problem with my 2-node CTDB NFS cluster: failover breaks when 
one node has to take over the other node's resources.

NFS HA only; 2 nodes for now, will grow later.
CentOS 6.2: 2.6.32-220.4.1.el6.x86_64
CTDB: 1.0.114.3
Gluster 3.2.5 for shared storage.
XenServer 5.6 SP2 as the NFS client (RHEL 5.x based).

Once both nodes are up, the NFS service works through either of the 
two nodes concurrently (round-robin DNS). But once I shut down one NFS 
storage node, everything fails. :-)
Here is what the log on s2 says when I shut down node s3 and keep s2 up.
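(The test itself is nothing fancy; roughly this, assuming CTDB logs to 
its default /var/log/log.ctdb on these boxes:

  # on s3: fail the node
  shutdown -h now

  # on s2: watch CTDB react to the node going away
  tail -f /var/log/log.ctdb
  ctdb status
)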


2012/02/01 15:08:03.219127 [ 1983]: server/ctdb_recoverd.c:3295 The 
vnnmap count is different from the number of active nodes. 2 vs 1
2012/02/01 15:08:03.219159 [ 1983]: Taking out recovery lock from 
recovery daemon
2012/02/01 15:08:03.219166 [ 1983]: Take the recovery lock
2012/02/01 15:08:03.223519 [ 1983]: Recovery lock taken successfully
2012/02/01 15:08:03.223552 [ 1983]: Recovery lock taken successfully by 
recovery daemon
2012/02/01 15:08:03.223610 [ 1933]: Freeze priority 1
2012/02/01 15:08:03.223735 [ 1933]: Freeze priority 2
2012/02/01 15:08:03.223836 [ 1933]: Freeze priority 3
2012/02/01 15:08:03.307360 [ 1933]: Thawing priority 1
2012/02/01 15:08:03.307379 [ 1933]: Release freeze handler for prio 1
2012/02/01 15:08:03.307389 [ 1933]: Thawing priority 2
2012/02/01 15:08:03.307394 [ 1933]: Release freeze handler for prio 2
2012/02/01 15:08:03.307401 [ 1933]: Thawing priority 3
2012/02/01 15:08:03.307406 [ 1933]: Release freeze handler for prio 3
2012/02/01 15:08:03.402667 [ 1933]: Sending NFS tickle ack for 
10.31.0.41 to 10.31.0.21:980
2012/02/01 15:08:03.405015 [ 1933]: Sending NFS tickle ack for 
10.31.0.41 to 10.31.0.22:1018
2012/02/01 15:08:03.506316 [ 1983]: Resetting ban count to 0 for all nodes
2012/02/01 15:08:14.508640 [ 1983]: Trigger takeoverrun
2012/02/01 15:08:16.267190 [ 1933]: rpcinfo: RPC: Program not registered
2012/02/01 15:08:16.272316 [ 1933]: ERROR: STATD is not responding. 
Trying to restart it. [rpc.statd  -n stor.v.c.XX.XX -H 
/etc/ctdb/statd-callout -p 97 -p 595 -o 596]
2012/02/01 15:08:46.177912 [ 1933]: Event script timed out : 60.nfs 
monitor  count : 0  pid : 14718
2012/02/01 15:08:46.177957 [ 1933]: server/eventscript.c:539 Sending 
SIGTERM to child pid:14718
2012/02/01 15:08:46.178012 [14718]: Timed out running script 
'/etc/ctdb/events.d/60.nfs monitor ' after 29.9 seconds pid :14718
2012/02/01 15:08:46.193493 [14718]: Logged timedout eventscript : { 
pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } 
 >/tmp/ctdb.event.20120201150846.14718
2012/02/01 15:09:01.290595 [ 1933]: rpcinfo: RPC: Timed out
2012/02/01 15:09:01.290757 [ 1933]: ERROR: NFS not responding to rpc 
requests
2012/02/01 15:09:01.290968 [ 1933]: Node became UNHEALTHY. Ask recovery 
master 0 to perform ip reallocation
2012/02/01 15:09:06.309049 [ 1933]: ERROR: more than 3 consecutive 
failures for 01.reclock, marking cluster unhealthy
2012/02/01 15:09:16.326371 [ 1933]: ERROR: more than 3 consecutive 
failures for 01.reclock, marking cluster unhealthy
2012/02/01 15:09:31.343297 [ 1933]: ERROR: more than 3 consecutive 
failures for 01.reclock, marking cluster unhealthy
[.... the same 01.reclock error repeats roughly every 15 seconds ....]
2012/02/01 15:23:17.280197 [ 1933]: ERROR: more than 3 consecutive 
failures for 01.reclock, marking cluster unhealthy
2012/02/01 15:23:32.456560 [ 1933]: Node became HEALTHY. Ask recovery 
master 0 to perform ip reallocation
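The repeating 01.reclock failures suggest the recovery lock file itself 
becomes unreachable during the takeover. A quick way to check that by 
hand would be a plain stat on the lock file, which is essentially what 
01.reclock does (note the 01.reclock(14666)---stat(14668) line in the 
pstree below):

  # does this return immediately, or hang on the gluster mount?
  time stat /mnt/lockers/lock/lock_file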



Here is the content of that debug file:
[root@s2 ~]# cat /tmp/ctdb.event.20120201150846.14718
init(1)-+-01.reclock(14666)---stat(14668)
         |-auditd(1439)---{auditd}(1440)
         |-crond(1864)
         |-ctdbd(1933)-+-ctdbd(1983)
         |             `-ctdbd(14718)-+-60.nfs(14719)-+-60.nfs(14731)
         |                            |               |-exportfs(14727)
         |                            |               |-grep(14728)
         |                            |               |-grep(14729)
         |                            |               `-sed(14730)
         |                            `-sh(14762)---pstree(14763)
         |-glusterd(1482)-+-{glusterd}(1483)
         |                `-{glusterd}(1617)
         |-glusterfs(1591)-+-{glusterfs}(1594)
         |                 |-{glusterfs}(1595)
         |                 `-{glusterfs}(1894)
         |-glusterfs(1760)-+-{glusterfs}(1761)
         |                 |-{glusterfs}(1763)
         |                 `-{glusterfs}(1895)
         |-glusterfsd(1619)-+-{glusterfsd}(1620)
         |                  |-{glusterfsd}(1622)
         |                  |-{glusterfsd}(1670)
         |                  `-{glusterfsd}(1686)
         |-glusterfsd(1623)-+-{glusterfsd}(1624)
         |                  |-{glusterfsd}(1626)
         |                  |-{glusterfsd}(1677)
         |                  `-{glusterfsd}(1689)
         |-glusterfsd(1627)-+-{glusterfsd}(1628)
         |                  |-{glusterfsd}(1630)
         |                  |-{glusterfsd}(1678)
         |                  `-{glusterfsd}(1691)
         |-glusterfsd(1631)-+-{glusterfsd}(1632)
         |                  |-{glusterfsd}(1634)
         |                  |-{glusterfsd}(1672)
         |                  `-{glusterfsd}(1683)
         |-glusterfsd(1635)-+-{glusterfsd}(1636)
         |                  |-{glusterfsd}(1638)
         |                  |-{glusterfsd}(1669)
         |                  |-{glusterfsd}(1682)
         |                  `-{glusterfsd}(25543)
         |-glusterfsd(1639)-+-{glusterfsd}(1640)
         |                  |-{glusterfsd}(1642)
         |                  |-{glusterfsd}(1674)
         |                  |-{glusterfsd}(1681)
         |                  |-{glusterfsd}(3711)
         |                  |-{glusterfsd}(5763)
         |                  `-{glusterfsd}(19444)
         |-glusterfsd(1643)-+-{glusterfsd}(1644)
         |                  |-{glusterfsd}(1646)
         |                  |-{glusterfsd}(1671)
         |                  |-{glusterfsd}(1685)
         |                  `-{glusterfsd}(22163)
         |-glusterfsd(1647)-+-{glusterfsd}(1648)
         |                  |-{glusterfsd}(1650)
         |                  |-{glusterfsd}(1673)
         |                  |-{glusterfsd}(1688)
         |                  `-{glusterfsd}(22506)
         |-glusterfsd(1651)-+-{glusterfsd}(1652)
         |                  |-{glusterfsd}(1654)
         |                  |-{glusterfsd}(1680)
         |                  `-{glusterfsd}(1684)
         |-glusterfsd(1655)-+-{glusterfsd}(1656)
         |                  |-{glusterfsd}(1658)
         |                  |-{glusterfsd}(1675)
         |                  `-{glusterfsd}(1687)
         |-glusterfsd(1659)-+-{glusterfsd}(1660)
         |                  |-{glusterfsd}(1662)
         |                  |-{glusterfsd}(1679)
         |                  `-{glusterfsd}(1692)
         |-glusterfsd(1663)-+-{glusterfsd}(1664)
         |                  |-{glusterfsd}(1666)
         |                  |-{glusterfsd}(1676)
         |                  `-{glusterfsd}(1690)
         |-master(1854)-+-pickup(14136)
         |              `-qmgr(1863)
         |-mingetty(1877)
         |-mingetty(1879)
         |-mingetty(1881)
         |-mingetty(1883)
         |-mingetty(1885)
         |-mingetty(1887)
         |-rpc.idmapd(2437)
         |-rpc.mountd(2415)
         |-rpc.rquotad(2399)
         |-rpc.statd(14722)
         |-rpcbind(1473)
         |-rsyslogd(1455)-+-{rsyslogd}(1457)
         |                |-{rsyslogd}(1458)
         |                `-{rsyslogd}(14015)
         |-sshd(1778)---sshd(1896)---bash(1900)---tail(14653)
         |-statd-callout(14604)
         `-udevd(514)-+-udevd(1893)
                      `-udevd(14725)
1: POSIX  ADVISORY  READ  14727 08:43:1441821 0 EOF
2: FLOCK  ADVISORY  WRITE 1854 08:43:1442827 0 EOF
3: FLOCK  ADVISORY  WRITE 1854 08:43:1442762 0 EOF
4: POSIX  ADVISORY  WRITE 1663 08:43:262794 0 EOF
5: POSIX  ADVISORY  WRITE 1659 08:43:262791 0 EOF
6: POSIX  ADVISORY  WRITE 1655 08:43:262772 0 EOF
7: POSIX  ADVISORY  WRITE 1651 08:43:262770 0 EOF
8: POSIX  ADVISORY  WRITE 1647 08:43:262769 0 EOF
9: POSIX  ADVISORY  WRITE 1643 08:43:262768 0 EOF
10: POSIX  ADVISORY  WRITE 1639 08:43:262767 0 EOF
11: POSIX  ADVISORY  WRITE 1635 08:43:262766 0 EOF
12: POSIX  ADVISORY  WRITE 1631 08:43:262765 0 EOF
13: POSIX  ADVISORY  WRITE 1627 08:43:262764 0 EOF
14: POSIX  ADVISORY  WRITE 1623 08:43:262602 0 EOF
15: POSIX  ADVISORY  WRITE 1619 08:43:262598 0 EOF
16: FLOCK  ADVISORY  WRITE 1472 08:43:1441940 0 EOF
/var/ctdb/:
total 8
1442877 drwxr-xr-x. 2 root root 4096 Jan 31 15:13 persistent
1442880 drwx------. 2 root root 4096 Jan 31 15:13 state
/var/ctdb/persistent:
total 0
[root@s2 ~]#


Additional cluster info follows.
Let me know if you require anything else.


A test NFS mount from a client works:
mount -t nfs stor.v.c.XX.XX:/mnt/primary_dr test_nfs/
[root@mc1 mnt]# df -m
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/xvda               100791      1747     93924   2% /
tmpfs                     1000         0      1000   0% /dev/shm
stor.v.c.XX.XX:/mnt/primary_dr
                        1906819      2507   1904312   1% /mnt/test_nfs
It also works when added to the XenServer pool.

I have two Gluster volumes mounted locally on each node for re-export 
via NFS:
/mnt/primary_dr
/mnt/lockers

Firewalls currently disabled.

Relevant options from the ctdb config (identical on both nodes, s2 and s3):
CTDB_RECOVERY_LOCK="/mnt/lockers/lock/lock_file"
[root@s2 ~]# ls -la /mnt/lockers/lock/lock_file
-rw-r--r--. 1 root root 0 Jan 31 15:30 /mnt/lockers/lock/lock_file
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
[root@s2 ~]# cat /etc/ctdb/public_addresses
10.31.0.41/24
10.31.0.43/24
CTDB_PUBLIC_INTERFACE=bond0
# default is to not manage Samba
# CTDB_MANAGES_SAMBA=yes
# CTDB_NODES=/etc/ctdb/nodes
[root@s2 ~]# cat /etc/ctdb/nodes
10.31.0.40
10.31.0.42
# added as an attempted fix, which didn't work
CTDB_SET_EVENTSCRIPTTIMEOUTCOUNT=44
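The startup logs below do show "Set EVENTSCRIPTTIMEOUTCOUNT to 44", so 
the tunable is picked up; it can also be checked at runtime:

  ctdb listvars | grep -i eventscript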

RR DNS:
[root@s2 ~]# host stor.v.c.XX.XX
stor.v.c.XX.XX has address 10.31.0.43
stor.v.c.XX.XX has address 10.31.0.41

Exports:
[root@s2 ~]# cat /etc/exports
/mnt/primary_dr 10.31.0.0/24(rw,no_root_squash,fsid=1,async)

Relevant options from the nfs config:
CTDB_MANAGES_NFS=yes
NFS_TICKLE_SHARED_DIRECTORY=/mnt/lockers/tickles
[root@s2 ~]# ls -la /mnt/lockers/tickles
drwxr-xr-x. 2 root root  41 Jan 31 21:21 s2.c.XX.XX
drwxr-xr-x. 2 root root  24 Feb  1 11:11 s3.c.XX.XX
STATD_SHARED_DIRECTORY=/mnt/lockers/statd
[root@s2 ~]# ls -la /mnt/lockers/statd
-rw-r--r--. 1 root root   4 Jan 31 21:31 state
NFS_HOSTNAME="stor.v.c.XX.XX"
STATD_HOSTNAME="$NFS_HOSTNAME -H /etc/ctdb/statd-callout -p 97"
#STATD_HOSTNAME="$NFS_HOSTNAME -P "$STATD_SHARED_DIRECTORY/$PUBLIC_IP" 
-H /etc/ctdb/statd-callout -p 97"
RPCNFSDARGS="--no-nfs-version 4"
RQUOTAD_PORT=598
LOCKD_TCPPORT=599
LOCKD_UDPPORT=599
MOUNTD_PORT=597
STATD_PORT=595
STATD_OUTGOING_PORT=596
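To confirm the daemons really bind to those pinned ports, rpcinfo 
should list them on the expected ports (status on 595, mountd on 597, 
rquotad on 598, nlockmgr on 599):

  rpcinfo -p localhost | egrep 'status|nlockmgr|mountd|rquotad'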


Processes running after reboot:
root      1871  0.0  0.0   8864  2620 ?        S<Ls 12:52   0:00 
/usr/sbin/ctdbd --reclock=/mnt/lockers/lock/lock_fi
root      1921  0.0  0.0   8760   680 ?        S<   12:52   0:00  \_ 
/usr/sbin/ctdbd --reclock=/mnt/lockers/lock/loc
root      2444  0.0  0.0 105256   480 ?        S<s  12:52   0:00 
rpc.rquotad -p 598
root      2460  0.0  0.0  21348   944 ?        S<s  12:52   0:00 
rpc.mountd -p 597
root      2482  0.0  0.0  27436   588 ?        S<s  12:52   0:00 rpc.idmapd
rpcuser   2660  0.0  0.0  23200  1212 ?        S<s  12:52   0:00 
rpc.statd -n stor.v.c.XX.XX -H /etc/ctdb/statd-cal

Log of s3 starting up, before s2 joins the cluster:

2012/02/01 12:52:01.738705 [ 1871]: Starting CTDBD as pid : 1871
2012/02/01 12:52:01.851167 [ 1871]: Set EVENTSCRIPTTIMEOUTCOUNT to 44
2012/02/01 12:52:02.845493 [ 1871]: Freeze priority 1
2012/02/01 12:52:02.845530 [ 1871]: Freeze priority 2
2012/02/01 12:52:02.845552 [ 1871]: Freeze priority 3
2012/02/01 12:52:06.847557 [ 1921]: Taking out recovery lock from 
recovery daemon
2012/02/01 12:52:06.847591 [ 1921]: Take the recovery lock
2012/02/01 12:52:06.855576 [ 1921]: Recovery lock taken successfully
2012/02/01 12:52:06.855641 [ 1921]: Recovery lock taken successfully by 
recovery daemon
2012/02/01 12:52:06.855674 [ 1871]: Freeze priority 1
2012/02/01 12:52:06.855708 [ 1871]: Freeze priority 2
2012/02/01 12:52:06.855729 [ 1871]: Freeze priority 3
2012/02/01 12:52:06.938634 [ 1871]: Thawing priority 1
2012/02/01 12:52:06.938650 [ 1871]: Release freeze handler for prio 1
2012/02/01 12:52:06.938661 [ 1871]: Thawing priority 2
2012/02/01 12:52:06.938667 [ 1871]: Release freeze handler for prio 2
2012/02/01 12:52:06.938674 [ 1871]: Thawing priority 3
2012/02/01 12:52:06.938678 [ 1871]: Release freeze handler for prio 3
2012/02/01 12:52:07.023948 [ 1921]: Resetting ban count to 0 for all nodes
2012/02/01 12:52:18.025556 [ 1921]: Trigger takeoverrun
2012/02/01 12:52:22.034612 [ 1921]: Trigger takeoverrun
2012/02/01 12:52:22.174894 [ 1871]: Starting NFS statd: [  OK  ]
2012/02/01 12:52:22.234749 [ 1871]: Sending NFS tickle ack for 
10.31.0.43 to 10.31.0.21:986
2012/02/01 12:52:22.238716 [ 1871]: Sending NFS tickle ack for 
10.31.0.43 to 10.31.0.24:1009
2012/02/01 12:52:22.248584 [ 1871]: Starting NFS services:  [  OK  ]
2012/02/01 12:52:22.257733 [ 1871]: Starting NFS quotas: [  OK  ]
2012/02/01 12:52:22.268410 [ 1871]: Starting NFS daemon: [  OK  ]
2012/02/01 12:52:22.273655 [ 1871]: Starting NFS mountd: [  OK  ]
2012/02/01 12:52:22.308992 [ 1871]: Starting RPC idmapd: [  OK  ]
2012/02/01 12:52:23.259697 [ 1921]: Trigger takeoverrun
2012/02/01 12:52:27.448730 [ 1871]: Node became HEALTHY. Ask recovery 
master 1 to perform ip reallocation
2012/02/01 12:55:13.949859 [ 1871]: Event script timed out : 60.nfs 
monitor  count : 0  pid : 3767
2012/02/01 12:55:13.949971 [ 1871]: server/eventscript.c:539 Sending 
SIGTERM to child pid:3767
2012/02/01 12:55:13.950043 [ 3767]: Timed out running script 
'/etc/ctdb/events.d/60.nfs monitor ' after 29.9 seconds pid :3767
2012/02/01 12:55:13.984596 [ 3767]: Logged timedout eventscript : { 
pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } 
 >/tmp/ctdb.event.20120201125513.3767

After s2 joins the cluster:
2012/02/01 12:56:25.697150 [ 1921]: Taking out recovery lock from 
recovery daemon
2012/02/01 12:56:25.697196 [ 1921]: Take the recovery lock
2012/02/01 12:56:25.699737 [ 1921]: Recovery lock taken successfully
2012/02/01 12:56:25.699774 [ 1921]: Recovery lock taken successfully by 
recovery daemon
2012/02/01 12:56:25.700013 [ 1871]: Freeze priority 1
2012/02/01 12:56:25.700182 [ 1871]: Freeze priority 2
2012/02/01 12:56:25.700310 [ 1871]: Freeze priority 3
2012/02/01 12:56:25.784428 [ 1871]: Thawing priority 1
2012/02/01 12:56:25.784450 [ 1871]: Release freeze handler for prio 1
2012/02/01 12:56:25.784466 [ 1871]: Thawing priority 2
2012/02/01 12:56:25.784473 [ 1871]: Release freeze handler for prio 2
2012/02/01 12:56:25.784483 [ 1871]: Thawing priority 3
2012/02/01 12:56:25.784490 [ 1871]: Release freeze handler for prio 3
2012/02/01 12:56:25.874375 [ 1921]: Resetting ban count to 0 for all nodes
2012/02/01 12:56:36.878858 [ 1921]: server/ctdb_recoverd.c:3269 Remote 
node:0 has different flags for node 1. It has 0x02 vs our 0x02
2012/02/01 12:56:36.878890 [ 1921]: Use flags 0x00 from local recmaster 
node for cluster update of node 1 flags
2012/02/01 12:56:36.879144 [ 1921]: Taking out recovery lock from 
recovery daemon
2012/02/01 12:56:36.879161 [ 1921]: Take the recovery lock
2012/02/01 12:56:36.880979 [ 1921]: Recovery lock taken successfully
2012/02/01 12:56:36.881031 [ 1921]: Recovery lock taken successfully by 
recovery daemon
2012/02/01 12:56:36.881272 [ 1871]: Freeze priority 1
2012/02/01 12:56:36.881549 [ 1871]: Freeze priority 2
2012/02/01 12:56:36.881782 [ 1871]: Freeze priority 3
2012/02/01 12:56:36.967153 [ 1871]: Thawing priority 1
2012/02/01 12:56:36.967178 [ 1871]: Release freeze handler for prio 1
2012/02/01 12:56:36.967193 [ 1871]: Thawing priority 2
2012/02/01 12:56:36.967199 [ 1871]: Release freeze handler for prio 2
2012/02/01 12:56:36.967207 [ 1871]: Thawing priority 3
2012/02/01 12:56:36.967214 [ 1871]: Release freeze handler for prio 3
2012/02/01 12:56:37.052788 [ 1921]: Resetting ban count to 0 for all nodes
2012/02/01 12:56:59.353112 [ 1921]: Trigger takeoverrun

s2 startup:
2012/02/01 12:51:01.722577 [ 1854]: Starting CTDBD as pid : 1854
2012/02/01 12:51:01.833680 [ 1854]: Set EVENTSCRIPTTIMEOUTCOUNT to 44
2012/02/01 12:51:06.559672 [ 1854]: Freeze priority 1
2012/02/01 12:51:06.559836 [ 1854]: Freeze priority 2
2012/02/01 12:51:06.559965 [ 1854]: Freeze priority 3
2012/02/01 12:51:06.644067 [ 1854]: Thawing priority 1
2012/02/01 12:51:06.644108 [ 1854]: Release freeze handler for prio 1
2012/02/01 12:51:06.644120 [ 1854]: Thawing priority 2
2012/02/01 12:51:06.644125 [ 1854]: Release freeze handler for prio 2
2012/02/01 12:51:06.644135 [ 1854]: Thawing priority 3
2012/02/01 12:51:06.644209 [ 1854]: Release freeze handler for prio 3
2012/02/01 12:51:06.831875 [ 1904]: Trigger takeoverrun
2012/02/01 12:51:17.736531 [ 1854]: server/ctdb_monitor.c:472 Node 1 
became healthy - force recovery for startup
2012/02/01 12:51:17.738801 [ 1854]: Freeze priority 1
2012/02/01 12:51:17.739081 [ 1854]: Freeze priority 2
2012/02/01 12:51:17.739313 [ 1854]: Freeze priority 3
2012/02/01 12:51:17.824675 [ 1854]: Thawing priority 1
2012/02/01 12:51:17.824724 [ 1854]: Release freeze handler for prio 1
2012/02/01 12:51:17.824735 [ 1854]: Thawing priority 2
2012/02/01 12:51:17.824740 [ 1854]: Release freeze handler for prio 2
2012/02/01 12:51:17.824747 [ 1854]: Thawing priority 3
2012/02/01 12:51:17.824752 [ 1854]: Release freeze handler for prio 3
2012/02/01 12:51:32.881409 [ 1904]: Trigger takeoverrun
2012/02/01 12:51:33.150668 [ 1854]: Starting NFS statd: [  OK  ]
2012/02/01 12:51:33.221791 [ 1854]: Starting NFS services:  [  OK  ]
2012/02/01 12:51:33.226639 [ 1854]: Starting NFS quotas: [  OK  ]
2012/02/01 12:51:33.244140 [ 1854]: Starting NFS daemon: [  OK  ]
2012/02/01 12:51:33.249638 [ 1854]: Starting NFS mountd: [  OK  ]
2012/02/01 12:51:33.285769 [ 1854]: Starting RPC idmapd: [  OK  ]
2012/02/01 12:51:38.535450 [ 1854]: Node became HEALTHY. Ask recovery 
master 1 to perform ip reallocation
2012/02/01 12:51:39.896648 [ 1904]: Trigger takeoverrun


RPCinfo:
[root@s3 ~]# rpcinfo -u localhost 100024 1
program 100024 version 1 ready and waiting
[root@s2 ~]# rpcinfo -u localhost 100024 1
program 100024 version 1 ready and waiting

The public IPs are also up:

[root@s3 ~]# ip a l | grep 10.31.
     inet 10.31.0.42/24 brd 10.31.0.255 scope global bond0
     inet 10.31.0.41/24 brd 10.31.0.255 scope global secondary bond0
[root@s2 ~]# ip a l | grep 10.31.
     inet 10.31.0.40/8 brd 10.255.255.255 scope global bond0
     inet 10.31.0.43/24 brd 10.31.0.255 scope global bond0


Some cluster status:
[root@s2 ~]# ctdb status
Number of nodes:2
pnn:0 10.31.0.40       OK (THIS NODE)
pnn:1 10.31.0.42       OK
Generation:1286608009
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:1
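ctdb ip shows the same assignment from CTDB's side; given the ip 
output above I would expect it to report 10.31.0.41 on node 1 and 
10.31.0.43 on node 0:

  ctdb ip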


Services started automagically:
[root@s2 ~]# chkconfig --list
auditd         	0:off	1:off	2:on	3:on	4:on	5:on	6:off
crond          	0:off	1:off	2:on	3:on	4:on	5:on	6:off
glusterd       	0:off	1:off	2:off	3:on	4:off	5:on	6:off
ip6tables      	0:off	1:off	2:on	3:on	4:on	5:on	6:off
netconsole     	0:off	1:off	2:off	3:off	4:off	5:off	6:off
netfs          	0:off	1:off	2:off	3:on	4:on	5:on	6:off
network        	0:off	1:off	2:on	3:on	4:on	5:on	6:off
postfix        	0:off	1:off	2:on	3:on	4:on	5:on	6:off
rdisc          	0:off	1:off	2:off	3:off	4:off	5:off	6:off
restorecond    	0:off	1:off	2:off	3:off	4:off	5:off	6:off
rpcbind        	0:off	1:off	2:on	3:on	4:on	5:on	6:off
rsyslog        	0:off	1:off	2:on	3:on	4:on	5:on	6:off
saslauthd      	0:off	1:off	2:off	3:off	4:off	5:off	6:off
sshd           	0:off	1:off	2:on	3:on	4:on	5:on	6:off
udev-post      	0:off	1:on	2:on	3:on	4:on	5:on	6:off

There is also a cron check for ctdb:
[root@s3 ~]# crontab -l
* * * * * /etc/init.d/ctdb cron > /dev/null 2>&1


It seems to me that there are multiple problems in the events.d 
scripts, such as not detecting an already running rpc.statd before 
trying to restart it.
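For example, this is the sort of check I would expect 60.nfs to do 
before restarting statd (a hypothetical sketch, not the actual script; 
RPC program 100024 is status, i.e. rpc.statd):

  if rpcinfo -u localhost 100024 1 >/dev/null 2>&1 ; then
      : # statd is registered and answering; leave it alone
  else
      service nfslock restart
  fi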

Thank you for your help.
All suggestions are welcome, even for paid support. ;-)
Regards,
-- 
Martin Gombač
ISG d.o.o.
00386 (0)1 620 75 03

