CTDB fail-over fails, NFS only, glusterfs backend, for XenServer VMs

ronnie sahlberg ronniesahlberg at gmail.com
Thu Feb 2 13:26:36 MST 2012


Hi,

10 minutes must surely mean something is wrong?
That is very, very long.


If you need it, I can raise the maximum allowed timeout from 100 to
1000 seconds,
but even 100 seconds is a very long time for something to be unresponsive.
If I make this change it will go into the master branch and not one of
the older branches, so you would have to recompile
ctdbd from source.
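
For reference, building ctdbd from the master sources is roughly the
usual autotools dance (commands from memory, so double-check against
the README in the tree):

    git clone git://git.samba.org/ctdb.git
    cd ctdb
    ./autogen.sh
    ./configure
    make
    make install     # as root, or build a package for your distro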

A better option would probably be to
1, remove the reclock file completely from /etc/sysconfig/ctdb (see
the sketch after this list). You can run ctdb without a reclock file
at all if you have problems with locking on your filesystem. It means
ctdb will no longer try to lock a file there and will thus no longer
detect the reclock problem and go unhealthy.
On the other hand, it also means that you no longer have split-brain
protection, so be careful ...
2, tweak the NFS detection in the /etc/ctdb/events.d/60.nfs script.
The script that comes with master can be modified to use arbitrarily
high values for when it will "detect nfs is dead and trigger
UNHEALTHY/failover".
Again, the drawback of setting it higher is that you increase the time
it takes for ctdb to detect a problem and trigger a recovery by moving
clients off the failed node and onto another node.
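
For option 1, a minimal sketch of the change in /etc/sysconfig/ctdb,
assuming your reclock is configured through CTDB_RECOVERY_LOCK as in
the config you posted (adjust the path to match your setup):

    # /etc/sysconfig/ctdb
    # Comment out the recovery lock so ctdbd starts without a reclock file.
    # NOTE: this is exactly what removes the split-brain protection.
    #CTDB_RECOVERY_LOCK="/mnt/lockers/lock/lock_file"

Then restart ctdb on each node (e.g. "service ctdb restart") so the
change takes effect.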




/etc/ctdb/events.d/60.nfs :
...
        # check that NFS responds to rpc requests
        if [ "$CTDB_NFS_SKIP_KNFSD_ALIVE_CHECK" != "yes" ] ; then
            nfs_check_rpc_service "knfsd" \
                -ge 6 "verbose unhealthy" \
                -eq 4 "verbose restart" \
                -eq 2 "restart:bs"
        fi
...

The line "-ge 6" tells the script that IF nfs has been seen dead for 6
monitor checks in a row, then make the node unhealthy.
Monitor checks happen between 1-10 seconds apart depending on node status,
so it is hard to set it to "exactly xyz seconds",
but you can experiment with setting it much higher.
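
For example, to ride out a multi-minute gluster recovery you could try
something like this in 60.nfs (the numbers below are only a guess, not
tested values, so tune them for your setup):

        if [ "$CTDB_NFS_SKIP_KNFSD_ALIVE_CHECK" != "yes" ] ; then
            nfs_check_rpc_service "knfsd" \
                -ge 60 "verbose unhealthy" \
                -eq 40 "verbose restart" \
                -eq 20 "restart:bs"
        fi

With monitor runs 1-10 seconds apart, "-ge 60" would mean roughly one
to ten minutes of consecutive failed checks before the node is marked
unhealthy.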



These two changes, removing the reclock file and experimenting with
60.nfs, should let you tweak the system so that it survives the node
failure without marking the node unhealthy.


But, 10 minutes does sound like something is wrong in the underlying system.
I don't have experience with gluster myself, but I think you should first
try to find out why your failover/recovery takes this long,
and then try the two tweaks above if nothing else can be done.


regards
ronnie sahlberg


On Fri, Feb 3, 2012 at 2:26 AM, Martin Gombac <martin at isg.si> wrote:
> Just an update.
> Volume for locks took more than 10 minutes to become available again using
> gluster as a back-end. This is why NFS was not responding and why ctdb
> marked itself unhealthy.
>
> [root at s3 bricks]# time df -m
>
> Filesystem           1M-blocks      Used Available Use% Mounted on
> s3.c.XX.XX:/lockers   1906819      7282   1899538   1% /mnt/lockers
> real    11m58.676s
>
> I would also like to note that the maximum value that ctdb
> EventScriptTimeout uses IS 100 seconds, even if you set it to more, like 300
> seconds. I wish it could be set to more, so we could use dodgy gluster fs.
> :-)
>
> Regards,
>
> Martin Gombač
> ISG d.o.o.
> 00386 (0)1 620 75 03
>
> On 2/2/12 2:36 PM, Martin Gombac wrote:
>>
>> Hi Ronnie,
>>
>> thank you for your explanation. It was just what was needed. GlusterFS takes
>> some time to recover when I kill one node. I will try to increase the
>> timeout values for ctdb and hopefully it will not mark itself as
>> unhealthy.
>>
>> What scalable, redundant FS would you suggest besides gluster? By
>> scalable I mean it can extend over many network-attached nodes, and by
>> redundant I mean it can survive the loss of a node or a spindle.
>>
>> Thank you.
>> Martin
>>
>> On 2/2/12 10:26 AM, ronnie sahlberg wrote:
>>>
>>> 2012/02/01 15:09:01.290595 [ 1933]: rpcinfo: RPC: Timed out
>>> 2012/02/01 15:09:01.290757 [ 1933]: ERROR: NFS not responding to rpc
>>> requests
>>> 2012/02/01 15:09:01.290968 [ 1933]: Node became UNHEALTHY. Ask
>>> recovery master 0 to perform ip reallocation
>>> 2012/02/01 15:09:06.309049 [ 1933]: ERROR: more than 3 consecutive
>>> failures for 01.reclock, marking cluster unhealthy
>>> 2012/02/01 15:09:16.326371 [ 1933]: ERROR: more than 3 consecutive
>>> failures for 01.reclock, marking cluster unhealthy
>>>
>>> Looks like your underlying filesystem has died.
>>> You took out one node and your filesystem has hung, causing knfsd to
>>> hang inside the kernel.
>>>
>>> The hung knfsd caused the first two warning lines.
>>>
>>> The last two lines of log are from ctdb itself and also indicate that
>>> the filesystem used for the recovery lock file has hung.
>>>
>>>
>>> There is nothing ctdb can do here when your filesystem is dodgy.
>>> Try a different filesystem. Other filesystems might work better.
>>>
>>>
>>> regards
>>> ronnie sahlberg
>>>
>>>
>>> On Thu, Feb 2, 2012 at 8:15 PM, Orlando Richards
>>> <orlando.richards at ed.ac.uk> wrote:
>>>>
>>>> Hi Martin,
>>>>
>>>>
>>>>
>>>> On -10/01/37 20:59, Martin Gombac wrote:
>>>>>
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I have a problem with my 2-node CTDB NFS cluster when one node has to
>>>>> take over the other node's resources.
>>>>>
>>>>> Only NFS HA, 2 nodes for now, will grow later.
>>>>> CentOS 6.2: 2.6.32-220.4.1.el6.x86_64
>>>>> CTDB: 1.0.114.3
>>>>> Gluster 3.2.5 for shared storage.
>>>>> XenServer 5.6 SP2 for NFS client (= 5.*).
>>>>>
>>>>>
>>>>> Once both nodes are up, the NFS service works using either of the two
>>>>> nodes
>>>>> concurrently (round-robin DNS). But once I shut down one NFS storage
>>>>> node, everything fails. :-)
>>>>
>>>>
>>>>
>>>> When you say "shut down", what do you mean? Do you mean shutting down
>>>> ctdb
>>>> (service ctdb stop), or doing a shutdown of the node?
>>>>
>>>> The reason I ask is because the event script timeout below looks to
>>>> me like
>>>> the filesystem has paused (possibly due to shutting down one of the file
>>>> system servers, thus causing it to go into recovery mode), and then
>>>> the NFS
>>>> script which writes the tickle info to the filesystem is timing out.
>>>> We see
>>>> this even with GPFS under certain circumstances. Some of the GPFS
>>>> timeouts
>>>> are set to 60 seconds, so as a workaround I've done:
>>>>
>>>> ctdb setvar EventScriptTimeout=90
>>>>
>>>> and added:
>>>> CTDB_SET_EventScriptTimeout=90
>>>> to /etc/sysconfig/ctdb
>>>>
>>>> (changing from the default of 30). We've only just started noticing this
>>>> though (previously, we didn't manage NFS with CTDB), so I don't know
>>>> if this
>>>> tweak will work or not, or what the repercussions of such a large
>>>> timeout
>>>> will be!
>>>>
>>>> Hope that helps,
>>>>
>>>> --
>>>> Orlando
>>>>
>>>>
>>>>> Here is what log says when shutting down s3 node and keeping s2 up.
>>>>>
>>>>>
>>>>> 2012/02/01 15:08:03.219127 [ 1983]: server/ctdb_recoverd.c:3295 The
>>>>> vnnmap count is different from the number of active nodes. 2 vs 1
>>>>> 2012/02/01 15:08:03.219159 [ 1983]: Taking out recovery lock from
>>>>> recovery daemon
>>>>> 2012/02/01 15:08:03.219166 [ 1983]: Take the recovery lock
>>>>> 2012/02/01 15:08:03.223519 [ 1983]: Recovery lock taken successfully
>>>>> 2012/02/01 15:08:03.223552 [ 1983]: Recovery lock taken successfully by
>>>>> recovery daemon
>>>>> 2012/02/01 15:08:03.223610 [ 1933]: Freeze priority 1
>>>>> 2012/02/01 15:08:03.223735 [ 1933]: Freeze priority 2
>>>>> 2012/02/01 15:08:03.223836 [ 1933]: Freeze priority 3
>>>>> 2012/02/01 15:08:03.307360 [ 1933]: Thawing priority 1
>>>>> 2012/02/01 15:08:03.307379 [ 1933]: Release freeze handler for prio 1
>>>>> 2012/02/01 15:08:03.307389 [ 1933]: Thawing priority 2
>>>>> 2012/02/01 15:08:03.307394 [ 1933]: Release freeze handler for prio 2
>>>>> 2012/02/01 15:08:03.307401 [ 1933]: Thawing priority 3
>>>>> 2012/02/01 15:08:03.307406 [ 1933]: Release freeze handler for prio 3
>>>>> 2012/02/01 15:08:03.402667 [ 1933]: Sending NFS tickle ack for
>>>>> 10.31.0.41 to 10.31.0.21:980
>>>>> 2012/02/01 15:08:03.405015 [ 1933]: Sending NFS tickle ack for
>>>>> 10.31.0.41 to 10.31.0.22:1018
>>>>> 2012/02/01 15:08:03.506316 [ 1983]: Resetting ban count to 0 for all
>>>>> nodes
>>>>> 2012/02/01 15:08:14.508640 [ 1983]: Trigger takeoverrun
>>>>> 2012/02/01 15:08:16.267190 [ 1933]: rpcinfo: RPC: Program not
>>>>> registered
>>>>> 2012/02/01 15:08:16.272316 [ 1933]: ERROR: STATD is not responding.
>>>>> Trying to restart it. [rpc.statd -n stor.v.c.XX.XX -H
>>>>> /etc/ctdb/statd-callout -p 97 -p 595 -o 596]
>>>>> 2012/02/01 15:08:46.177912 [ 1933]: Event script timed out : 60.nfs
>>>>> monitor count : 0 pid : 14718
>>>>> 2012/02/01 15:08:46.177957 [ 1933]: server/eventscript.c:539 Sending
>>>>> SIGTERM to child pid:14718
>>>>> 2012/02/01 15:08:46.178012 [14718]: Timed out running script
>>>>> '/etc/ctdb/events.d/60.nfs monitor ' after 29.9 seconds pid :14718
>>>>> 2012/02/01 15:08:46.193493 [14718]: Logged timedout eventscript : {
>>>>> pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; }
>>>>> >/tmp/ctdb.event.20120201150846.14718
>>>>> 2012/02/01 15:09:01.290595 [ 1933]: rpcinfo: RPC: Timed out
>>>>> 2012/02/01 15:09:01.290757 [ 1933]: ERROR: NFS not responding to rpc
>>>>> requests
>>>>> 2012/02/01 15:09:01.290968 [ 1933]: Node became UNHEALTHY. Ask recovery
>>>>> master 0 to perform ip reallocation
>>>>> 2012/02/01 15:09:06.309049 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:09:16.326371 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:09:31.343297 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:09:46.360558 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:10:01.377255 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:10:16.394231 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:10:31.411283 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:10:46.428437 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:11:01.445068 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:11:16.462270 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:11:31.479486 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:11:46.496276 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:12:01.512953 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:12:16.530080 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:12:31.547281 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:12:46.564218 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:13:01.580857 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:13:16.597282 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:13:31.614385 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> ....
>>>>> 2012/02/01 15:22:02.194870 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:22:17.212204 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:22:32.229226 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:22:47.246081 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:23:02.262845 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:23:17.280197 [ 1933]: ERROR: more than 3 consecutive
>>>>> failures for 01.reclock, marking cluster unhealthy
>>>>> 2012/02/01 15:23:32.456560 [ 1933]: Node became HEALTHY. Ask recovery
>>>>> master 0 to perform ip reallocation
>>>>>
>>>>>
>>>>>
>>>>> Here is cat of that debug file:
>>>>> [root at s2 ~]# cat /tmp/ctdb.event.20120201150846.14718
>>>>> init(1)-+-01.reclock(14666)---stat(14668)
>>>>> |-auditd(1439)---{auditd}(1440)
>>>>> |-crond(1864)
>>>>> |-ctdbd(1933)-+-ctdbd(1983)
>>>>> | `-ctdbd(14718)-+-60.nfs(14719)-+-60.nfs(14731)
>>>>> | | |-exportfs(14727)
>>>>> | | |-grep(14728)
>>>>> | | |-grep(14729)
>>>>> | | `-sed(14730)
>>>>> | `-sh(14762)---pstree(14763)
>>>>> |-glusterd(1482)-+-{glusterd}(1483)
>>>>> | `-{glusterd}(1617)
>>>>> |-glusterfs(1591)-+-{glusterfs}(1594)
>>>>> | |-{glusterfs}(1595)
>>>>> | `-{glusterfs}(1894)
>>>>> |-glusterfs(1760)-+-{glusterfs}(1761)
>>>>> | |-{glusterfs}(1763)
>>>>> | `-{glusterfs}(1895)
>>>>> |-glusterfsd(1619)-+-{glusterfsd}(1620)
>>>>> | |-{glusterfsd}(1622)
>>>>> | |-{glusterfsd}(1670)
>>>>> | `-{glusterfsd}(1686)
>>>>> |-glusterfsd(1623)-+-{glusterfsd}(1624)
>>>>> | |-{glusterfsd}(1626)
>>>>> | |-{glusterfsd}(1677)
>>>>> | `-{glusterfsd}(1689)
>>>>> |-glusterfsd(1627)-+-{glusterfsd}(1628)
>>>>> | |-{glusterfsd}(1630)
>>>>> | |-{glusterfsd}(1678)
>>>>> | `-{glusterfsd}(1691)
>>>>> |-glusterfsd(1631)-+-{glusterfsd}(1632)
>>>>> | |-{glusterfsd}(1634)
>>>>> | |-{glusterfsd}(1672)
>>>>> | `-{glusterfsd}(1683)
>>>>> |-glusterfsd(1635)-+-{glusterfsd}(1636)
>>>>> | |-{glusterfsd}(1638)
>>>>> | |-{glusterfsd}(1669)
>>>>> | |-{glusterfsd}(1682)
>>>>> | `-{glusterfsd}(25543)
>>>>> |-glusterfsd(1639)-+-{glusterfsd}(1640)
>>>>> | |-{glusterfsd}(1642)
>>>>> | |-{glusterfsd}(1674)
>>>>> | |-{glusterfsd}(1681)
>>>>> | |-{glusterfsd}(3711)
>>>>> | |-{glusterfsd}(5763)
>>>>> | `-{glusterfsd}(19444)
>>>>> |-glusterfsd(1643)-+-{glusterfsd}(1644)
>>>>> | |-{glusterfsd}(1646)
>>>>> | |-{glusterfsd}(1671)
>>>>> | |-{glusterfsd}(1685)
>>>>> | `-{glusterfsd}(22163)
>>>>> |-glusterfsd(1647)-+-{glusterfsd}(1648)
>>>>> | |-{glusterfsd}(1650)
>>>>> | |-{glusterfsd}(1673)
>>>>> | |-{glusterfsd}(1688)
>>>>> | `-{glusterfsd}(22506)
>>>>> |-glusterfsd(1651)-+-{glusterfsd}(1652)
>>>>> | |-{glusterfsd}(1654)
>>>>> | |-{glusterfsd}(1680)
>>>>> | `-{glusterfsd}(1684)
>>>>> |-glusterfsd(1655)-+-{glusterfsd}(1656)
>>>>> | |-{glusterfsd}(1658)
>>>>> | |-{glusterfsd}(1675)
>>>>> | `-{glusterfsd}(1687)
>>>>> |-glusterfsd(1659)-+-{glusterfsd}(1660)
>>>>> | |-{glusterfsd}(1662)
>>>>> | |-{glusterfsd}(1679)
>>>>> | `-{glusterfsd}(1692)
>>>>> |-glusterfsd(1663)-+-{glusterfsd}(1664)
>>>>> | |-{glusterfsd}(1666)
>>>>> | |-{glusterfsd}(1676)
>>>>> | `-{glusterfsd}(1690)
>>>>> |-master(1854)-+-pickup(14136)
>>>>> | `-qmgr(1863)
>>>>> |-mingetty(1877)
>>>>> |-mingetty(1879)
>>>>> |-mingetty(1881)
>>>>> |-mingetty(1883)
>>>>> |-mingetty(1885)
>>>>> |-mingetty(1887)
>>>>> |-rpc.idmapd(2437)
>>>>> |-rpc.mountd(2415)
>>>>> |-rpc.rquotad(2399)
>>>>> |-rpc.statd(14722)
>>>>> |-rpcbind(1473)
>>>>> |-rsyslogd(1455)-+-{rsyslogd}(1457)
>>>>> | |-{rsyslogd}(1458)
>>>>> | `-{rsyslogd}(14015)
>>>>> |-sshd(1778)---sshd(1896)---bash(1900)---tail(14653)
>>>>> |-statd-callout(14604)
>>>>> `-udevd(514)-+-udevd(1893)
>>>>> `-udevd(14725)
>>>>> 1: POSIX ADVISORY READ 14727 08:43:1441821 0 EOF
>>>>> 2: FLOCK ADVISORY WRITE 1854 08:43:1442827 0 EOF
>>>>> 3: FLOCK ADVISORY WRITE 1854 08:43:1442762 0 EOF
>>>>> 4: POSIX ADVISORY WRITE 1663 08:43:262794 0 EOF
>>>>> 5: POSIX ADVISORY WRITE 1659 08:43:262791 0 EOF
>>>>> 6: POSIX ADVISORY WRITE 1655 08:43:262772 0 EOF
>>>>> 7: POSIX ADVISORY WRITE 1651 08:43:262770 0 EOF
>>>>> 8: POSIX ADVISORY WRITE 1647 08:43:262769 0 EOF
>>>>> 9: POSIX ADVISORY WRITE 1643 08:43:262768 0 EOF
>>>>> 10: POSIX ADVISORY WRITE 1639 08:43:262767 0 EOF
>>>>> 11: POSIX ADVISORY WRITE 1635 08:43:262766 0 EOF
>>>>> 12: POSIX ADVISORY WRITE 1631 08:43:262765 0 EOF
>>>>> 13: POSIX ADVISORY WRITE 1627 08:43:262764 0 EOF
>>>>> 14: POSIX ADVISORY WRITE 1623 08:43:262602 0 EOF
>>>>> 15: POSIX ADVISORY WRITE 1619 08:43:262598 0 EOF
>>>>> 16: FLOCK ADVISORY WRITE 1472 08:43:1441940 0 EOF
>>>>> /var/ctdb/:
>>>>> total 8
>>>>> 1442877 drwxr-xr-x. 2 root root 4096 Jan 31 15:13 persistent
>>>>> 1442880 drwx------. 2 root root 4096 Jan 31 15:13 state
>>>>> /var/ctdb/persistent:
>>>>> total 0
>>>>> [root at s2 ~]#
>>>>>
>>>>>
>>>>> Additional cluster info follows.
>>>>> Let me know if you require anything else.
>>>>>
>>>>>
>>>>> Using one test NFS user mount works:
>>>>> mount -t nfs stor.v.c.XX.XX:/mnt/primary_dr test_nfs/
>>>>> [root at mc1 mnt]# df -m
>>>>> Filesystem 1M-blocks Used Available Use% Mounted on
>>>>> /dev/xvda 100791 1747 93924 2% /
>>>>> tmpfs 1000 0 1000 0% /dev/shm
>>>>> stor.v.c.XX.XX:/mnt/primary_dr
>>>>> 1906819 2507 1904312 1% /mnt/test_nfs
>>>>> it also works when added to XenServer pool.
>>>>>
>>>>> Have two gluster volumes mounted locally on each node for re-export via
>>>>> NFS:
>>>>> /mnt/primary_dr
>>>>> /mnt/lockers
>>>>>
>>>>> Firewalls currently disabled.
>>>>>
>>>>> Relevant options from ctdb (the same on both nodes s2 & s3):
>>>>> CTDB_RECOVERY_LOCK="/mnt/lockers/lock/lock_file"
>>>>>
>>>>> [root at s2 ~]# ls -la /mnt/lockers/lock/lock_file
>>>>> -rw-r--r--. 1 root root 0 Jan 31 15:30 /mnt/lockers/lock/lock_file
>>>>> CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
>>>>>
>>>>> [root at s2 ~]# cat /etc/ctdb/public_addresses
>>>>> 10.31.0.41/24
>>>>> 10.31.0.43/24
>>>>> CTDB_PUBLIC_INTERFACE=bond0
>>>>>
>>>>> # default is to not manage Samba
>>>>> # CTDB_MANAGES_SAMBA=yes
>>>>> # CTDB_NODES=/etc/ctdb/nodes
>>>>>
>>>>> [root at s2 ~]# cat /etc/ctdb/nodes
>>>>> 10.31.0.40
>>>>> 10.31.0.42
>>>>> #added for a fix, which didn't work.
>>>>> CTDB_SET_EVENTSCRIPTTIMEOUTCOUNT=44
>>>>>
>>>>>
>>>>> RR DNS:
>>>>> [root at s2 ~]# host stor.v.c.XX.XX
>>>>> stor.v.c.XX.XX has address 10.31.0.43
>>>>> stor.v.c.XX.XX has address 10.31.0.41
>>>>>
>>>>> Exports:
>>>>> [root at s2 ~]# cat /etc/exports
>>>>> /mnt/primary_dr 10.31.0.0/24(rw,no_root_squash,fsid=async)
>>>>>
>>>>> Relevant options from nfs:
>>>>> CTDB_MANAGES_NFS=yes
>>>>> NFS_TICKLE_SHARED_DIRECTORY=/mnt/lockers/tickles
>>>>>
>>>>> [root at s2 ~]# ls -la /mnt/lockers/tickles
>>>>> drwxr-xr-x. 2 root root 41 Jan 31 21:21 s2.c.XX.XX
>>>>> drwxr-xr-x. 2 root root 24 Feb 1 11:11 s3.c.XX.XX
>>>>> STATD_SHARED_DIRECTORY=/mnt/lockers/statd
>>>>>
>>>>> [root at s2 ~]# ls -la /mnt/lockers/statd
>>>>> -rw-r--r--. 1 root root 4 Jan 31 21:31 state
>>>>> NFS_HOSTNAME="stor.v.c.XX.XX"
>>>>> STATD_HOSTNAME="$NFS_HOSTNAME -H /etc/ctdb/statd-callout -p 97"
>>>>> #STATD_HOSTNAME="$NFS_HOSTNAME -P "$STATD_SHARED_DIRECTORY/$PUBLIC_IP" -H
>>>>> /etc/ctdb/statd-callout -p 97"
>>>>> RPCNFSDARGS="--no-nfs-version 4"
>>>>> RQUOTAD_PORT=598
>>>>> LOCKD_TCPPORT=599
>>>>> LOCKD_UDPPORT=599
>>>>> MOUNTD_PORT=597
>>>>> STATD_PORT=595
>>>>> STATD_OUTGOING_PORT=596
>>>>>
>>>>>
>>>>>
>>>>> Processes on after reboot:
>>>>> root 1871 0.0 0.0 8864 2620 ? S<Ls 12:52 0:00 /usr/sbin/ctdbd
>>>>> --reclock=/mnt/lockers/lock/lock_fi
>>>>>
>>>>> root 1921 0.0 0.0 8760 680 ? S< 12:52 0:00 \_ /usr/sbin/ctdbd
>>>>> --reclock=/mnt/lockers/lock/loc
>>>>>
>>>>> root 2444 0.0 0.0 105256 480 ? S<s 12:52 0:00 rpc.rquotad -p 598
>>>>> root 2460 0.0 0.0 21348 944 ? S<s 12:52 0:00 rpc.mountd -p 597
>>>>> root 2482 0.0 0.0 27436 588 ? S<s 12:52 0:00 rpc.idmapd
>>>>> rpcuser 2660 0.0 0.0 23200 1212 ? S<s 12:52 0:00 rpc.statd -n
>>>>> stor.v.c.XX.XX -H /etc/ctdb/statd-cal
>>>>>
>>>>> Log of startup of s3 before s2 joins cluster:
>>>>>
>>>>> 2012/02/01 12:52:01.738705 [ 1871]: Starting CTDBD as pid : 1871
>>>>> 2012/02/01 12:52:01.851167 [ 1871]: Set EVENTSCRIPTTIMEOUTCOUNT to 44
>>>>> 2012/02/01 12:52:02.845493 [ 1871]: Freeze priority 1
>>>>> 2012/02/01 12:52:02.845530 [ 1871]: Freeze priority 2
>>>>> 2012/02/01 12:52:02.845552 [ 1871]: Freeze priority 3
>>>>> 2012/02/01 12:52:06.847557 [ 1921]: Taking out recovery lock from
>>>>> recovery daemon
>>>>> 2012/02/01 12:52:06.847591 [ 1921]: Take the recovery lock
>>>>> 2012/02/01 12:52:06.855576 [ 1921]: Recovery lock taken successfully
>>>>> 2012/02/01 12:52:06.855641 [ 1921]: Recovery lock taken successfully by
>>>>> recovery daemon
>>>>> 2012/02/01 12:52:06.855674 [ 1871]: Freeze priority 1
>>>>> 2012/02/01 12:52:06.855708 [ 1871]: Freeze priority 2
>>>>> 2012/02/01 12:52:06.855729 [ 1871]: Freeze priority 3
>>>>> 2012/02/01 12:52:06.938634 [ 1871]: Thawing priority 1
>>>>> 2012/02/01 12:52:06.938650 [ 1871]: Release freeze handler for prio 1
>>>>> 2012/02/01 12:52:06.938661 [ 1871]: Thawing priority 2
>>>>> 2012/02/01 12:52:06.938667 [ 1871]: Release freeze handler for prio 2
>>>>> 2012/02/01 12:52:06.938674 [ 1871]: Thawing priority 3
>>>>> 2012/02/01 12:52:06.938678 [ 1871]: Release freeze handler for prio 3
>>>>> 2012/02/01 12:52:07.023948 [ 1921]: Resetting ban count to 0 for all
>>>>> nodes
>>>>> 2012/02/01 12:52:18.025556 [ 1921]: Trigger takeoverrun
>>>>> 2012/02/01 12:52:22.034612 [ 1921]: Trigger takeoverrun
>>>>> 2012/02/01 12:52:22.174894 [ 1871]: Starting NFS statd: [ OK ]
>>>>> 2012/02/01 12:52:22.234749 [ 1871]: Sending NFS tickle ack for
>>>>> 10.31.0.43 to 10.31.0.21:986
>>>>> 2012/02/01 12:52:22.238716 [ 1871]: Sending NFS tickle ack for
>>>>> 10.31.0.43 to 10.31.0.24:1009
>>>>> 2012/02/01 12:52:22.248584 [ 1871]: Starting NFS services: [ OK ]
>>>>> 2012/02/01 12:52:22.257733 [ 1871]: Starting NFS quotas: [ OK ]
>>>>> 2012/02/01 12:52:22.268410 [ 1871]: Starting NFS daemon: [ OK ]
>>>>> 2012/02/01 12:52:22.273655 [ 1871]: Starting NFS mountd: [ OK ]
>>>>> 2012/02/01 12:52:22.308992 [ 1871]: Starting RPC idmapd: [ OK ]
>>>>> 2012/02/01 12:52:23.259697 [ 1921]: Trigger takeoverrun
>>>>> 2012/02/01 12:52:27.448730 [ 1871]: Node became HEALTHY. Ask recovery
>>>>> master 1 to perform ip reallocation
>>>>> 2012/02/01 12:55:13.949859 [ 1871]: Event script timed out : 60.nfs
>>>>> monitor count : 0 pid : 3767
>>>>> 2012/02/01 12:55:13.949971 [ 1871]: server/eventscript.c:539 Sending
>>>>> SIGTERM to child pid:3767
>>>>> 2012/02/01 12:55:13.950043 [ 3767]: Timed out running script
>>>>> '/etc/ctdb/events.d/60.nfs monitor ' after 29.9 seconds pid :3767
>>>>> 2012/02/01 12:55:13.984596 [ 3767]: Logged timedout eventscript : {
>>>>> pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; }
>>>>> >/tmp/ctdb.event.20120201125513.3767
>>>>>
>>>>> After S2 joins cluster:
>>>>> 2012/02/01 12:56:25.697150 [ 1921]: Taking out recovery lock from
>>>>> recovery daemon
>>>>> 2012/02/01 12:56:25.697196 [ 1921]: Take the recovery lock
>>>>> 2012/02/01 12:56:25.699737 [ 1921]: Recovery lock taken successfully
>>>>> 2012/02/01 12:56:25.699774 [ 1921]: Recovery lock taken successfully by
>>>>> recovery daemon
>>>>> 2012/02/01 12:56:25.700013 [ 1871]: Freeze priority 1
>>>>> 2012/02/01 12:56:25.700182 [ 1871]: Freeze priority 2
>>>>> 2012/02/01 12:56:25.700310 [ 1871]: Freeze priority 3
>>>>> 2012/02/01 12:56:25.784428 [ 1871]: Thawing priority 1
>>>>> 2012/02/01 12:56:25.784450 [ 1871]: Release freeze handler for prio 1
>>>>> 2012/02/01 12:56:25.784466 [ 1871]: Thawing priority 2
>>>>> 2012/02/01 12:56:25.784473 [ 1871]: Release freeze handler for prio 2
>>>>> 2012/02/01 12:56:25.784483 [ 1871]: Thawing priority 3
>>>>> 2012/02/01 12:56:25.784490 [ 1871]: Release freeze handler for prio 3
>>>>> 2012/02/01 12:56:25.874375 [ 1921]: Resetting ban count to 0 for all
>>>>> nodes
>>>>> 2012/02/01 12:56:36.878858 [ 1921]: server/ctdb_recoverd.c:3269 Remote
>>>>> node:0 has different flags for node 1. It has 0x02 vs our 0x02
>>>>> 2012/02/01 12:56:36.878890 [ 1921]: Use flags 0x00 from local recmaster
>>>>> node for cluster update of node 1 flags
>>>>> 2012/02/01 12:56:36.879144 [ 1921]: Taking out recovery lock from
>>>>> recovery daemon
>>>>> 2012/02/01 12:56:36.879161 [ 1921]: Take the recovery lock
>>>>> 2012/02/01 12:56:36.880979 [ 1921]: Recovery lock taken successfully
>>>>> 2012/02/01 12:56:36.881031 [ 1921]: Recovery lock taken successfully by
>>>>> recovery daemon
>>>>> 2012/02/01 12:56:36.881272 [ 1871]: Freeze priority 1
>>>>> 2012/02/01 12:56:36.881549 [ 1871]: Freeze priority 2
>>>>> 2012/02/01 12:56:36.881782 [ 1871]: Freeze priority 3
>>>>> 2012/02/01 12:56:36.967153 [ 1871]: Thawing priority 1
>>>>> 2012/02/01 12:56:36.967178 [ 1871]: Release freeze handler for prio 1
>>>>> 2012/02/01 12:56:36.967193 [ 1871]: Thawing priority 2
>>>>> 2012/02/01 12:56:36.967199 [ 1871]: Release freeze handler for prio 2
>>>>> 2012/02/01 12:56:36.967207 [ 1871]: Thawing priority 3
>>>>> 2012/02/01 12:56:36.967214 [ 1871]: Release freeze handler for prio 3
>>>>> 2012/02/01 12:56:37.052788 [ 1921]: Resetting ban count to 0 for all
>>>>> nodes
>>>>> 2012/02/01 12:56:59.353112 [ 1921]: Trigger takeoverrun
>>>>>
>>>>> s2 startup:
>>>>> 2012/02/01 12:51:01.722577 [ 1854]: Starting CTDBD as pid : 1854
>>>>> 2012/02/01 12:51:01.833680 [ 1854]: Set EVENTSCRIPTTIMEOUTCOUNT to 44
>>>>> 2012/02/01 12:51:06.559672 [ 1854]: Freeze priority 1
>>>>> 2012/02/01 12:51:06.559836 [ 1854]: Freeze priority 2
>>>>> 2012/02/01 12:51:06.559965 [ 1854]: Freeze priority 3
>>>>> 2012/02/01 12:51:06.644067 [ 1854]: Thawing priority 1
>>>>> 2012/02/01 12:51:06.644108 [ 1854]: Release freeze handler for prio 1
>>>>> 2012/02/01 12:51:06.644120 [ 1854]: Thawing priority 2
>>>>> 2012/02/01 12:51:06.644125 [ 1854]: Release freeze handler for prio 2
>>>>> 2012/02/01 12:51:06.644135 [ 1854]: Thawing priority 3
>>>>> 2012/02/01 12:51:06.644209 [ 1854]: Release freeze handler for prio 3
>>>>> 2012/02/01 12:51:06.831875 [ 1904]: Trigger takeoverrun
>>>>> 2012/02/01 12:51:17.736531 [ 1854]: server/ctdb_monitor.c:472 Node 1
>>>>> became healthy - force recovery for startup
>>>>> 2012/02/01 12:51:17.738801 [ 1854]: Freeze priority 1
>>>>> 2012/02/01 12:51:17.739081 [ 1854]: Freeze priority 2
>>>>> 2012/02/01 12:51:17.739313 [ 1854]: Freeze priority 3
>>>>> 2012/02/01 12:51:17.824675 [ 1854]: Thawing priority 1
>>>>> 2012/02/01 12:51:17.824724 [ 1854]: Release freeze handler for prio 1
>>>>> 2012/02/01 12:51:17.824735 [ 1854]: Thawing priority 2
>>>>> 2012/02/01 12:51:17.824740 [ 1854]: Release freeze handler for prio 2
>>>>> 2012/02/01 12:51:17.824747 [ 1854]: Thawing priority 3
>>>>> 2012/02/01 12:51:17.824752 [ 1854]: Release freeze handler for prio 3
>>>>> 2012/02/01 12:51:32.881409 [ 1904]: Trigger takeoverrun
>>>>> 2012/02/01 12:51:33.150668 [ 1854]: Starting NFS statd: [ OK ]
>>>>> 2012/02/01 12:51:33.221791 [ 1854]: Starting NFS services: [ OK ]
>>>>> 2012/02/01 12:51:33.226639 [ 1854]: Starting NFS quotas: [ OK ]
>>>>> 2012/02/01 12:51:33.244140 [ 1854]: Starting NFS daemon: [ OK ]
>>>>> 2012/02/01 12:51:33.249638 [ 1854]: Starting NFS mountd: [ OK ]
>>>>> 2012/02/01 12:51:33.285769 [ 1854]: Starting RPC idmapd: [ OK ]
>>>>> 2012/02/01 12:51:38.535450 [ 1854]: Node became HEALTHY. Ask recovery
>>>>> master 1 to perform ip reallocation
>>>>> 2012/02/01 12:51:39.896648 [ 1904]: Trigger takeoverrun
>>>>>
>>>>>
>>>>> RPCinfo:
>>>>> [root at s3 ~]# rpcinfo -u localhost 100024 1
>>>>> program 100024 version 1 ready and waiting
>>>>> [root at s2 ~]# rpcinfo -u localhost 100024 1
>>>>> program 100024 version 1 ready and waiting
>>>>>
>>>>> Also IPs are up:
>>>>>
>>>>> [root at s3 ~]# ip a l | grep 10.31.
>>>>> inet 10.31.0.42/24 brd 10.31.0.255 scope global bond0
>>>>> inet 10.31.0.41/24 brd 10.31.0.255 scope global secondary bond0
>>>>> [root at s2 ~]# ip a l | grep 10.31.
>>>>> inet 10.31.0.40/8 brd 10.255.255.255 scope global bond0
>>>>> inet 10.31.0.43/24 brd 10.31.0.255 scope global bond0
>>>>>
>>>>>
>>>>> Some cluster status:
>>>>> [root at s2 ~]# ctdb status
>>>>> Number of nodes:2
>>>>> pnn:0 10.31.0.40 OK (THIS NODE)
>>>>> pnn:1 10.31.0.42 OK
>>>>> Generation:1286608009
>>>>> Size:2
>>>>> hash:0 lmaster:0
>>>>> hash:1 lmaster:1
>>>>> Recovery mode:NORMAL (0)
>>>>> Recovery master:1
>>>>>
>>>>>
>>>>> Services started automagically:
>>>>> [root at s2 ~]# chkconfig --list
>>>>> auditd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>>> crond 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>>> glusterd 0:off 1:off 2:off 3:on 4:off 5:on 6:off
>>>>> ip6tables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>>> netconsole 0:off 1:off 2:off 3:off 4:off 5:off 6:off
>>>>> netfs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
>>>>> network 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>>> postfix 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>>> rdisc 0:off 1:off 2:off 3:off 4:off 5:off 6:off
>>>>> restorecond 0:off 1:off 2:off 3:off 4:off 5:off 6:off
>>>>> rpcbind 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>>> rsyslog 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>>> saslauthd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
>>>>> sshd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>>> udev-post 0:off 1:on 2:on 3:on 4:on 5:on 6:off
>>>>>
>>>>> Also cron check for ctdb:
>>>>> [root at s3 ~]# crontab -l
>>>>> * * * * * /etc/init.d/ctdb cron > /dev/null 2>&1
>>>>>
>>>>>
>>>>> It seems to me that there are multiple errors in the events.d scripts,
>>>>> like
>>>>> not detecting an already running rpc.statd, and similar.
>>>>>
>>>>> Thank you for your help.
>>>>> All suggestions are welcome, even for paid support. ;-)
>>>>> Regards,
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland,
>>>> with registration number SC005336.

