CTDB fail-over fails, NFS only, glusterfs backend, for XenServer VMs

Martin Gombac martin at isg.si
Thu Feb 2 08:26:10 MST 2012


Just an update.
The volume used for locks took more than 10 minutes to become available 
again with gluster as the back-end. This is why NFS was not responding 
and why ctdb marked itself unhealthy.

[root at s3 bricks]# time df -m
Filesystem           1M-blocks      Used Available Use% Mounted on
s3.c.XX.XX:/lockers   1906819      7282   1899538   1% /mnt/lockers
real	11m58.676s
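
The recovery lock file lives on this same volume 
(/mnt/lockers/lock/lock_file, per the config further down), so the 
01.reclock failures in the log presumably reflect the same stall; e.g.:

time stat /mnt/lockers/lock/lock_file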

I would also like to note that the maximum value ctdb actually applies 
for EventScriptTimeout IS 100 seconds, even if you set it higher, e.g. 
to 300 seconds. I wish it could be set higher, so we could use our dodgy 
gluster fs. :-)
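
One way to verify the value the daemon actually applies (a sketch using 
ctdb's setvar/getvar commands):

ctdb setvar EventScriptTimeout 300    # request a higher timeout
ctdb getvar EventScriptTimeout        # read back the effective value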

Regards,
Martin Gombač
ISG d.o.o.
00386 (0)1 620 75 03

On 2/2/12 2:36 PM, Martin Gombac wrote:
> Hi Ronnie,
>
> thank you for your explanation. It was exactly what I needed. GlusterFS
> takes some time to recover when I kill one node. I will try to increase
> the timeout values for ctdb and hopefully it will not mark itself as
> unhealthy.
>
> What scalable, redundant FS would you suggest besides gluster? By
> scalable I mean it can extend over many network-attached nodes, and by
> redundant I mean it can survive the loss of a node or a spindle.
>
> Thank you.
> Martin
>
> On 2/2/12 10:26 AM, ronnie sahlberg wrote:
>> 2012/02/01 15:09:01.290595 [ 1933]: rpcinfo: RPC: Timed out
>> 2012/02/01 15:09:01.290757 [ 1933]: ERROR: NFS not responding to rpc
>> requests
>> 2012/02/01 15:09:01.290968 [ 1933]: Node became UNHEALTHY. Ask
>> recovery master 0 to perform ip reallocation
>> 2012/02/01 15:09:06.309049 [ 1933]: ERROR: more than 3 consecutive
>> failures for 01.reclock, marking cluster unhealthy
>> 2012/02/01 15:09:16.326371 [ 1933]: ERROR: more than 3 consecutive
>> failures for 01.reclock, marking cluster unhealthy
>>
>> Looks like your underlying filesystem has died.
>> You took out one node and your filesystem hung, causing knfsd to
>> hang inside the kernel.
>>
>> The hung knfsd caused the first two warning lines.
>>
>> The last two log lines are from ctdb itself and also indicate that
>> the filesystem used for the recovery lock file has hung.
>>
>>
>> Nothing ctdb can do here when your filesystem is dodgy.
>> Try a different filesystem. Other filesystems might work better.
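>>
>> A minimal way to compare candidates (a sketch, assuming /mnt/lockers is
>> the shared mount) is to time small synchronous writes while you take a
>> node down:
>>
>> # report how long each 4k fsync'd write blocks, every 5 seconds
>> while true; do
>>     t0=$(date +%s)
>>     dd if=/dev/zero of=/mnt/lockers/.probe bs=4k count=1 conv=fsync 2>/dev/null
>>     echo "$(date '+%H:%M:%S') write took $(( $(date +%s) - t0 ))s"
>>     sleep 5
>> done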
>>
>>
>> regards
>> ronnie sahlberg
>>
>>
>> On Thu, Feb 2, 2012 at 8:15 PM, Orlando Richards
>> <orlando.richards at ed.ac.uk> wrote:
>>> Hi Martin,
>>>
>>>
>>>
>>> On -10/01/37 20:59, Martin Gombac wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I have a problem with my 2-node CTDB NFS cluster when one node takes
>>>> over the other's resources.
>>>>
>>>> Only NFS HA, 2 nodes for now, will grow later.
>>>> CentOS 6.2: 2.6.32-220.4.1.el6.x86_64
>>>> CTDB: 1.0.114.3
>>>> Gluster 3.2.5 for shared storage.
>>>> XenServer 5.6 SP2 for NFS client (= 5.*).
>>>>
>>>>
>>>> Once both nodes are up, the NFS service works through either of the
>>>> two nodes concurrently (round-robin DNS). But once I shut down one NFS
>>>> storage node, everything fails. :-)
>>>
>>>
>>> When you say "shut down", what do you mean? Do you mean shutting down
>>> ctdb
>>> (service ctdb stop), or doing a shutdown of the node?
>>>
>>> The reason I ask is that the event script timeout below suggests to me
>>> that the filesystem has paused (possibly because shutting down one of
>>> the file system servers sent it into recovery mode), and that the NFS
>>> script which writes the tickle info to the filesystem is then timing out.
>>> We see
>>> this even with GPFS under certain circumstances. Some of the GPFS
>>> timeouts
>>> are set to 60 seconds, so as a workaround I've done:
>>>
>>> ctdb setvar EventScriptTimeout 90
>>>
>>> and added:
>>> CTDB_SET_EventScriptTimeout=90
>>> to /etc/sysconfig/ctdb
>>>
>>> (changing from the default of 30). We've only just started noticing this
>>> though (previously, we didn't manage NFS with CTDB), so I don't know
>>> if this
>>> tweak will work or not, or what the repercussions of such a large
>>> timeout
>>> will be!
>>>
>>> Hope that helps,
>>>
>>> --
>>> Orlando
>>>
>>>
>>>> Here is what the log says when shutting down node s3 and keeping s2 up.
>>>>
>>>>
>>>> 2012/02/01 15:08:03.219127 [ 1983]: server/ctdb_recoverd.c:3295 The
>>>> vnnmap count is different from the number of active nodes. 2 vs 1
>>>> 2012/02/01 15:08:03.219159 [ 1983]: Taking out recovery lock from
>>>> recovery daemon
>>>> 2012/02/01 15:08:03.219166 [ 1983]: Take the recovery lock
>>>> 2012/02/01 15:08:03.223519 [ 1983]: Recovery lock taken successfully
>>>> 2012/02/01 15:08:03.223552 [ 1983]: Recovery lock taken successfully by
>>>> recovery daemon
>>>> 2012/02/01 15:08:03.223610 [ 1933]: Freeze priority 1
>>>> 2012/02/01 15:08:03.223735 [ 1933]: Freeze priority 2
>>>> 2012/02/01 15:08:03.223836 [ 1933]: Freeze priority 3
>>>> 2012/02/01 15:08:03.307360 [ 1933]: Thawing priority 1
>>>> 2012/02/01 15:08:03.307379 [ 1933]: Release freeze handler for prio 1
>>>> 2012/02/01 15:08:03.307389 [ 1933]: Thawing priority 2
>>>> 2012/02/01 15:08:03.307394 [ 1933]: Release freeze handler for prio 2
>>>> 2012/02/01 15:08:03.307401 [ 1933]: Thawing priority 3
>>>> 2012/02/01 15:08:03.307406 [ 1933]: Release freeze handler for prio 3
>>>> 2012/02/01 15:08:03.402667 [ 1933]: Sending NFS tickle ack for
>>>> 10.31.0.41 to 10.31.0.21:980
>>>> 2012/02/01 15:08:03.405015 [ 1933]: Sending NFS tickle ack for
>>>> 10.31.0.41 to 10.31.0.22:1018
>>>> 2012/02/01 15:08:03.506316 [ 1983]: Resetting ban count to 0 for all
>>>> nodes
>>>> 2012/02/01 15:08:14.508640 [ 1983]: Trigger takeoverrun
>>>> 2012/02/01 15:08:16.267190 [ 1933]: rpcinfo: RPC: Program not
>>>> registered
>>>> 2012/02/01 15:08:16.272316 [ 1933]: ERROR: STATD is not responding.
>>>> Trying to restart it. [rpc.statd -n stor.v.c.XX.XX -H
>>>> /etc/ctdb/statd-callout -p 97 -p 595 -o 596]
>>>> 2012/02/01 15:08:46.177912 [ 1933]: Event script timed out : 60.nfs
>>>> monitor count : 0 pid : 14718
>>>> 2012/02/01 15:08:46.177957 [ 1933]: server/eventscript.c:539 Sending
>>>> SIGTERM to child pid:14718
>>>> 2012/02/01 15:08:46.178012 [14718]: Timed out running script
>>>> '/etc/ctdb/events.d/60.nfs monitor ' after 29.9 seconds pid :14718
>>>> 2012/02/01 15:08:46.193493 [14718]: Logged timedout eventscript : {
>>>> pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; }
>>>> >/tmp/ctdb.event.20120201150846.14718
>>>> 2012/02/01 15:09:01.290595 [ 1933]: rpcinfo: RPC: Timed out
>>>> 2012/02/01 15:09:01.290757 [ 1933]: ERROR: NFS not responding to rpc
>>>> requests
>>>> 2012/02/01 15:09:01.290968 [ 1933]: Node became UNHEALTHY. Ask recovery
>>>> master 0 to perform ip reallocation
>>>> 2012/02/01 15:09:06.309049 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:09:16.326371 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:09:31.343297 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:09:46.360558 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:10:01.377255 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:10:16.394231 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:10:31.411283 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:10:46.428437 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:11:01.445068 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:11:16.462270 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:11:31.479486 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:11:46.496276 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:12:01.512953 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:12:16.530080 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:12:31.547281 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:12:46.564218 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:13:01.580857 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:13:16.597282 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:13:31.614385 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> ....
>>>> 2012/02/01 15:22:02.194870 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:22:17.212204 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:22:32.229226 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:22:47.246081 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:23:02.262845 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:23:17.280197 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:23:32.456560 [ 1933]: Node became HEALTHY. Ask recovery
>>>> master 0 to perform ip reallocation
>>>>
>>>>
>>>>
>>>> Here is cat of that debug file:
>>>> [root at s2 ~]# cat /tmp/ctdb.event.20120201150846.14718
>>>> init(1)-+-01.reclock(14666)---stat(14668)
>>>> |-auditd(1439)---{auditd}(1440)
>>>> |-crond(1864)
>>>> |-ctdbd(1933)-+-ctdbd(1983)
>>>> | `-ctdbd(14718)-+-60.nfs(14719)-+-60.nfs(14731)
>>>> | | |-exportfs(14727)
>>>> | | |-grep(14728)
>>>> | | |-grep(14729)
>>>> | | `-sed(14730)
>>>> | `-sh(14762)---pstree(14763)
>>>> |-glusterd(1482)-+-{glusterd}(1483)
>>>> | `-{glusterd}(1617)
>>>> |-glusterfs(1591)-+-{glusterfs}(1594)
>>>> | |-{glusterfs}(1595)
>>>> | `-{glusterfs}(1894)
>>>> |-glusterfs(1760)-+-{glusterfs}(1761)
>>>> | |-{glusterfs}(1763)
>>>> | `-{glusterfs}(1895)
>>>> |-glusterfsd(1619)-+-{glusterfsd}(1620)
>>>> | |-{glusterfsd}(1622)
>>>> | |-{glusterfsd}(1670)
>>>> | `-{glusterfsd}(1686)
>>>> |-glusterfsd(1623)-+-{glusterfsd}(1624)
>>>> | |-{glusterfsd}(1626)
>>>> | |-{glusterfsd}(1677)
>>>> | `-{glusterfsd}(1689)
>>>> |-glusterfsd(1627)-+-{glusterfsd}(1628)
>>>> | |-{glusterfsd}(1630)
>>>> | |-{glusterfsd}(1678)
>>>> | `-{glusterfsd}(1691)
>>>> |-glusterfsd(1631)-+-{glusterfsd}(1632)
>>>> | |-{glusterfsd}(1634)
>>>> | |-{glusterfsd}(1672)
>>>> | `-{glusterfsd}(1683)
>>>> |-glusterfsd(1635)-+-{glusterfsd}(1636)
>>>> | |-{glusterfsd}(1638)
>>>> | |-{glusterfsd}(1669)
>>>> | |-{glusterfsd}(1682)
>>>> | `-{glusterfsd}(25543)
>>>> |-glusterfsd(1639)-+-{glusterfsd}(1640)
>>>> | |-{glusterfsd}(1642)
>>>> | |-{glusterfsd}(1674)
>>>> | |-{glusterfsd}(1681)
>>>> | |-{glusterfsd}(3711)
>>>> | |-{glusterfsd}(5763)
>>>> | `-{glusterfsd}(19444)
>>>> |-glusterfsd(1643)-+-{glusterfsd}(1644)
>>>> | |-{glusterfsd}(1646)
>>>> | |-{glusterfsd}(1671)
>>>> | |-{glusterfsd}(1685)
>>>> | `-{glusterfsd}(22163)
>>>> |-glusterfsd(1647)-+-{glusterfsd}(1648)
>>>> | |-{glusterfsd}(1650)
>>>> | |-{glusterfsd}(1673)
>>>> | |-{glusterfsd}(1688)
>>>> | `-{glusterfsd}(22506)
>>>> |-glusterfsd(1651)-+-{glusterfsd}(1652)
>>>> | |-{glusterfsd}(1654)
>>>> | |-{glusterfsd}(1680)
>>>> | `-{glusterfsd}(1684)
>>>> |-glusterfsd(1655)-+-{glusterfsd}(1656)
>>>> | |-{glusterfsd}(1658)
>>>> | |-{glusterfsd}(1675)
>>>> | `-{glusterfsd}(1687)
>>>> |-glusterfsd(1659)-+-{glusterfsd}(1660)
>>>> | |-{glusterfsd}(1662)
>>>> | |-{glusterfsd}(1679)
>>>> | `-{glusterfsd}(1692)
>>>> |-glusterfsd(1663)-+-{glusterfsd}(1664)
>>>> | |-{glusterfsd}(1666)
>>>> | |-{glusterfsd}(1676)
>>>> | `-{glusterfsd}(1690)
>>>> |-master(1854)-+-pickup(14136)
>>>> | `-qmgr(1863)
>>>> |-mingetty(1877)
>>>> |-mingetty(1879)
>>>> |-mingetty(1881)
>>>> |-mingetty(1883)
>>>> |-mingetty(1885)
>>>> |-mingetty(1887)
>>>> |-rpc.idmapd(2437)
>>>> |-rpc.mountd(2415)
>>>> |-rpc.rquotad(2399)
>>>> |-rpc.statd(14722)
>>>> |-rpcbind(1473)
>>>> |-rsyslogd(1455)-+-{rsyslogd}(1457)
>>>> | |-{rsyslogd}(1458)
>>>> | `-{rsyslogd}(14015)
>>>> |-sshd(1778)---sshd(1896)---bash(1900)---tail(14653)
>>>> |-statd-callout(14604)
>>>> `-udevd(514)-+-udevd(1893)
>>>> `-udevd(14725)
>>>> 1: POSIX ADVISORY READ 14727 08:43:1441821 0 EOF
>>>> 2: FLOCK ADVISORY WRITE 1854 08:43:1442827 0 EOF
>>>> 3: FLOCK ADVISORY WRITE 1854 08:43:1442762 0 EOF
>>>> 4: POSIX ADVISORY WRITE 1663 08:43:262794 0 EOF
>>>> 5: POSIX ADVISORY WRITE 1659 08:43:262791 0 EOF
>>>> 6: POSIX ADVISORY WRITE 1655 08:43:262772 0 EOF
>>>> 7: POSIX ADVISORY WRITE 1651 08:43:262770 0 EOF
>>>> 8: POSIX ADVISORY WRITE 1647 08:43:262769 0 EOF
>>>> 9: POSIX ADVISORY WRITE 1643 08:43:262768 0 EOF
>>>> 10: POSIX ADVISORY WRITE 1639 08:43:262767 0 EOF
>>>> 11: POSIX ADVISORY WRITE 1635 08:43:262766 0 EOF
>>>> 12: POSIX ADVISORY WRITE 1631 08:43:262765 0 EOF
>>>> 13: POSIX ADVISORY WRITE 1627 08:43:262764 0 EOF
>>>> 14: POSIX ADVISORY WRITE 1623 08:43:262602 0 EOF
>>>> 15: POSIX ADVISORY WRITE 1619 08:43:262598 0 EOF
>>>> 16: FLOCK ADVISORY WRITE 1472 08:43:1441940 0 EOF
>>>> /var/ctdb/:
>>>> total 8
>>>> 1442877 drwxr-xr-x. 2 root root 4096 Jan 31 15:13 persistent
>>>> 1442880 drwx------. 2 root root 4096 Jan 31 15:13 state
>>>> /var/ctdb/persistent:
>>>> total 0
>>>> [root at s2 ~]#
>>>>
>>>>
>>>> Additional cluster info follows.
>>>> Let me know if you require anything else.
>>>>
>>>>
>>>> A test NFS mount from a client works:
>>>> mount -t nfs stor.v.c.XX.XX:/mnt/primary_dr test_nfs/
>>>> [root at mc1 mnt]# df -m
>>>> Filesystem 1M-blocks Used Available Use% Mounted on
>>>> /dev/xvda 100791 1747 93924 2% /
>>>> tmpfs 1000 0 1000 0% /dev/shm
>>>> stor.v.c.XX.XX:/mnt/primary_dr
>>>> 1906819 2507 1904312 1% /mnt/test_nfs
>>>> It also works when added to a XenServer pool.
>>>>
>>>> I have two gluster volumes mounted locally on each node for re-export
>>>> via NFS:
>>>> /mnt/primary_dr
>>>> /mnt/lockers
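>>>>
>>>> Judging from the df output above, these are native glusterfs client
>>>> mounts, i.e. something like the following (the primary_dr volume name
>>>> is assumed):
>>>> mount -t glusterfs s3.c.XX.XX:/lockers /mnt/lockers
>>>> mount -t glusterfs s3.c.XX.XX:/primary_dr /mnt/primary_dr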
>>>>
>>>> Firewalls currently disabled.
>>>>
>>>> Relevant options from ctdb (the same on both nodes s2 & s3):
>>>> CTDB_RECOVERY_LOCK="/mnt/lockers/lock/lock_file"
>>>>
>>>> [root at s2 ~]# ls -la /mnt/lockers/lock/lock_file
>>>> -rw-r--r--. 1 root root 0 Jan 31 15:30 /mnt/lockers/lock/lock_file
>>>> CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
>>>>
>>>> [root at s2 ~]# cat /etc/ctdb/public_addresses
>>>> 10.31.0.41/24
>>>> 10.31.0.43/24
>>>> CTDB_PUBLIC_INTERFACE=bond0
>>>>
>>>> # default is to not manage Samba
>>>> # CTDB_MANAGES_SAMBA=yes
>>>> # CTDB_NODES=/etc/ctdb/nodes
>>>>
>>>> [root at s2 ~]# cat /etc/ctdb/nodes
>>>> 10.31.0.40
>>>> 10.31.0.42
>>>> # added for a fix, which didn't work:
>>>> CTDB_SET_EVENTSCRIPTTIMEOUTCOUNT=44
>>>>
>>>>
>>>> RR DNS:
>>>> [root at s2 ~]# host stor.v.c.XX.XX
>>>> stor.v.c.XX.XX has address 10.31.0.43
>>>> stor.v.c.XX.XX has address 10.31.0.41
>>>>
>>>> Exports:
>>>> [root at s2 ~]# cat /etc/exports
>>>> /mnt/primary_dr 10.31.0.0/24(rw,no_root_squash,fsid=async)
>>>>
>>>> Relevant options from nfs:
>>>> CTDB_MANAGES_NFS=yes
>>>> NFS_TICKLE_SHARED_DIRECTORY=/mnt/lockers/tickles
>>>>
>>>> [root at s2 ~]# ls -la /mnt/lockers/tickles
>>>> drwxr-xr-x. 2 root root 41 Jan 31 21:21 s2.c.XX.XX
>>>> drwxr-xr-x. 2 root root 24 Feb 1 11:11 s3.c.XX.XX
>>>> STATD_SHARED_DIRECTORY=/mnt/lockers/statd
>>>>
>>>> [root at s2 ~]# ls -la /mnt/lockers/statd
>>>> -rw-r--r--. 1 root root 4 Jan 31 21:31 state
>>>> NFS_HOSTNAME="stor.v.c.XX.XX"
>>>> STATD_HOSTNAME="$NFS_HOSTNAME -H /etc/ctdb/statd-callout -p 97"
>>>> #STATD_HOSTNAME="$NFS_HOSTNAME -P "$STATD_SHARED_DIRECTORY/$PUBLIC_IP" -H
>>>> /etc/ctdb/statd-callout -p 97"
>>>> RPCNFSDARGS="--no-nfs-version 4"
>>>> RQUOTAD_PORT=598
>>>> LOCKD_TCPPORT=599
>>>> LOCKD_UDPPORT=599
>>>> MOUNTD_PORT=597
>>>> STATD_PORT=595
>>>> STATD_OUTGOING_PORT=596
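>>>>
>>>> A quick check that the pinned ports took effect (assuming the 595-599
>>>> values above; the rpcinfo header line drops out of the numeric match):
>>>> rpcinfo -p localhost | awk '$4 >= 595 && $4 <= 599'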
>>>>
>>>>
>>>>
>>>> Processes running after reboot:
>>>> root 1871 0.0 0.0 8864 2620 ? S<Ls 12:52 0:00 /usr/sbin/ctdbd
>>>> --reclock=/mnt/lockers/lock/lock_fi
>>>>
>>>> root 1921 0.0 0.0 8760 680 ? S< 12:52 0:00 \_ /usr/sbin/ctdbd
>>>> --reclock=/mnt/lockers/lock/loc
>>>>
>>>> root 2444 0.0 0.0 105256 480 ? S<s 12:52 0:00 rpc.rquotad -p 598
>>>> root 2460 0.0 0.0 21348 944 ? S<s 12:52 0:00 rpc.mountd -p 597
>>>> root 2482 0.0 0.0 27436 588 ? S<s 12:52 0:00 rpc.idmapd
>>>> rpcuser 2660 0.0 0.0 23200 1212 ? S<s 12:52 0:00 rpc.statd -n
>>>> stor.v.c.XX.XX -H /etc/ctdb/statd-cal
>>>>
>>>> Log of s3's startup, before s2 joins the cluster:
>>>>
>>>> 2012/02/01 12:52:01.738705 [ 1871]: Starting CTDBD as pid : 1871
>>>> 2012/02/01 12:52:01.851167 [ 1871]: Set EVENTSCRIPTTIMEOUTCOUNT to 44
>>>> 2012/02/01 12:52:02.845493 [ 1871]: Freeze priority 1
>>>> 2012/02/01 12:52:02.845530 [ 1871]: Freeze priority 2
>>>> 2012/02/01 12:52:02.845552 [ 1871]: Freeze priority 3
>>>> 2012/02/01 12:52:06.847557 [ 1921]: Taking out recovery lock from
>>>> recovery daemon
>>>> 2012/02/01 12:52:06.847591 [ 1921]: Take the recovery lock
>>>> 2012/02/01 12:52:06.855576 [ 1921]: Recovery lock taken successfully
>>>> 2012/02/01 12:52:06.855641 [ 1921]: Recovery lock taken successfully by
>>>> recovery daemon
>>>> 2012/02/01 12:52:06.855674 [ 1871]: Freeze priority 1
>>>> 2012/02/01 12:52:06.855708 [ 1871]: Freeze priority 2
>>>> 2012/02/01 12:52:06.855729 [ 1871]: Freeze priority 3
>>>> 2012/02/01 12:52:06.938634 [ 1871]: Thawing priority 1
>>>> 2012/02/01 12:52:06.938650 [ 1871]: Release freeze handler for prio 1
>>>> 2012/02/01 12:52:06.938661 [ 1871]: Thawing priority 2
>>>> 2012/02/01 12:52:06.938667 [ 1871]: Release freeze handler for prio 2
>>>> 2012/02/01 12:52:06.938674 [ 1871]: Thawing priority 3
>>>> 2012/02/01 12:52:06.938678 [ 1871]: Release freeze handler for prio 3
>>>> 2012/02/01 12:52:07.023948 [ 1921]: Resetting ban count to 0 for all
>>>> nodes
>>>> 2012/02/01 12:52:18.025556 [ 1921]: Trigger takeoverrun
>>>> 2012/02/01 12:52:22.034612 [ 1921]: Trigger takeoverrun
>>>> 2012/02/01 12:52:22.174894 [ 1871]: Starting NFS statd: [ OK ]
>>>> 2012/02/01 12:52:22.234749 [ 1871]: Sending NFS tickle ack for
>>>> 10.31.0.43 to 10.31.0.21:986
>>>> 2012/02/01 12:52:22.238716 [ 1871]: Sending NFS tickle ack for
>>>> 10.31.0.43 to 10.31.0.24:1009
>>>> 2012/02/01 12:52:22.248584 [ 1871]: Starting NFS services: [ OK ]
>>>> 2012/02/01 12:52:22.257733 [ 1871]: Starting NFS quotas: [ OK ]
>>>> 2012/02/01 12:52:22.268410 [ 1871]: Starting NFS daemon: [ OK ]
>>>> 2012/02/01 12:52:22.273655 [ 1871]: Starting NFS mountd: [ OK ]
>>>> 2012/02/01 12:52:22.308992 [ 1871]: Starting RPC idmapd: [ OK ]
>>>> 2012/02/01 12:52:23.259697 [ 1921]: Trigger takeoverrun
>>>> 2012/02/01 12:52:27.448730 [ 1871]: Node became HEALTHY. Ask recovery
>>>> master 1 to perform ip reallocation
>>>> 2012/02/01 12:55:13.949859 [ 1871]: Event script timed out : 60.nfs
>>>> monitor count : 0 pid : 3767
>>>> 2012/02/01 12:55:13.949971 [ 1871]: server/eventscript.c:539 Sending
>>>> SIGTERM to child pid:3767
>>>> 2012/02/01 12:55:13.950043 [ 3767]: Timed out running script
>>>> '/etc/ctdb/events.d/60.nfs monitor ' after 29.9 seconds pid :3767
>>>> 2012/02/01 12:55:13.984596 [ 3767]: Logged timedout eventscript : {
>>>> pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; }
>>>> >/tmp/ctdb.event.20120201125513.3767
>>>>
>>>> After s2 joins the cluster:
>>>> 2012/02/01 12:56:25.697150 [ 1921]: Taking out recovery lock from
>>>> recovery daemon
>>>> 2012/02/01 12:56:25.697196 [ 1921]: Take the recovery lock
>>>> 2012/02/01 12:56:25.699737 [ 1921]: Recovery lock taken successfully
>>>> 2012/02/01 12:56:25.699774 [ 1921]: Recovery lock taken successfully by
>>>> recovery daemon
>>>> 2012/02/01 12:56:25.700013 [ 1871]: Freeze priority 1
>>>> 2012/02/01 12:56:25.700182 [ 1871]: Freeze priority 2
>>>> 2012/02/01 12:56:25.700310 [ 1871]: Freeze priority 3
>>>> 2012/02/01 12:56:25.784428 [ 1871]: Thawing priority 1
>>>> 2012/02/01 12:56:25.784450 [ 1871]: Release freeze handler for prio 1
>>>> 2012/02/01 12:56:25.784466 [ 1871]: Thawing priority 2
>>>> 2012/02/01 12:56:25.784473 [ 1871]: Release freeze handler for prio 2
>>>> 2012/02/01 12:56:25.784483 [ 1871]: Thawing priority 3
>>>> 2012/02/01 12:56:25.784490 [ 1871]: Release freeze handler for prio 3
>>>> 2012/02/01 12:56:25.874375 [ 1921]: Resetting ban count to 0 for all
>>>> nodes
>>>> 2012/02/01 12:56:36.878858 [ 1921]: server/ctdb_recoverd.c:3269 Remote
>>>> node:0 has different flags for node 1. It has 0x02 vs our 0x02
>>>> 2012/02/01 12:56:36.878890 [ 1921]: Use flags 0x00 from local recmaster
>>>> node for cluster update of node 1 flags
>>>> 2012/02/01 12:56:36.879144 [ 1921]: Taking out recovery lock from
>>>> recovery daemon
>>>> 2012/02/01 12:56:36.879161 [ 1921]: Take the recovery lock
>>>> 2012/02/01 12:56:36.880979 [ 1921]: Recovery lock taken successfully
>>>> 2012/02/01 12:56:36.881031 [ 1921]: Recovery lock taken successfully by
>>>> recovery daemon
>>>> 2012/02/01 12:56:36.881272 [ 1871]: Freeze priority 1
>>>> 2012/02/01 12:56:36.881549 [ 1871]: Freeze priority 2
>>>> 2012/02/01 12:56:36.881782 [ 1871]: Freeze priority 3
>>>> 2012/02/01 12:56:36.967153 [ 1871]: Thawing priority 1
>>>> 2012/02/01 12:56:36.967178 [ 1871]: Release freeze handler for prio 1
>>>> 2012/02/01 12:56:36.967193 [ 1871]: Thawing priority 2
>>>> 2012/02/01 12:56:36.967199 [ 1871]: Release freeze handler for prio 2
>>>> 2012/02/01 12:56:36.967207 [ 1871]: Thawing priority 3
>>>> 2012/02/01 12:56:36.967214 [ 1871]: Release freeze handler for prio 3
>>>> 2012/02/01 12:56:37.052788 [ 1921]: Resetting ban count to 0 for all
>>>> nodes
>>>> 2012/02/01 12:56:59.353112 [ 1921]: Trigger takeoverrun
>>>>
>>>> s2 startup:
>>>> 2012/02/01 12:51:01.722577 [ 1854]: Starting CTDBD as pid : 1854
>>>> 2012/02/01 12:51:01.833680 [ 1854]: Set EVENTSCRIPTTIMEOUTCOUNT to 44
>>>> 2012/02/01 12:51:06.559672 [ 1854]: Freeze priority 1
>>>> 2012/02/01 12:51:06.559836 [ 1854]: Freeze priority 2
>>>> 2012/02/01 12:51:06.559965 [ 1854]: Freeze priority 3
>>>> 2012/02/01 12:51:06.644067 [ 1854]: Thawing priority 1
>>>> 2012/02/01 12:51:06.644108 [ 1854]: Release freeze handler for prio 1
>>>> 2012/02/01 12:51:06.644120 [ 1854]: Thawing priority 2
>>>> 2012/02/01 12:51:06.644125 [ 1854]: Release freeze handler for prio 2
>>>> 2012/02/01 12:51:06.644135 [ 1854]: Thawing priority 3
>>>> 2012/02/01 12:51:06.644209 [ 1854]: Release freeze handler for prio 3
>>>> 2012/02/01 12:51:06.831875 [ 1904]: Trigger takeoverrun
>>>> 2012/02/01 12:51:17.736531 [ 1854]: server/ctdb_monitor.c:472 Node 1
>>>> became healthy - force recovery for startup
>>>> 2012/02/01 12:51:17.738801 [ 1854]: Freeze priority 1
>>>> 2012/02/01 12:51:17.739081 [ 1854]: Freeze priority 2
>>>> 2012/02/01 12:51:17.739313 [ 1854]: Freeze priority 3
>>>> 2012/02/01 12:51:17.824675 [ 1854]: Thawing priority 1
>>>> 2012/02/01 12:51:17.824724 [ 1854]: Release freeze handler for prio 1
>>>> 2012/02/01 12:51:17.824735 [ 1854]: Thawing priority 2
>>>> 2012/02/01 12:51:17.824740 [ 1854]: Release freeze handler for prio 2
>>>> 2012/02/01 12:51:17.824747 [ 1854]: Thawing priority 3
>>>> 2012/02/01 12:51:17.824752 [ 1854]: Release freeze handler for prio 3
>>>> 2012/02/01 12:51:32.881409 [ 1904]: Trigger takeoverrun
>>>> 2012/02/01 12:51:33.150668 [ 1854]: Starting NFS statd: [ OK ]
>>>> 2012/02/01 12:51:33.221791 [ 1854]: Starting NFS services: [ OK ]
>>>> 2012/02/01 12:51:33.226639 [ 1854]: Starting NFS quotas: [ OK ]
>>>> 2012/02/01 12:51:33.244140 [ 1854]: Starting NFS daemon: [ OK ]
>>>> 2012/02/01 12:51:33.249638 [ 1854]: Starting NFS mountd: [ OK ]
>>>> 2012/02/01 12:51:33.285769 [ 1854]: Starting RPC idmapd: [ OK ]
>>>> 2012/02/01 12:51:38.535450 [ 1854]: Node became HEALTHY. Ask recovery
>>>> master 1 to perform ip reallocation
>>>> 2012/02/01 12:51:39.896648 [ 1904]: Trigger takeoverrun
>>>>
>>>>
>>>> RPCinfo:
>>>> [root at s3 ~]# rpcinfo -u localhost 100024 1
>>>> program 100024 version 1 ready and waiting
>>>> [root at s2 ~]# rpcinfo -u localhost 100024 1
>>>> program 100024 version 1 ready and waiting
>>>>
>>>> Also IPs are up:
>>>>
>>>> [root at s3 ~]# ip a l | grep 10.31.
>>>> inet 10.31.0.42/24 brd 10.31.0.255 scope global bond0
>>>> inet 10.31.0.41/24 brd 10.31.0.255 scope global secondary bond0
>>>> [root at s2 ~]# ip a l | grep 10.31.
>>>> inet 10.31.0.40/8 brd 10.255.255.255 scope global bond0
>>>> inet 10.31.0.43/24 brd 10.31.0.255 scope global bond0
>>>>
>>>>
>>>> Some cluster status:
>>>> [root at s2 ~]# ctdb status
>>>> Number of nodes:2
>>>> pnn:0 10.31.0.40 OK (THIS NODE)
>>>> pnn:1 10.31.0.42 OK
>>>> Generation:1286608009
>>>> Size:2
>>>> hash:0 lmaster:0
>>>> hash:1 lmaster:1
>>>> Recovery mode:NORMAL (0)
>>>> Recovery master:1
>>>>
>>>>
>>>> Services started automagically:
>>>> [root at s2 ~]# chkconfig --list
>>>> auditd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>> crond 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>> glusterd 0:off 1:off 2:off 3:on 4:off 5:on 6:off
>>>> ip6tables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>> netconsole 0:off 1:off 2:off 3:off 4:off 5:off 6:off
>>>> netfs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
>>>> network 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>> postfix 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>> rdisc 0:off 1:off 2:off 3:off 4:off 5:off 6:off
>>>> restorecond 0:off 1:off 2:off 3:off 4:off 5:off 6:off
>>>> rpcbind 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>> rsyslog 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>> saslauthd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
>>>> sshd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>>>> udev-post 0:off 1:on 2:on 3:on 4:on 5:on 6:off
>>>>
>>>> Also cron check for ctdb:
>>>> [root at s3 ~]# crontab -l
>>>> * * * * * /etc/init.d/ctdb cron > /dev/null 2>&1
>>>>
>>>>
>>>> It seems to me that there are multiple errors in the event.d scripts,
>>>> such as not detecting an already-running rpc.statd, and similar.
>>>>
>>>> Thank you for your help.
>>>> All suggestions are welcome, even for paid support. ;-)
>>>> Regards,
>>>
>>>
>>>
>>>
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland,
>>> with registration number SC005336.

