Setting up CTDB on OCFS2 and VMs ...

ronnie sahlberg ronniesahlberg at gmail.com
Tue Dec 16 12:05:27 MST 2014


On Tue, Dec 16, 2014 at 10:22 AM, Rowland Penny <repenny241155 at gmail.com> wrote:
> On 16/12/14 17:38, Stefan Kania wrote:
>>
>>
>> Hi Rowland,
>>
>>
>> Am 16.12.14 um 15:27 schrieb Rowland Penny:
>>>
>>> On 16/12/14 13:12, Stefan Kania wrote:
>>>
>>> Hi Rowland,
>>>
>>> If these addresses are supposed to be the IPs your clients use to access
>>> the cluster, you must put these IPs in your public_addresses file
>>>
>>>> OK, they are in the public_addresses file
>>>
>>> and in your DNS the hostname for your cluster should point to both
>>> addresses.
>>>
>>>> DOH! CNAME.
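(For illustration of "point to both addresses": that means round-robin
A records for the cluster name rather than a CNAME pointing at a single
node. Assuming a hypothetical cluster name "smbcluster" and the two
public IPs from this thread, the zone entries would look roughly like:

    smbcluster    IN  A   192.168.0.8
    smbcluster    IN  A   192.168.0.9

The name is only an example; the point is one A record per public
address, not a CNAME.)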
>>>
>>> Remember that you have to install "ethtool" on all nodes!
>>>
>>>> What do you mean, remember? I haven't seen it stated anywhere that you
>>>> must install ethtool. It is installed anyway :-D
>>
>> I had an error message on the node without ethtool, and the nodes were
>> unhealthy. After I installed ethtool it worked for me and the
>> error message was gone.
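(Side note on why ethtool matters, as far as I understand it: the CTDB
event script that manages the public addresses checks interface link
state with ethtool during monitoring, so without ethtool that monitoring
step fails and the node stays unhealthy. You can see what it would see
with something like:

    ethtool eth0 | grep "Link detected"

"eth0" here is only an example interface name.)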
>>>
>>> So when you start the cluster, the system will pick an IP address out
>>> of the file. If there is no public_addresses file, your system will
>>> not get any IP. If there is no ethtool but a public_addresses file
>>> exists, the node can't set any of the IPs. If one node of your cluster
>>> fails, the second node will take over the IP address from the failed
>>> host. BUT REMEMBER: you won't see the IPs with "ifconfig", you MUST
>>> use "ip a l".
>>>
>>>> That is something else that I haven't seen anywhere! :-)
>>
>> Read this:
>>
>> http://unix.stackexchange.com/questions/93412/difference-between-ifconfig-and-ip-commands
>>
>> What I think is that CTDB is assigning the virtual IP via "ip" and
>> configuring the NIC with ethtool. So if you set an IP with the
>> "ip" command, that address is not shown by "ifconfig".
>>
>> Did you get rid of the IP error message?
>
>
> It would seem so, the last time it appeared in log.ctdb was here:
>
> 2014/12/16 14:48:41.784044 [recoverd:13666]: Takeover run starting
> 2014/12/16 14:48:41.784284 [recoverd:13666]: Failed to find node to cover ip
> 192.168.0.9
> 2014/12/16 14:48:41.784305 [recoverd:13666]: Failed to find node to cover ip
> 192.168.0.8
> 2014/12/16 14:48:41.850344 [recoverd:13666]: Takeover run completed
> successfully
>

If this only happens during startup, I would not worry about it.
It may be that none of the nodes are ready to accept IP addresses yet,
in which case this is just a benign but annoying message.


> A short while later there is this:
>
> 2014/12/16 14:52:34.242911 [recoverd:13666]: Takeover run starting
> 2014/12/16 14:52:34.243356 [13513]: Takeover of IP 192.168.0.9/8 on
> interface eth0
> 2014/12/16 14:52:34.261916 [13513]: Takeover of IP 192.168.0.8/8 on
> interface eth0

This looks wrong.
I suspect you want this to be using /24 netmasks, not /8 masks.
See also below in the 'ip addr show' output, where the mask is /24
for the static address.

I.e. you should probably change your public addresses file and set
the netmask to 24.
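As a minimal sketch, assuming the file lives in the usual place
(/etc/ctdb/public_addresses, or wherever CTDB_PUBLIC_ADDRESSES points)
and eth0 is the client-facing interface on your nodes:

    # /etc/ctdb/public_addresses  --  format: address/mask interface
    192.168.0.8/24 eth0
    192.168.0.9/24 eth0

With /8 in that file, CTDB takes the address over with an /8 mask, which
is exactly what the takeover log and the 'ip a l' output below show.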



> 2014/12/16 14:52:34.490010 [recoverd:13666]: Takeover run completed
> successfully
>
> The IP addresses never appear again.
>
>>
>> I think that's your main problem.
>
>
> I don't think so; tailing the log shows this:
>
> root at cluster1:~# tail /var/log/ctdb/log.ctdb
> 2014/12/16 18:11:23.866612 [13513]: Thawing priority 2
> 2014/12/16 18:11:23.866634 [13513]: Release freeze handler for prio 2
> 2014/12/16 18:11:23.866666 [13513]: Thawing priority 3
> 2014/12/16 18:11:23.866685 [13513]: Release freeze handler for prio 3
> 2014/12/16 18:11:23.873189 [recoverd:13666]: ctdb_control error: 'managed to
> lock reclock file from inside daemon'
> 2014/12/16 18:11:23.873235 [recoverd:13666]: ctdb_control error: 'managed to
> lock reclock file from inside daemon'
> 2014/12/16 18:11:23.873246 [recoverd:13666]: Async operation failed with
> ret=-1 res=-1 opcode=16
> 2014/12/16 18:11:23.873254 [recoverd:13666]: Async wait failed -
> fail_count=1
> 2014/12/16 18:11:23.873261 [recoverd:13666]: server/ctdb_recoverd.c:412
> Unable to set recovery mode. Recovery failed.
> 2014/12/16 18:11:23.873268 [recoverd:13666]: server/ctdb_recoverd.c:1996
> Unable to set recovery mode to normal on cluster
>
> This appears to be happening over and over again.
>
> ctdb status shows this:
>
> Number of nodes:3 (including 1 deleted nodes)
> pnn:1 192.168.1.10     OK (THIS NODE)
> pnn:2 192.168.1.11     UNHEALTHY

You can run 'ctdb scriptstatus' on node 1 and it should give you
more detail about why the node is unhealthy.
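(It reports the result of the last monitoring run of each event script
on the node it is asked about, so it is also worth running on the
unhealthy node itself, e.g.:

    ctdb scriptstatus          # run on node 2
    ctdb -n 2 scriptstatus     # or from any node, if your version supports -n

A failing script such as 10.interface or 50.samba, together with its
error output, is usually what marks the node unhealthy; the script names
here are just the usual defaults.)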



> Generation:1226492970
> Size:2
> hash:0 lmaster:1
> hash:1 lmaster:2
> Recovery mode:NORMAL (0)
> Recovery master:1
>
> ip a l
> Shows this:
>
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN group
> default
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state
> UP group default qlen 1000
>     link/ether 08:00:27:d6:92:30 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.0.6/24 brd 192.168.0.255 scope global eth0
>     inet 192.168.0.8/8 brd 192.255.255.255 scope global eth0
>     inet 192.168.0.9/8 brd 192.255.255.255 scope global secondary eth0
>     inet6 fe80::a00:27ff:fed6:9230/64 scope link
>        valid_lft forever preferred_lft forever
> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state
> UP group default qlen 1000
>     link/ether 08:00:27:03:79:17 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.1.10/24 brd 192.168.1.255 scope global eth1
>     inet6 fe80::a00:27ff:fe03:7917/64 scope link
>        valid_lft forever preferred_lft forever
>
> Rowland
>
>
>>
>> Stefan
>>
>>>> Rowland
>>>
>>> Stefan
>>>
>>>
>>> Am 16.12.14 um 10:30 schrieb Rowland Penny:
>>>>>>
>>>>>> On 16/12/14 07:53, Stefan Kania wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Rowland,
>>>>>>>
>>>>>>> did you see that you have some problems with IPs on node 1?
>>>>>>>
>>>>>>> 2014/12/15 16:32:28.300370 [recoverd: 2497]: Failed to find node to cover ip 192.168.0.9
>>>>>>> 2014/12/15 16:32:28.300412 [recoverd: 2497]: Failed to find node to cover ip 192.168.0.8
>>>>>>>
>>>>>>> I also had some problems with IPs and name resolution at the
>>>>>>> beginning. After I solved that problem everything was fine.
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I did wonder about those lines; I do not have 192.168.0.8 &
>>>>>> 192.168.0.9, but Ronnie posted this:
>>>>>>
>>>>>> No, you should not/need not create them on the system. Ctdbd
>>>>>> will create and assign these addresses automatically and
>>>>>> dynamically while the cluster is running.
>>>>>>
>>>>>> So, do I need to create them and if so, where? This is one
>>>>>> of those areas of CTDB that doesn't seem to be documented at
>>>>>> all.
>>>>>>
>>>>>> Rowland
>>>>>>
>>> -- Stefan Kania Landweg 13 25693 St. Michaelisdonn
>>>
>>> Signing every e-mail helps reduce spam. Sign your e-mail.
>>> More information at http://www.gnupg.org
>>>
>>> My key is available at
>>>
>>> hkp://subkeys.pgp.net
>>>
>>>
>
>

