ctdb, haproxy, and ip_nonlocal_bind

Wyllys Ingersoll wyllys.ingersoll at keepertech.com
Thu Jan 30 15:34:11 UTC 2020


Follow up - got it working.

The problem I was having was that the ordering of the IP addresses in the
"nodes" file was different on each of the nodes.  I didn't think the
ordering would matter since I had specified "node address" in the ctdb.conf
[cluster] config section, but apparently it's important.
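
For anyone hitting the same thing, here is a rough sketch of the layout
that ended up working (the addresses and the recovery lock path are the
ones from the logs below; adjust for your own cluster):

  # /etc/ctdb/nodes -- must be identical, line for line, on every node,
  # because a node's PNN is simply its line number in this file
  192.168.113.13
  192.168.113.14

  # /etc/ctdb/ctdb.conf on the node whose private address is 192.168.113.14
  [cluster]
      recovery lock = /cephfs/ctdb/.ctdb.lock
      node address = 192.168.113.14

A quick way to confirm the files really do match is to compare checksums,
e.g. "md5sum /etc/ctdb/nodes" on each node, or to run "ctdb listnodes"
once ctdbd is up and compare the output across nodes.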

Now both nodes are in OK status and I can mount shares through an IP
address managed by haproxy.
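
The haproxy/keepalived side still needs net.ipv4.ip_nonlocal_bind=1 (as
mentioned further down in the thread); with CTDB no longer managing any
public_addresses, that sysctl and CTDB seem to coexist fine here.  Something
like this (the file name is just an example) makes it persistent:

  # /etc/sysctl.d/90-haproxy.conf
  net.ipv4.ip_nonlocal_bind = 1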

Thanks for the help!



On Thu, Jan 30, 2020 at 9:42 AM Wyllys Ingersoll <
wyllys.ingersoll at keepertech.com> wrote:

> Updated to 4.10.13, and now it doesn't get into an endless failure loop
> and flood the logs like before, but the 2nd node still isn't able to
> rejoin with the first; the following logs appear every minute or so.
> Note the recovery lock is on the shared filesystem (/cephfs).  The first
> node shows "OK" status for node 1 and UNHEALTHY for node 2, but node 2
> says that both are UNHEALTHY.
>
> 2020/01/30 09:38:16.811921 ctdbd[174621]: 192.168.113.14:4379: node
> 192.168.113.13:4379 is dead: 0 connected
> 2020/01/30 09:38:16.811999 ctdbd[174621]: Tearing down connection to dead
> node :1
> 2020/01/30 09:38:16.812527 ctdb-recoverd[174724]: ctdb_control error:
> 'node is disconnected'
> 2020/01/30 09:38:16.812554 ctdb-recoverd[174724]: ctdb_control error:
> 'node is disconnected'
> 2020/01/30 09:38:16.812568 ctdb-recoverd[174724]: Async operation failed
> with ret=-1 res=-1 opcode=80
> 2020/01/30 09:38:16.812577 ctdb-recoverd[174724]: Async wait failed -
> fail_count=1
> 2020/01/30 09:38:16.812585 ctdb-recoverd[174724]:
> ../../ctdb/server/ctdb_client.c:1920 Failed to read node capabilities.
> 2020/01/30 09:38:16.812593 ctdb-recoverd[174724]:
> ../../ctdb/server/ctdb_recoverd.c:370 Failed to get node capabilities
> 2020/01/30 09:38:16.812600 ctdb-recoverd[174724]:
> ../../ctdb/server/ctdb_recoverd.c:2756 Unable to update node capabilities.
> 2020/01/30 09:38:16.813127 ctdbd[174621]: 192.168.113.14:4379: connected
> to 192.168.113.13:4379 - 1 connected
> 2020/01/30 09:38:16.813258 ctdbd[174621]: 192.168.113.14:4379: node
> 192.168.113.13:4379 is dead: 0 connected
> 2020/01/30 09:38:16.813279 ctdbd[174621]: Tearing down connection to dead
> node :1
> 2020/01/30 09:38:16.814436 ctdb-recoverd[174724]:
> ../../ctdb/server/ctdb_recoverd.c:1342 Starting do_recovery
> 2020/01/30 09:38:16.814479 ctdb-recoverd[174724]: Attempting to take
> recovery lock (/cephfs/ctdb/.ctdb.lock)
> 2020/01/30 09:38:16.816249 ctdbd[174621]: 192.168.113.14:4379: connected
> to 192.168.113.13:4379 - 1 connected
> 2020/01/30 09:38:16.816507 ctdbd[174621]: 192.168.113.14:4379: node
> 192.168.113.13:4379 is dead: 0 connected
> 2020/01/30 09:38:16.816532 ctdbd[174621]: Tearing down connection to dead
> node :1
> 2020/01/30 09:38:16.817054 ctdbd[174621]: 192.168.113.14:4379: connected
> to 192.168.113.13:4379 - 1 connected
> 2020/01/30 09:38:16.817555 ctdbd[174621]: 192.168.113.14:4379: node
> 192.168.113.13:4379 is dead: 0 connected
> 2020/01/30 09:38:16.817580 ctdbd[174621]: Tearing down connection to dead
> node :1
> 2020/01/30 09:38:16.819106 ctdbd[174621]: 192.168.113.14:4379: connected
> to 192.168.113.13:4379 - 1 connected
> 2020/01/30 09:38:16.819276 ctdbd[174621]: 192.168.113.14:4379: node
> 192.168.113.13:4379 is dead: 0 connected
> 2020/01/30 09:38:16.819324 ctdbd[174621]: Tearing down connection to dead
> node :1
> 2020/01/30 09:38:16.819825 ctdbd[174621]: 192.168.113.14:4379: node
> 192.168.113.13:4379 is already marked disconnected: 0 connected
> 2020/01/30 09:38:16.823817 ctdb-recoverd[174724]: Unable to take recovery
> lock - contention
> 2020/01/30 09:38:16.823972 ctdb-recoverd[174724]: Unable to take recovery
> lock
> 2020/01/30 09:38:16.824173 ctdb-recoverd[174724]: Retrying recovery
> 2020/01/30 09:38:17.816679 ctdb-recoverd[174724]:
> ../../ctdb/server/ctdb_recoverd.c:1342 Starting do_recovery
> 2020/01/30 09:38:17.816731 ctdb-recoverd[174724]: Attempting to take
> recovery lock (/cephfs/ctdb/.ctdb.lock)
> 2020/01/30 09:38:17.824619 ctdb-recoverd[174724]: Unable to take recovery
> lock - contention
> 2020/01/30 09:38:17.824907 ctdb-recoverd[174724]: Unable to take recovery
> lock
> 2020/01/30 09:38:17.825121 ctdb-recoverd[174724]: Retrying recovery
> 2020/01/30 09:38:18.818706 ctdb-recoverd[174724]:
> ../../ctdb/server/ctdb_recoverd.c:1342 Starting do_recovery
> 2020/01/30 09:38:18.818752 ctdb-recoverd[174724]: Attempting to take
> recovery lock (/cephfs/ctdb/.ctdb.lock)
> 2020/01/30 09:38:18.824518 ctdb-recoverd[174724]: Unable to take recovery
> lock - contention
> 2020/01/30 09:38:18.824671 ctdb-recoverd[174724]: Unable to take recovery
> lock
> 2020/01/30 09:38:18.824784 ctdb-recoverd[174724]: Retrying recovery
> 2020/01/30 09:38:19.822390 ctdb-recoverd[174724]:
> ../../ctdb/server/ctdb_recoverd.c:1342 Starting do_recovery
> 2020/01/30 09:38:19.822439 ctdb-recoverd[174724]: Attempting to take
> recovery lock (/cephfs/ctdb/.ctdb.lock)
> 2020/01/30 09:38:19.823006 ctdbd[174621]: 192.168.113.14:4379: connected
> to 192.168.113.13:4379 - 1 connected
> 2020/01/30 09:38:19.827168 ctdb-recoverd[174724]: Unable to take recovery
> lock - contention
> 2020/01/30 09:38:19.827369 ctdb-recoverd[174724]: Unable to take recovery
> lock
> 2020/01/30 09:38:19.827511 ctdb-recoverd[174724]: Retrying recovery
> 2020/01/30 09:38:20.820622 ctdbd[174621]: pnn 0 Invalid reqid 2649 in
> ctdb_reply_control
>
>
>
>
> On Thu, Jan 30, 2020 at 7:46 AM Wyllys Ingersoll <
> wyllys.ingersoll at keepertech.com> wrote:
>
>> Thanks, I'll try it and let you know.
>>
>> -Wyllys
>>
>> On Thu, Jan 30, 2020 at 12:15 AM Martin Schwenke <martin at meltin.net>
>> wrote:
>>
>>> On Wed, 29 Jan 2020 16:43:07 -0500, Wyllys Ingersoll via
>>> samba-technical <samba-technical at lists.samba.org> wrote:
>>>
>>> > I have a cluster in which I want to use both haproxy AND ctdb on the
>>> > same nodes - haproxy to manage NFSv4 (ganesha) and CTDB to manage SMB.
>>> > The reason for doing this is that I've read several warnings about NOT
>>> > using NFSv4 with CTDB.
>>> >
>>> > haproxy + keepalived require that the net.ipv4.ip_nonlocal_bind flag be
>>> > set to 1, which breaks ctdb's ability to manage the virtual public IP
>>> > addresses (among other things).
>>> >
>>> > If I do not configure any public_addresses and just let haproxy
>>> > configure the virtual public IP addresses, CTDB is still unable to
>>> > start up on both of the nodes in my test cluster.  It will start on one
>>> > or the other, but they never sync up and come to an "OK" state on both
>>> > nodes.
>>> >
>>> > I have the "node address" value set in the [cluster] section of
>>> > ctdb.conf on each node to be the private address of that node, both
>>> > private addresses are listed in the nodes configuration file, and the
>>> > nodes are connecting to each other privately, but they don't stay
>>> > connected and the 2nd ctdb node never fully initializes and starts up.
>>> > At some point it just begins flooding the logs with messages like "node
>>> > 192.168.113.14:4379 is already marked disconnected: 0 connected" and
>>> > pegging the CPU at almost 100% until the disk holding the logs
>>> > completely fills up (which sounds like a bug, btw).
>>> >
>>> > Does anyone know of any way to make this sort of configuration work?
>>> >
>>> > Currently running Samba 4.10.10, haproxy 1.6.3, and Linux Kernel
>>> > 4.19.34 on Ubuntu 16.04.4.
>>> >
>>> > Any help would be much appreciated.
>>>
>>> Using "node address" should make this work.
>>>
>>> However, you're being bitten by this bug:
>>>
>>>   https://bugzilla.samba.org/show_bug.cgi?id=14175
>>>
>>> This is fixed in Samba 4.10.13.
>>>
>>> I hope that upgrading makes this work for you.  Please let us know if
>>> it doesn't...
>>>
>>> peace & happiness,
>>> martin
>>>
>>

