ctdb, haproxy, and ip_nonlocal_bind

Wyllys Ingersoll wyllys.ingersoll at keepertech.com
Thu Jan 30 14:42:42 UTC 2020


I updated to 4.10.13, and it no longer gets into the endless failure loop
that flooded the logs before, but the 2nd node still isn't able to rejoin
the first; the logs below appear every minute or so.
Note that the recovery lock is on the shared filesystem (/cephfs). The
first node shows "OK" status for node 1 and UNHEALTHY for node 2, but
node 2 says that both are UNHEALTHY.
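
For reference, here is roughly what the relevant configuration looks like
on each node (a sketch - the lock path and addresses are this cluster's,
everything else is standard ctdb.conf/nodes syntax):

  # /etc/ctdb/ctdb.conf (node 2 shown; node 1 uses 192.168.113.13)
  [cluster]
      # this node's address on the private interconnect
      node address = 192.168.113.14
      # fcntl-based recovery lock on the shared cephfs mount
      recovery lock = /cephfs/ctdb/.ctdb.lock

  # /etc/ctdb/nodes (identical on both nodes)
  192.168.113.13
  192.168.113.14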

2020/01/30 09:38:16.811921 ctdbd[174621]: 192.168.113.14:4379: node 192.168.113.13:4379 is dead: 0 connected
2020/01/30 09:38:16.811999 ctdbd[174621]: Tearing down connection to dead node :1
2020/01/30 09:38:16.812527 ctdb-recoverd[174724]: ctdb_control error: 'node is disconnected'
2020/01/30 09:38:16.812554 ctdb-recoverd[174724]: ctdb_control error: 'node is disconnected'
2020/01/30 09:38:16.812568 ctdb-recoverd[174724]: Async operation failed with ret=-1 res=-1 opcode=80
2020/01/30 09:38:16.812577 ctdb-recoverd[174724]: Async wait failed - fail_count=1
2020/01/30 09:38:16.812585 ctdb-recoverd[174724]: ../../ctdb/server/ctdb_client.c:1920 Failed to read node capabilities.
2020/01/30 09:38:16.812593 ctdb-recoverd[174724]: ../../ctdb/server/ctdb_recoverd.c:370 Failed to get node capabilities
2020/01/30 09:38:16.812600 ctdb-recoverd[174724]: ../../ctdb/server/ctdb_recoverd.c:2756 Unable to update node capabilities.
2020/01/30 09:38:16.813127 ctdbd[174621]: 192.168.113.14:4379: connected to 192.168.113.13:4379 - 1 connected
2020/01/30 09:38:16.813258 ctdbd[174621]: 192.168.113.14:4379: node 192.168.113.13:4379 is dead: 0 connected
2020/01/30 09:38:16.813279 ctdbd[174621]: Tearing down connection to dead node :1
2020/01/30 09:38:16.814436 ctdb-recoverd[174724]: ../../ctdb/server/ctdb_recoverd.c:1342 Starting do_recovery
2020/01/30 09:38:16.814479 ctdb-recoverd[174724]: Attempting to take recovery lock (/cephfs/ctdb/.ctdb.lock)
2020/01/30 09:38:16.816249 ctdbd[174621]: 192.168.113.14:4379: connected to 192.168.113.13:4379 - 1 connected
2020/01/30 09:38:16.816507 ctdbd[174621]: 192.168.113.14:4379: node 192.168.113.13:4379 is dead: 0 connected
2020/01/30 09:38:16.816532 ctdbd[174621]: Tearing down connection to dead node :1
2020/01/30 09:38:16.817054 ctdbd[174621]: 192.168.113.14:4379: connected to 192.168.113.13:4379 - 1 connected
2020/01/30 09:38:16.817555 ctdbd[174621]: 192.168.113.14:4379: node 192.168.113.13:4379 is dead: 0 connected
2020/01/30 09:38:16.817580 ctdbd[174621]: Tearing down connection to dead node :1
2020/01/30 09:38:16.819106 ctdbd[174621]: 192.168.113.14:4379: connected to 192.168.113.13:4379 - 1 connected
2020/01/30 09:38:16.819276 ctdbd[174621]: 192.168.113.14:4379: node 192.168.113.13:4379 is dead: 0 connected
2020/01/30 09:38:16.819324 ctdbd[174621]: Tearing down connection to dead node :1
2020/01/30 09:38:16.819825 ctdbd[174621]: 192.168.113.14:4379: node 192.168.113.13:4379 is already marked disconnected: 0 connected
2020/01/30 09:38:16.823817 ctdb-recoverd[174724]: Unable to take recovery lock - contention
2020/01/30 09:38:16.823972 ctdb-recoverd[174724]: Unable to take recovery lock
2020/01/30 09:38:16.824173 ctdb-recoverd[174724]: Retrying recovery
2020/01/30 09:38:17.816679 ctdb-recoverd[174724]: ../../ctdb/server/ctdb_recoverd.c:1342 Starting do_recovery
2020/01/30 09:38:17.816731 ctdb-recoverd[174724]: Attempting to take recovery lock (/cephfs/ctdb/.ctdb.lock)
2020/01/30 09:38:17.824619 ctdb-recoverd[174724]: Unable to take recovery lock - contention
2020/01/30 09:38:17.824907 ctdb-recoverd[174724]: Unable to take recovery lock
2020/01/30 09:38:17.825121 ctdb-recoverd[174724]: Retrying recovery
2020/01/30 09:38:18.818706 ctdb-recoverd[174724]: ../../ctdb/server/ctdb_recoverd.c:1342 Starting do_recovery
2020/01/30 09:38:18.818752 ctdb-recoverd[174724]: Attempting to take recovery lock (/cephfs/ctdb/.ctdb.lock)
2020/01/30 09:38:18.824518 ctdb-recoverd[174724]: Unable to take recovery lock - contention
2020/01/30 09:38:18.824671 ctdb-recoverd[174724]: Unable to take recovery lock
2020/01/30 09:38:18.824784 ctdb-recoverd[174724]: Retrying recovery
2020/01/30 09:38:19.822390 ctdb-recoverd[174724]: ../../ctdb/server/ctdb_recoverd.c:1342 Starting do_recovery
2020/01/30 09:38:19.822439 ctdb-recoverd[174724]: Attempting to take recovery lock (/cephfs/ctdb/.ctdb.lock)
2020/01/30 09:38:19.823006 ctdbd[174621]: 192.168.113.14:4379: connected to 192.168.113.13:4379 - 1 connected
2020/01/30 09:38:19.827168 ctdb-recoverd[174724]: Unable to take recovery lock - contention
2020/01/30 09:38:19.827369 ctdb-recoverd[174724]: Unable to take recovery lock
2020/01/30 09:38:19.827511 ctdb-recoverd[174724]: Retrying recovery
2020/01/30 09:38:20.820622 ctdbd[174621]: pnn 0 Invalid reqid 2649 in ctdb_reply_control
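
The "contention" suggests the other node (or a stale holder) still has the
lock, so I'm also going to sanity-check that fcntl byte-range locking is
coherent across the two cephfs mounts, along these lines (the ping_pong
utility comes with ctdb; the data file path here is arbitrary, and the
count should be number of nodes + 1):

  # run simultaneously on both nodes
  ping_pong /cephfs/ctdb/ping_pong.dat 3

  # and confirm both daemons agree on the configured lock
  ctdb getreclock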




On Thu, Jan 30, 2020 at 7:46 AM Wyllys Ingersoll <wyllys.ingersoll at keepertech.com> wrote:

> Thanks, Ill try it and let you know.
>
> -Wyllys
>
> On Thu, Jan 30, 2020 at 12:15 AM Martin Schwenke <martin at meltin.net>
> wrote:
>
>> On Wed, 29 Jan 2020 16:43:07 -0500, Wyllys Ingersoll via
>> samba-technical <samba-technical at lists.samba.org> wrote:
>>
>> > I have a cluster in which I want to use both haproxy AND ctdb on the
>> > same nodes - haproxy to manage NFSv4 (ganesha) and CTDB to manage SMB.
>> > The reason for doing this is that I've read several warnings about NOT
>> > using NFSv4 with CTDB.
>> >
>> > haproxy + keepalived require that the net.ipv4.ip_nonlocal_bind flag be
>> > set to 1, which breaks ctdb's ability to manage the virtual public IP
>> > addresses (among other things).
>> >
>> > If I do not configure any public_addresses and just let haproxy
>> > configure the virtual public IP addresses, CTDB is still unable to
>> > start up on both of the nodes in my test cluster.  It will start on one
>> > or the other, but they never sync up and come to an "OK" state on both
>> > nodes.
>> >
>> > I have the "node address" value set in the [cluster] section of
>> > ctdb.conf on each node to be the private address of that node, and both
>> > private addresses are listed in the nodes configuration file. The nodes
>> > connect to each other privately, but they don't stay connected, and the
>> > 2nd ctdb node never fully initializes and starts up.  At some point it
>> > just begins flooding the logs with messages like this: "node
>> > 192.168.113.14:4379 is already marked disconnected: 0 connected" and
>> > pegging the CPU at almost 100% until the disk with the logging
>> > completely fills up (which sounds like a bug, btw).
>> >
>> > Does anyone know of any way to make this sort of configuration work?
>> >
>> > Currently running Samba 4.10.10, haproxy 1.6.3, and Linux kernel
>> > 4.19.34 on Ubuntu 16.04.4.
>> >
>> > Any help would be much appreciated.
>>
>> Using "node address" should make this work.
>>
>> However, you're being bitten by this bug:
>>
>>   https://bugzilla.samba.org/show_bug.cgi?id=14175
>>
>> This is fixed in Samba 4.10.13.
>>
>> I hope that upgrading makes this work for you.  Please let us know if
>> it doesn't...
>>
>> peace & happiness,
>> martin
>>
>
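
P.S. For anyone finding this thread in the archives: the sysctl discussed
above (required by haproxy/keepalived, and what interferes with ctdb's
public-address management) is just the standard kernel knob, e.g.:

  # /etc/sysctl.d/90-haproxy.conf (the filename is arbitrary)
  net.ipv4.ip_nonlocal_bind = 1

  # apply without a reboot
  sysctl -p /etc/sysctl.d/90-haproxy.conf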

