ctdb 4.11.2 version failed to recover

Martin Schwenke martin at meltin.net
Tue Dec 3 03:23:21 UTC 2019


On Fri, 29 Nov 2019 07:20:59 +0000, 耿纪超 via samba-technical
<samba-technical at lists.samba.org> wrote:

> I am using ctdb 4.11.2, including the newest patch
> (https://bugzilla.samba.org/show_bug.cgi?id=14175). However, when I test a
> NIC failure, I encounter a problem: the ctdb cluster cannot return to
> normal.
> The test steps are as follows:
> 1. The ctdb cluster has two nodes, nodeA and nodeB, and the cluster status is OK.
> 2. Ifdown the NICs on nodeA and nodeB that are configured with the private IP.
> 3. After 25 seconds, nodeA and nodeB each detect the other as dead, then call
> ctdb_tcp_restart->ctdb_tcp_node_connect, but bind fails and the log shows:
> node is dead: 0 connected
> Tearing down connection to dead node :1
> Failed to bind socket (Cannot assign requested address)

It really depends on what you are trying to test and how you are doing it.

I am wondering if you are using:

* ifdown <device> (which unassigns the IP address)

* ip link set <device> down (or ifconfig <device> down)

The first of these definitely does not test anything like a
hardware/link failure.  Normally, if a link goes down the IP address
will stay on the interface.  This case is much more likely than the
case where an admin accidentally takes down the wrong interface.
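
To illustrate why the distinction matters for the log above: once an
address has been removed from the interface, bind() on that source
address fails with EADDRNOTAVAIL, which strerror() renders as "Cannot
assign requested address" on Linux.  A minimal sketch, assuming a Linux
host with the default net.ipv4.ip_nonlocal_bind=0.  try_bind() is a
hypothetical helper (not a ctdb function), and 192.0.2.1 is a
documentation address assumed not to be configured locally:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of ctdb: try to bind a TCP socket to
 * the source address `ip` on a kernel-chosen port.
 * Returns 0 on success, or the errno value of the call that failed.
 */
int try_bind(const char *ip)
{
	struct sockaddr_in sin;
	int fd;
	int err = 0;

	assert(ip != NULL);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = 0; /* let the kernel choose a port */
	if (inet_pton(AF_INET, ip, &sin.sin_addr) != 1) {
		return EINVAL; /* not a valid IPv4 address string */
	}

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd == -1) {
		return errno;
	}

	/* Fails with EADDRNOTAVAIL if `ip` is not assigned to this host */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		err = errno;
	}

	close(fd);
	return err;
}
```

Under those assumptions, try_bind("192.0.2.1") should return
EADDRNOTAVAIL, while try_bind("0.0.0.0") should return 0.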

> solution:
> When bind fails, nothing re-establishes the connection, even after the NIC is back up. I think that when bind fails we should use a timer to retry. The patch follows; I have tested it and it works well.
> --- a/ctdb/tcp/tcp_connect.c
> +++ b/ctdb/tcp/tcp_connect.c
> @@ -236,6 +236,11 @@ void ctdb_tcp_node_connect(struct tevent_context *ev, struct tevent_timer *te,
>                 DBG_ERR("Failed to bind socket (%s)\n", strerror(errno));
>                 close(tnode->out_fd);
>                 tnode->out_fd = -1;
> +               tnode->connect_te = tevent_add_timer(ctdb->ev,
> +                                                       tnode,
> +                                                       timeval_current_ofs(5, 0),
> +                                                       ctdb_tcp_node_connect,
> +                                                       node);
>                 return;
>         }

So, while you have identified a situation from which ctdbd does not
recover and provided a possible fix, I would like to understand what
you are trying to test before we agree on the best fix. ;-)
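
For discussion only, the retry idea in the quoted patch can be sketched
outside of ctdb.  This is not the ctdb code and not a proposed fix: it
swaps the tevent timer for a plain blocking loop with sleep(), and
bind_with_retry(), max_attempts and delay_sec are hypothetical names
chosen for illustration:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of ctdb: bind a TCP socket to source
 * address `ip`, retrying up to `max_attempts` times with `delay_sec`
 * seconds between attempts.  A blocking loop stands in for the timer
 * used in the quoted patch.
 * Returns the bound fd on success, or -1 with errno set on failure.
 */
int bind_with_retry(const char *ip, int max_attempts, unsigned int delay_sec)
{
	struct sockaddr_in sin;
	int last_err = 0;
	int attempt;

	assert(ip != NULL && max_attempts > 0);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = 0; /* let the kernel choose a port */
	if (inet_pton(AF_INET, ip, &sin.sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}

	for (attempt = 1; attempt <= max_attempts; attempt++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd == -1) {
			return -1; /* errno set by socket() */
		}

		if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == 0) {
			return fd; /* caller owns the socket */
		}

		last_err = errno; /* save before other calls clobber it */
		fprintf(stderr, "Failed to bind socket (%s), attempt %d/%d\n",
			strerror(last_err), attempt, max_attempts);
		close(fd);

		if (attempt < max_attempts) {
			sleep(delay_sec); /* wait before the next attempt */
		}
	}

	errno = last_err;
	return -1;
}
```

A caller would use something like bind_with_retry(addr, 5, 5) and close
the returned fd when done.  Whether retrying is the right behaviour at
all still depends on the test scenario being discussed.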


peace & happiness,
