Re: ctdb 4.11.2 version failed to recover

Tue Dec 3 11:40:30 UTC 2019

> I am wondering if you are using:
>
> * ifdown <device> (which unassigns the IP address)
>
> * ip link set <device> down (or ifconfig <device> down)

I used the ifdown <device> command. My testing has two purposes:
the most important one is to simulate a network card failure; the second is to cover the case where an admin accidentally takes down the wrong interface.

-----Original Message-----
From: Martin Schwenke [mailto:martin at meltin.net]
Sent: December 3, 2019 11:23
To: 耿纪超 <gengjichao at jd.com>
Cc: samba-technical at lists.samba.org
Subject: Re: ctdb 4.11.2 version failed to recover

Hi,

On Fri, 29 Nov 2019 07:20:59 +0000, 耿纪超 via samba-technical <samba-technical at lists.samba.org> wrote:

> I am using ctdb 4.11.2, including the newest patch
> (https://bugzilla.samba.org/show_bug.cgi?id=14175). But when I test a
> NIC failure, I encounter a problem: the ctdb cluster cannot recover to
> normal.
>
> The test steps are as follows:
>
> 1. The ctdb cluster has two nodes, nodeA and nodeB, and the cluster
>    status is OK.
>
> 2. Run ifdown on the NIC that carries the private IP, on both nodeA
>    and nodeB.
>
> 3. After 25 seconds, nodeA and nodeB each detect the other as dead and
>    then call ctdb_tcp_restart->ctdb_tcp_node_connect, but bind fails
>    and the log shows:
> 
> node 10.240.226.211:4379 is dead: 0 connected
> 
> Tearing down connection to dead node :1
> 
> Failed to bind socket (Cannot assign requested address)

It really depends what you are trying to test and how you are doing it...

I am wondering if you are using:

* ifdown <device> (which unassigns the IP address)

* ip link set <device> down (or ifconfig <device> down)

The first of these definitely does not test anything like a hardware/link failure.  Normally, if a link goes down the IP address will stay on the interface.  This case is much more likely than the case where an admin accidentally takes down the wrong interface.
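
To illustrate why the distinction matters for the log above: "Cannot assign requested address" is the message for EADDRNOTAVAIL, which bind() returns when the requested address is not configured on any local interface. That is presumably what happens once ifdown has removed the private IP. A minimal standalone sketch, not ctdb code; try_bind_ipv4 is a hypothetical helper name:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of ctdb): try to bind a TCP socket to
 * the given IPv4 address, letting the kernel pick the port.
 * Returns 0 on success, or the errno value of the call that failed.
 */
static int try_bind_ipv4(const char *addr)
{
	struct sockaddr_in sin;
	int fd;
	int ret = 0;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = 0; /* any free port */
	if (inet_pton(AF_INET, addr, &sin.sin_addr) != 1) {
		return EINVAL; /* not a valid dotted-quad address */
	}

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd == -1) {
		return errno;
	}

	/* Fails with EADDRNOTAVAIL if addr is not assigned locally */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		ret = errno;
	}

	close(fd);
	return ret;
}
```

On a typical Linux host, try_bind_ipv4("192.0.2.1") (an address from the documentation range, so normally not assigned anywhere) would be expected to return EADDRNOTAVAIL, unless net.ipv4.ip_nonlocal_bind is enabled. After "ip link set <device> down" the address stays on the interface, so the same bind would be expected to succeed.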

> Solution:
>        When bind fails, nothing re-establishes the connection, even
> after the NIC is back up. I think that when bind fails we should use
> a timer to retry. The patch follows; I have tested it and it works
> well.
> --- a/ctdb/tcp/tcp_connect.c
> +++ b/ctdb/tcp/tcp_connect.c
> @@ -236,6 +236,11 @@ void ctdb_tcp_node_connect(struct tevent_context *ev, struct tevent_timer *te,
>                 DBG_ERR("Failed to bind socket (%s)\n", strerror(errno));
>                 close(tnode->out_fd);
>                 tnode->out_fd = -1;
> +               tnode->connect_te = tevent_add_timer(ctdb->ev,
> +                                                       tnode,
> +                                                       timeval_current_ofs(5, 0),
> +                                                       ctdb_tcp_node_connect,
> +                                                       node);
>                 return;
>         }
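
For anyone following along, the behaviour the patch is aiming for can be sketched without tevent or ctdb. This is only a simulation under stated assumptions: every name below (fake_node, fake_bind, fake_node_connect, fake_run_timers) is hypothetical, the "timer" is just a flag plus a delay, and the 5 second interval is taken from the patch.

```c
#include <stdbool.h>

/* All names here are hypothetical; this is not ctdb or tevent code. */

struct fake_node {
	int failures_left;       /* bind attempts that will still fail */
	bool retry_pending;      /* stand-in for a scheduled retry timer */
	unsigned int retry_secs; /* delay requested for the pending retry */
	unsigned int attempts;   /* total connect attempts so far */
	bool connected;
};

/* Stand-in for bind(): fails while the address is still missing. */
static int fake_bind(struct fake_node *n)
{
	if (n->failures_left > 0) {
		n->failures_left--;
		return -1;
	}
	return 0;
}

/*
 * One connect attempt. On bind failure, record that a retry should be
 * scheduled instead of returning with nothing left to trigger a
 * reconnect.
 */
static void fake_node_connect(struct fake_node *n)
{
	n->attempts++;
	n->retry_pending = false;

	if (fake_bind(n) != 0) {
		n->retry_pending = true; /* the patch arms a timer here */
		n->retry_secs = 5;
		return;
	}

	n->connected = true;
}

/*
 * Minimal "event loop": fire pending retries until none remain.
 * Returns the simulated number of seconds that elapsed.
 */
static unsigned int fake_run_timers(struct fake_node *n)
{
	unsigned int elapsed = 0;

	while (n->retry_pending) {
		elapsed += n->retry_secs;
		fake_node_connect(n);
	}
	return elapsed;
}
```

Without the retry_pending step, the first failed bind would leave the node disconnected with nothing scheduled to try again, which matches the symptom reported above. With it, the node reconnects on the first attempt after the address returns.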

So, while you have identified a situation from which ctdbd does not recover and provided a possible fix, I would like to understand what you are trying to test before we agree on the best fix. ;-)

Thanks...

peace & happiness,
martin