Re: ctdb 4.11.2 version failed to recover
gengjichao at jd.com
Tue Dec 3 11:40:30 UTC 2019
> I am wondering if you are using:
> * ifdown <device> (which unassigns the IP address)
> * ip link set <device> down (or ifconfig <device> down)
I used the ifdown <device> command. My testing has two purposes:
The most important one is to simulate a network card failure. The second is to cover the case where an admin accidentally takes down the wrong interface.
From: Martin Schwenke [mailto:martin at meltin.net]
Sent: 3 December 2019 11:23
To: 耿纪超 <gengjichao at jd.com>
Cc: samba-technical at lists.samba.org
Subject: Re: ctdb 4.11.2 version failed to recover
On Fri, 29 Nov 2019 07:20:59 +0000, 耿纪超 via samba-technical <samba-technical at lists.samba.org> wrote:
> I am using ctdb version 4.11.2, including the newest
> patch (https://bugzilla.samba.org/show_bug.cgi?id=14175). But when I test a NIC failure, I encounter a problem: the ctdb cluster cannot recover to a normal state.
> The test steps are as follows:
> 1. The ctdb cluster has two nodes, nodeA and nodeB, the cluster status is
> 2. Run ifdown on the NICs of nodeA and nodeB that are configured with the private IPs.
> 3. After 25 seconds, nodeA and nodeB each detect the other as dead, then call
> the functions ctdb_tcp_restart->ctdb_tcp_node_connect,
> but bind fails and the log shows:
> node 10.240.226.211:4379 is dead: 0 connected
> Tearing down connection to dead node :1
> Failed to bind socket (Cannot assign requested address)
It really depends what you are trying to test and how you are doing it...
I am wondering if you are using:
* ifdown <device> (which unassigns the IP address)
* ip link set <device> down (or ifconfig <device> down)
The first of these definitely does not test anything like a hardware/link failure. Normally, if a link goes down the IP address will stay on the interface. This case is much more likely than the case where an admin accidentally takes down the wrong interface.
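To illustrate why ifdown leads to the "Cannot assign requested address" message in the quoted log: on Linux, bind() to an IPv4 address that is not currently assigned to any local interface fails with EADDRNOTAVAIL. The following is a minimal standalone sketch, not ctdb code. The helper name try_bind is hypothetical, and 192.0.2.1 (a documentation-only address) is assumed not to be configured on the machine and net.ipv4.ip_nonlocal_bind is assumed to be off.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of ctdb): try to bind a TCP socket to
 * addr with an ephemeral port. Returns 0 on success, otherwise the
 * errno value of the step that failed.
 */
static int try_bind(const char *addr)
{
	struct sockaddr_in sin;
	int fd;
	int err = 0;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(0); /* let the kernel pick a port */
	if (inet_pton(AF_INET, addr, &sin.sin_addr) != 1) {
		return EINVAL; /* not a valid IPv4 literal */
	}

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd == -1) {
		return errno;
	}

	/* Fails with EADDRNOTAVAIL if addr is not assigned locally. */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		err = errno;
	}

	close(fd);
	return err;
}
```

Calling try_bind() with the node's private address after ifdown would be expected to return EADDRNOTAVAIL, whereas after ip link set <device> down the address normally stays on the interface and the bind itself would still succeed.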
> When bind fails, nothing will re-establish the connections, even after the
> NIC is up again. I think that when bind fails, we should use a timer to retry.
> The patch follows. I have tested it and it works well.
> --- a/ctdb/tcp/tcp_connect.c
> +++ b/ctdb/tcp/tcp_connect.c
> @@ -236,6 +236,11 @@ void ctdb_tcp_node_connect(struct tevent_context *ev, struct tevent_timer *te,
> DBG_ERR("Failed to bind socket (%s)\n", strerror(errno));
> tnode->out_fd = -1;
> + tnode->connect_te = tevent_add_timer(ctdb->ev,
> + tnode,
> + timeval_current_ofs(5, 0),
> + ctdb_tcp_node_connect,
> + node);
So, while you have identified a situation from which ctdbd does not recover and provided a possible fix, I would like to understand what you are trying to test before we agree on the best fix. ;-)
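For reference, the retry idea in the quoted patch can be sketched outside of ctdb as follows. This is only an illustration under stated assumptions: the function name bind_with_retry and its parameters are hypothetical, and it uses a blocking loop with nanosleep() where the patch re-arms a tevent timer, so it is not how ctdbd itself would be changed.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of ctdb): bind a TCP socket to addr,
 * retrying up to max_attempts times with delay_ms between attempts.
 * Returns the bound fd on success. On failure returns -1 with errno
 * set to the error from the last attempt.
 */
static int bind_with_retry(const char *addr, int max_attempts, long delay_ms)
{
	struct sockaddr_in sin;
	struct timespec delay;
	int saved_errno = EINVAL;
	int attempt;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(0);
	if (inet_pton(AF_INET, addr, &sin.sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}

	delay.tv_sec = delay_ms / 1000;
	delay.tv_nsec = (delay_ms % 1000) * 1000000L;

	for (attempt = 1; attempt <= max_attempts; attempt++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd != -1) {
			if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == 0) {
				return fd; /* address is available again */
			}
			saved_errno = errno;
			close(fd);
		} else {
			saved_errno = errno;
		}

		/* Wait before the next attempt, but not after the last one. */
		if (attempt < max_attempts) {
			nanosleep(&delay, NULL);
		}
	}

	errno = saved_errno;
	return -1;
}
```

With delay_ms set to 5000 this mirrors the 5 second interval in the patch. A bounded number of attempts is a choice made for the sketch only; the patch as quoted simply schedules another connect attempt.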
peace & happiness,