[Samba] glusterfs + ctdb + nfs-ganesha, unplug the network cable of serving node, takes around 20 mins for IO to resume

Mon Mar 4 05:54:56 UTC 2019

Hi Dan,

On Mon, 25 Feb 2019 02:43:31 +0000, "Liu, Dan via samba"
<samba at lists.samba.org> wrote:

> We did some failover/failback tests on 2 nodes (A and B) with
> architecture 'glusterfs + ctdb(public address) + nfs-ganesha'.
> 
> 1st:
> During write, unplug the network cable of serving node A
> ->NFS Client took a few seconds to recover to continue writing.
> 
> After some minutes, plug the network cable of serving node A
> ->NFS Client also took a few seconds to recover to continue
> writing.
> 
> 2nd:
> During write, unplug the network cable of serving node A
> ->NFS Client took 20 minutes to recover to continue writing.
> This recovery time is too slow for clients to accept.

Definitely!  What was different between "1st" and "2nd"?  Were they
testing different scenarios?

> From the CTDB log, during failover and failback, the failed node failed
> to kill the connection with the client, while the recovery node failed
> to send a 'tickle ack' to the client to re-establish the connection.

The first really isn't a problem.  I'm not sure why CTDB attempts to do
a 2 way kill from the releasing node.  We're going to stop doing that
in the future.

The 2nd is a mystery.  Are you sure the network connection on node B
was up?  This message seems to indicate the network is down:

  2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
  2019/02/22 18:01:03.065191 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
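
To see how often this happens and for which clients, the log can be
scanned for failed tickle acks.  The following is only a rough sketch:
the pattern is based solely on the log lines quoted above, so other
CTDB versions may word the message differently, and the function name
failed_tickle_acks is made up for illustration.

```python
import ipaddress
import re

# Assumption: the message format matches the log lines quoted above.
TICKLE_FAIL = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) ctdbd\[\d+\]: "
    r".*Failed to send tcp tickle ack for (?P<addr>\S+)"
)


def failed_tickle_acks(lines):
    """Yield (timestamp, client address) for each failed tickle ack."""
    for line in lines:
        match = TICKLE_FAIL.search(line)
        if match is None:
            continue
        raw = match.group("addr")
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            # Not a bare IP address; report the text as logged.
            yield match.group("ts"), raw
            continue
        # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:10.10.11.18.
        if addr.version == 6 and addr.ipv4_mapped is not None:
            addr = addr.ipv4_mapped
        yield match.group("ts"), str(addr)
```

Feeding it the lines of the CTDB log from node B (for example via
open() on wherever your log file lives) gives one entry per failed
tickle ack, which can then be lined up against the times the cable was
unplugged.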

peace & happiness,
martin