[Samba] glusterfs + ctdb + nfs-ganesha , unplug the network cable of serving node, takes around ~20 mins for IO to resume

Liu, Dan liud.fnst at cn.fujitsu.com
Mon Mar 4 08:38:34 UTC 2019


Martin

Thanks for replying.

> > We did some failover/failback tests on 2 nodes (A and B) with
> > architecture 'glusterfs + ctdb (public address) + nfs-ganesha'.
> >
> > 1st:
> > During write, unplug the network cable of serving node A
> > -> NFS client took a few seconds to recover and continue writing.
> >
> > After some minutes, plug the network cable of serving node A
> > -> NFS client also took a few seconds to recover and continue
> > writing.
> >
> > 2nd:
> > During write, unplug the network cable of serving node A
> > -> NFS client took 20 minutes to recover and continue writing.
> > This recovery time is too slow for clients to accept.
> 
> Definitely!  What was different between "1st" and "2nd"?  Were they testing
> different scenarios?

"1st" and "2nd" were tested in the same scenario.
After we updated the client (Mac mini) OS to Mojave v10.14.3 (the problem happened on v10.10), unlike "2nd",
every takeover took only a few minutes to finish.
So I think the ~20 minute takeover might be a client-side problem.
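
One way to put a number on the recovery time from the client side is to time small synchronous writes on the NFS mount and report the longest gap between completions. The following is only a minimal Python sketch and was not part of the tests described here: the helper names timed_writes and longest_stall are hypothetical, and it assumes a hard-mounted NFS export, so that a stalled write blocks instead of returning an error.

```python
import os
import time


def longest_stall(timestamps):
    """Return the largest gap, in seconds, between consecutive write completions."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return max(gaps, default=0.0)


def timed_writes(path, count=600, interval=1.0, clock=time.monotonic):
    """Append a small record `count` times, fsync each one, and return completion times."""
    stamps = []
    with open(path, "ab") as handle:
        for index in range(count):
            handle.write(f"record {index}\n".encode())
            handle.flush()
            # Assumption: on a hard mount, fsync blocks until the server answers,
            # so a takeover shows up as a long gap between two timestamps.
            os.fsync(handle.fileno())
            stamps.append(clock())
            time.sleep(interval)
    return stamps
```

Running longest_stall(timed_writes(path)) against a file on the mount (the path is whatever the client uses, for example a hypothetical /mnt/nfs/stall_probe.dat) while the cable is unplugged would give roughly the stall length, to within the chosen write interval. Comparing that figure between the v10.10 and v10.14.3 clients would show whether the delay follows the client OS.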


> > From the CTDB log, during failover and failback, the failed node failed
> > to kill the connection with the client, while the recovery node failed
> > to send a 'tickle ACK' to the client to re-establish the connection.
> 
> The first really isn't a problem.  I'm not sure why CTDB attempts to do
> a 2 way kill from the releasing node.  We're going to stop doing that in
> the future.
> 
> The 2nd is a mystery.  Are you sure the network connection on node B was
> up?  This message seems to indicate the network is down:
> 
>   2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
>   2019/02/22 18:01:03.065191 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18

Throughout the tests, we only unplugged node A's network cable.
In the CTDB logs from all the tests (more than 10 runs), this message was logged during every takeover.
Because the takeover succeeded after some minutes, I think node B's network was up when the tickle ACK was sent, but I am not sure.
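
To check how often and when this message appears across runs, the CTDB log can be scanned for the failed tickle ACK lines. The following is a minimal Python sketch, assuming each log entry sits on a single line in the same layout as the two lines quoted above; the name failed_tickles and the regular expression are illustrative and are not part of CTDB.

```python
import re
from datetime import datetime

# Matches entries shaped like the quoted log line:
# "<date> <time> ctdbd[<pid>]: <source>:<line> Failed to send tcp tickle ack for <addr>"
TICKLE_FAILURE = re.compile(
    r"^(?P<stamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"ctdbd\[(?P<pid>\d+)\]: .*Failed to send tcp tickle ack for (?P<addr>\S+)"
)


def failed_tickles(lines):
    """Return (timestamp, client address) for every failed tickle ACK entry."""
    results = []
    for line in lines:
        match = TICKLE_FAILURE.search(line.strip())
        if match:
            stamp = datetime.strptime(match["stamp"], "%Y/%m/%d %H:%M:%S.%f")
            results.append((stamp, match["addr"]))
    return results
```

Passing an open log file to failed_tickles yields one entry per failure, which can then be lined up against the time the cable was unplugged in each run.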

Best Regards

> -----Original Message-----
> From: Martin Schwenke [mailto:martin at meltin.net]
> Sent: Monday, March 4, 2019 1:55 PM
> To: Liu, Dan/刘 丹 <liud.fnst at cn.fujitsu.com>
> Cc: samba at lists.samba.org
> Subject: Re: [Samba] glusterfs + ctdb + nfs-ganesha , unplug the network
> cable of serving node, takes around ~20 mins for IO to resume
> 
> Hi Dan,
> 
> On Mon, 25 Feb 2019 02:43:31 +0000, "Liu, Dan via samba"
> <samba at lists.samba.org> wrote:
> 
> > We did some failover/failback tests on 2 nodes (A and B) with
> > architecture 'glusterfs + ctdb (public address) + nfs-ganesha'.
> >
> > 1st:
> > During write, unplug the network cable of serving node A
> > -> NFS client took a few seconds to recover and continue writing.
> >
> > After some minutes, plug the network cable of serving node A
> > -> NFS client also took a few seconds to recover and continue
> > writing.
> >
> > 2nd:
> > During write, unplug the network cable of serving node A
> > -> NFS client took 20 minutes to recover and continue writing.
> > This recovery time is too slow for clients to accept.
> 
> Definitely!  What was different between "1st" and "2nd"?  Were they testing
> different scenarios?
> 
> > From the CTDB log, during failover and failback, the failed node failed
> > to kill the connection with the client, while the recovery node failed
> > to send a 'tickle ACK' to the client to re-establish the connection.
> 
> The first really isn't a problem.  I'm not sure why CTDB attempts to do
> a 2 way kill from the releasing node.  We're going to stop doing that in
> the future.
> 
> The 2nd is a mystery.  Are you sure the network connection on node B was
> up?  This message seems to indicate the network is down:
> 
>   2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
>   2019/02/22 18:01:03.065191 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
> 
> peace & happiness,
> martin
> 




