[Samba] glusterfs + ctdb + nfs-ganesha: after unplugging the network cable of the serving node, IO takes about 20 mins to resume

Liu, Dan liud.fnst at cn.fujitsu.com
Mon Feb 25 02:43:31 UTC 2019


Hi all

We did some failover/failback tests on 2 nodes (A and B) with the architecture 'glusterfs + ctdb (public address) + nfs-ganesha'.

1st:
During a write, we unplugged the network cable of serving node A.
-> The NFS client took a few seconds to recover and continue writing.

After some minutes, we plugged the network cable of serving node A back in.
-> The NFS client again took a few seconds to recover and continue writing.

2nd:
During a write, we unplugged the network cable of serving node A.
-> The NFS client took 20 minutes to recover and continue writing.
This recovery time is too long for clients to accept.

According to the CTDB log, during failover and failback the failed node could not kill its connection with the client,
while the recovery node failed to send the 'tickle ack' to the client to re-establish the connection.

So the takeover fails during the first 1~3 s.
Why did the fast recovery fail, and why did it take 20 minutes to recover successfully?
Does anyone know the reason?
We are looking forward to your reply. Thanks.

-------------------------------------------------------------------------------------------------------------
The following are some test logs and the configuration.
Node A:
cat /var/log/log.ctdb
2019/02/22 18:00:57.468629 ctdbd[18309]: Release of IP 10.10.11.51/24 on interface eth3  node:1
2019/02/22 18:01:02.132565 ctdbd[18309]: Monitoring event was cancelled
2019/02/22 18:01:02.547046 ctdb-eventd[18310]: 10.interface: Killing TCP connection ::ffff:10.10.11.18:951 ::ffff:10.10.11.51:2049
2019/02/22 18:01:02.547112 ctdb-eventd[18310]: 10.interface: Failed sendto (No route to host)
...
2019/02/22 18:01:02.547259 ctdb-eventd[18310]: 10.interface: Failed sendto (No route to host)
2019/02/22 18:01:02.548458 ctdb-eventd[18310]: 10.interface: Failed to kill TCP connections for IP 10.10.11.51 (1/1 remaining)
2019/02/22 18:01:02.680399 ctdb-eventd[18310]: 60.nfs: method return time=1550829662.675715 sender=:1.1803 -> destination=:1.1819 serial=445 reply_serial=2
2019/02/22 18:01:02.680479 ctdb-eventd[18310]: 60.nfs:    boolean true
2019/02/22 18:01:02.680500 ctdb-eventd[18310]: 60.nfs:    string "Started grace period"
2019/02/22 18:01:03.255313 ctdb-eventd[18310]: 60.nfs: Reconfiguring service "nfs"...
2019/02/22 18:01:03.353830 ctdb-recoverd[18402]: Takeover run completed successfully
2019/02/22 18:01:05.345783 ctdbd[18309]: Starting traverse on DB ctdb.tdb (id 9809)
2019/02/22 18:01:05.348204 ctdbd[18309]: Ending traverse on DB ctdb.tdb (id 9809), records 1
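
To put a number on how long the takeover itself took, the timestamps in the excerpt above can be subtracted. This is only a minimal sketch: `ctdb_log_elapsed` is a hypothetical helper, not part of CTDB, and it assumes the timestamp layout shown in these logs and that both lines fall on the same day.

```shell
# Hypothetical helper (not part of CTDB): print the seconds elapsed between the
# first log line containing the start text and the first later line containing
# the end text. Assumes the log.ctdb layout shown above (date in field 1,
# HH:MM:SS.microseconds in field 2) and that both lines are on the same day.
ctdb_log_elapsed() {
    awk -v s_pat="$2" -v e_pat="$3" '
        # Convert HH:MM:SS.micro into seconds since midnight
        function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        !found && index($0, s_pat) { t0 = secs($2); found = 1; next }
        found && index($0, e_pat)  { printf "%.3f\n", secs($2) - t0; exit }
    ' "$1"
}

# Example: ctdb_log_elapsed /var/log/log.ctdb "Release of IP" "Takeover run completed"
```

Run against the Node A excerpt above with those two patterns, it prints 5.885, that is, about six seconds from the IP release to the end of the takeover run.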

Node B:
cat /var/log/log.ctdb
2019/02/22 18:01:02.699755 ctdbd[29541]: Takeover of IP 10.10.11.51/24 on interface eth3
2019/02/22 18:01:02.701360 ctdbd[29541]: Monitoring event was cancelled
2019/02/22 18:01:03.010811 ctdb-eventd[29542]: 60.nfs: removed ‘/mnt/mgt_vol/grp45/nfs_state/nfs-ganesha/.noderefs/10.10.11.51’
2019/02/22 18:01:03.010896 ctdb-eventd[29542]: 60.nfs: ‘/mnt/mgt_vol/grp45/nfs_state/nfs-ganesha/.noderefs/10.10.11.51’ -> ‘/mnt/mgt_vol/grp45/nfs_state/nfs-ganesha/node-4’
2019/02/22 18:01:03.010922 ctdb-eventd[29542]: 60.nfs: method return time=1550829663.005719 sender=:1.192 -> destination=:1.206 serial=438 reply_serial=2
2019/02/22 18:01:03.010937 ctdb-eventd[29542]: 60.nfs:    boolean true
2019/02/22 18:01:03.010973 ctdb-eventd[29542]: 60.nfs:    string "Started grace period"
2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:03.065191 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:03.303342 ctdb-eventd[29542]: 60.nfs: Reconfiguring service "nfs"...
2019/02/22 18:01:03.347137 ctdb-recoverd[29647]: Reenabling takeover runs
2019/02/22 18:01:04.172108 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:04.172180 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:05.278093 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:05.278159 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:05.389656 ctdbd[29541]: Starting traverse on DB ctdb.tdb (id 6238)
2019/02/22 18:01:05.392182 ctdbd[29541]: Ending traverse on DB ctdb.tdb (id 6238), records 1
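
The two symptoms described earlier (connections not killed on the failed node, tickle acks not sent from the recovery node) can be tallied per log file. The following is a rough sketch only: `count_ctdb_failures` is a hypothetical helper, and the message strings are simply copied from the excerpts in this post.

```shell
# Hypothetical helper (not part of CTDB): count the failure messages that
# appear in the excerpts above. Matching is on fixed strings taken from
# this post's logs, so other CTDB versions may word them differently.
count_ctdb_failures() {
    for msg in "Failed to kill TCP connections" \
               "Failed to send tcp tickle ack" \
               "Failed sendto (No route to host)"; do
        # grep -c still prints 0 when nothing matches
        n=$(grep -c -F -- "$msg" "$1")
        printf '%s: %s\n' "$msg" "$n"
    done
}

# Example: count_ctdb_failures /var/log/log.ctdb
```

For the Node B excerpt above this reports 0 failed connection kills, 3 failed tickle acks and 3 failed sendto calls. The Node A excerpt contains elided lines, so counts taken from it would only be lower bounds.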

cat /etc/sysconfig/ctdb
CTDB_RECOVERY_LOCK=/mnt/mgt_vol/grp45/lockfile
CTDB_PUBLIC_INTERFACE=eth3
CTDB_NODES=/mnt/mgt_vol/grp45/nodes
CTDB_PUBLIC_ADDRESSES=/mnt/mgt_vol/grp45/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_WINBIND=no
CTDB_MANAGES_VSFTP=yes
CTDB_SAMBA_SKIP_SHARE_CHECK=yes
CTDB_MANAGES_NFS=yes
CTDB_NFS_CALLOUT=/etc/ctdb/nfs-ganesha-callout
CTDB_NFS_STATE_FS_TYPE=glusterfs
CTDB_NFS_CHECKS_DIR=/etc/ctdb/nfs-checks-ganesha.d/
CTDB_NFS_STATE_MNT=/mnt/mgt_vol/grp45/nfs_state
CTDB_NFS_SKIP_SHARE_CHECK=yes
CTDB_SET_KeepaliveLimit=1
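
The only tunable changed in this file is KeepaliveLimit, set through the CTDB_SET_ prefix. As a small sketch for anyone comparing configurations, the tunables set this way can be listed as follows; `list_ctdb_set_tunables` is a hypothetical helper and assumes one plain NAME=value assignment per line, as in the file above.

```shell
# Hypothetical helper: list tunables set through CTDB_SET_<name>=<value> lines
# in a sysconfig-style file, printed as "<name> <value>".
list_ctdb_set_tunables() {
    sed -n 's/^CTDB_SET_\([A-Za-z0-9_]*\)=\(.*\)$/\1 \2/p' "$1"
}

# Example: list_ctdb_set_tunables /etc/sysconfig/ctdb
# On a running node, each name can then be compared with: ctdb getvar <name>
```

For the file above this prints a single line, "KeepaliveLimit 1".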

cat /mnt/mgt_vol/grp45/nodes
192.168.100.15 #inner network
192.168.100.14 #inner network

cat /mnt/mgt_vol/grp45/public_addresses
10.10.11.50/24 eth3 #extranet network
10.10.11.51/24 eth3 #extranet network


That is all. Thank you in advance for your help.
--------------------------------------------------
**************************************************
Liu Dan
PF Dept
Nanjing Fujitsu Nanda Software Tech.Co.,Ltd.(FNST)
TEL: +86+25-86630566-8512
FUJITSU INTERNAL: 79955-8512
EMail: liud.fnst at cn.fujitsu.com
**************************************************
--------------------------------------------------




