Error in Setup File Server Cluster with Samba

GiangCoi Mr ltrgiang86 at gmail.com
Wed May 31 08:21:08 UTC 2017


Hi Martin Schwenke, Amitay Isaacs,

This is my diagram (attached as Diagram.jpg).
Routes on File 01:
 - eth1 (10.1.21.84) connects only to SAN1 eth0 (10.1.21.86)
 - eth0 (10.1.21.83) connects to the clients
------------------------
[root@file1 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.1.21.86      10.1.21.84      255.255.255.255 UGH   0      0        0 eth1
10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth1
172.17.2.0      0.0.0.0         255.255.255.0   U     0      0        0 eth2
172.17.3.0      0.0.0.0         255.255.255.0   U     0      0        0 eth3
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 eth1
169.254.0.0     0.0.0.0         255.255.0.0     U     1004   0        0 eth2
169.254.0.0     0.0.0.0         255.255.0.0     U     1005   0        0 eth3
0.0.0.0         10.1.21.1       0.0.0.0         UG    0      0        0 eth0
------------------------------
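Given the two overlapping 10.1.21.0/24 routes on eth0 and eth1 above, it may be
worth confirming which interface the kernel actually picks for the SAN and for
a client address. A quick check (a sketch only, run on File 01; the client
address is the one from the test further below):
------------------------
# Which path does traffic to SAN1 take? (should be eth1, via the /32 host route)
ip route get 10.1.21.86

# Which path does traffic to the client take? (should be eth0, via the default gateway)
ip route get 10.1.31.151
------------------------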

Routes on File 02:
 - eth1 (10.1.21.82) connects only to SAN2 eth0 (10.1.21.87)
 - eth0 (10.1.21.117) connects to the clients
-------------------------
[root@file2 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.1.21.87      10.1.21.82      255.255.255.255 UGH   0      0        0 eth1
10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth1
172.17.2.0      0.0.0.0         255.255.255.0   U     0      0        0 eth2
172.17.3.0      0.0.0.0         255.255.255.0   U     0      0        0 eth3
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 eth1
169.254.0.0     0.0.0.0         255.255.0.0     U     1004   0        0 eth2
169.254.0.0     0.0.0.0         255.255.0.0     U     1005   0        0 eth3
0.0.0.0         10.1.21.1       0.0.0.0         UG    0      0        0 eth0
-----------------------------------
eth1 on File 01 and File 02 connects only to the SAN; clients cannot reach
eth1 on either server.
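To make the CTDB layout explicit, this is how I understand the internode
address list. It is a sketch reconstructed from the ctdb status output further
below, assuming the default /etc/ctdb/nodes location; the real file on the
servers may differ:
------------------------
# /etc/ctdb/nodes -- CTDB internode (private) addresses, one per line.
# Based on the ctdb status below, these appear to be the eth0
# (client-facing) addresses, not the eth1 SAN-only addresses.
10.1.21.83
10.1.21.117
------------------------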

********************************************
In the CTDB configuration file, I set:
CTDB_LOGLEVEL=DEBUG
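For context, the relevant part of the CTDB configuration looks roughly like
this (a sketch only: CTDB_LOGLEVEL is the line I actually changed, the recovery
lock path is taken from the log below, and the file paths are the usual
defaults rather than values I have verified here):
------------------------
# /etc/sysconfig/ctdb (excerpt)
CTDB_LOGLEVEL=DEBUG
CTDB_RECOVERY_LOCK=/data/lock1/lockfile
CTDB_NODES=/etc/ctdb/nodes
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
------------------------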

*********************************************
Client01 (10.1.31.151) was connected to the file server cluster and copying
files through node 0 (File 01: 10.1.21.83). After I ran "ifdown eth0" on
File 01, Client01 lost its connection to the file server and could no longer
copy files.
This is the ctdb status on File 02; the node is BANNED:
------------------------------------
Number of nodes:2
pnn:0 10.1.21.83       DISCONNECTED|UNHEALTHY|INACTIVE
pnn:1 10.1.21.117      BANNED|INACTIVE (THIS NODE)
Generation:INVALID
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:RECOVERY (1)
Recovery master:1
------------------------------------
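Since the ban comes from failing to take the recovery lock on
/data/lock1/lockfile, a lock-coherence test on that GlusterFS mount may be
useful. A sketch only: ping_pong ships with ctdb, the test file name here is
just an example, and the numeric argument is number of nodes + 1:
------------------------------------
# On File 01:
ping_pong /data/lock1/ping_pong.test 3

# On File 02, at the same time:
ping_pong /data/lock1/ping_pong.test 3

# If fcntl byte-range locking is coherent across the cluster file
# system, the reported lock rate should drop sharply once the second
# node starts; if one node never slows down, locking is not coherent.
------------------------------------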

*********************************************
And a new issue: when I ping eth0 on File 01 (10.1.21.83) from my client, some
packets time out, so the connection is not stable. I have to restart the
network to recover.
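In case it helps with the ping instability, these are the kinds of checks that
could show an ARP or duplicate-address problem for 10.1.21.83. A sketch only,
assuming standard arping (iputils) and tcpdump are available on another host
in the 10.1.21.0/24 subnet:
------------------------
# Duplicate Address Detection probe for File 01's eth0 address
# (adjust the interface name to the local one on the probing host).
arping -D -I eth0 -c 3 10.1.21.83

# Watch which MAC answers ARP for that address while pings are dropping.
tcpdump -n -e -i eth0 arp host 10.1.21.83
------------------------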
I have attached the log.ctdb files from both file servers. Please help me fix
this. Thanks so much.

Regards,
Giang


2017-05-31 8:48 GMT+07:00 Martin Schwenke <martin at meltin.net>:

> Hi Giang,
>
> Can you please let us know what CTDB version you're using?
>
> Can you please also run with a higher debug level (as Amitay
> requested) so we get more context for what is happening?
>
> Initial comments:
>
> * It is strange that you're seeing:
>
>   > 2017/05/26 21:47:56.227659 [ 3529]: dead count reached for node 0
>   > 2017/05/26 21:47:56.227721 [ 3529]: 10.1.21.117:4379: node 10.1.21.83:4379 is dead: 0 connected
>
>   when disconnecting the client network.  This should only happen if
>   the internal, private network is disconnected.  Is your diagram
>   accurate?  Is eth1 really a different physical interface?
>
> * The following tells us that glusterfs still seems to be working
>   across both nodes:
>
>   > 2017/05/26 21:47:59.240133 [recoverd: 3720]: server/ctdb_recoverd.c:1765 Starting do_recovery
>   > 2017/05/26 21:47:59.240161 [recoverd: 3720]: Taking out recovery lock from recovery daemon
>   > 2017/05/26 21:47:59.240182 [recoverd: 3720]: Take the recovery lock
>   > 2017/05/26 21:47:59.249344 [recoverd: 3720]: ctdb_recovery_lock: Failed to get recovery lock on '/data/lock1/lockfile'
>   > 2017/05/26 21:47:59.249486 [recoverd: 3720]: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
>   > 2017/05/26 21:47:59.249517 [recoverd: 3720]: Banning node 1 for 300 seconds
>   > 2017/05/26 21:47:59.249727 [ 3529]: Banning this node for 300 seconds
>
> * Sending a TCP tickle ACK in the following context should only happen
>   on the takeover node:
>
>   > 2017/05/26 21:47:56.122063 [ 3942]: server/ctdb_takeover.c:345 Failed to send tcp tickle ack for 10.10.31.151
>
>   The IP addresses "fail back" due to the above ban.
>
> The real question is why CTDB thinks a node goes away when you
> disconnect the public/client network.
>
> peace & happiness,
> martin
>
> On Wed, 31 May 2017 08:19:37 +0700, GiangCoi Mr via samba-technical
> <samba-technical at lists.samba.org> wrote:
>
> > Hi Team
> > Please help me to fix this issue.
> >
> > Regards,
> > Giang
> >
> >
> > 2017-05-30 18:22 GMT+07:00 GiangCoi Mr <ltrgiang86 at gmail.com>:
> >
> > > Hi Amitay Isaacs
> > >
> > > This is log.ctdb on File Server 01 from when I disconnected eth0 on
> > > File 01 while client 10.1.31.151 (on another subnet) was copying files
> > > to the file server:
> > >
> > > ---------------------------
> > > 2017/05/26 21:47:55.991662 [ 3942]: Ending traverse on DB brlock.tdb (id 21785), records 0
> > > 2017/05/26 21:47:56.121928 [ 3942]: common/system_linux.c:364 failed sendto (Network is unreachable)
> > > 2017/05/26 21:47:56.122063 [ 3942]: server/ctdb_takeover.c:345 Failed to send tcp tickle ack for 10.10.31.151
> > > 2017/05/26 21:47:57.234419 [ 3942]: common/system_linux.c:364 failed sendto (Network is unreachable)
> > > 2017/05/26 21:47:57.234542 [ 3942]: server/ctdb_takeover.c:345 Failed to send tcp tickle ack for 10.10.31.151
> > > 2017/05/26 21:48:06.002174 [recoverd: 4129]: The rerecovery timeout has elapsed. We now allow recoveries to trigger again.
> > > 2017/05/26 21:48:20.332193 [ 3942]: Could not find idr:21511
> > > 2017/05/26 21:48:20.332289 [ 3942]: pnn 0 Invalid reqid 21511 in ctdb_reply_control
> > > 2017/05/26 21:48:23.334985 [recoverd: 4129]: server/ctdb_recoverd.c:1139 Election timed out
> > > 2017/05/26 21:48:24.788899 [ 3942]: 10.1.21.83:4379: connected to 10.1.21.117:4379 - 1 connected
> > > 2017/05/26 21:49:25.083127 [ 3942]: Recovery daemon ping timeout. Count : 0
> > > 2017/05/26 21:49:25.083446 [recoverd: 4129]: ctdb_control error: 'ctdb_control timed out'
> > > 2017/05/26 21:49:25.083546 [recoverd: 4129]: ctdb_control error: 'ctdb_control timed out'
> > > 2017/05/26 21:49:25.083579 [recoverd: 4129]: Async operation failed with ret=-1 res=-1 opcode=80
> > > 2017/05/26 21:49:25.083596 [recoverd: 4129]: Async wait failed - fail_count=1
> > > 2017/05/26 21:49:25.083613 [recoverd: 4129]: server/ctdb_recoverd.c:345 Failed to read node capabilities.
> > > 2017/05/26 21:49:25.083631 [recoverd: 4129]: server/ctdb_recoverd.c:3685 Unable to update node capabilities.
> > > ---------------------------------------------------------------------
> > >
> > > And this is log.ctdb on File Server 02:
> > > ------------------------------------------
> > > 2017/05/26 21:47:56.227659 [ 3529]: dead count reached for node 0
> > > 2017/05/26 21:47:56.227721 [ 3529]: 10.1.21.117:4379: node 10.1.21.83:4379 is dead: 0 connected
> > > 2017/05/26 21:47:56.227776 [ 3529]: Tearing down connection to dead node :0
> > > 2017/05/26 21:47:56.227853 [recoverd: 3720]: ctdb_control error: 'node is disconnected'
> > > 2017/05/26 21:47:56.227870 [recoverd: 3720]: ctdb_control error: 'node is disconnected'
> > > 2017/05/26 21:47:56.227887 [recoverd: 3720]: Async operation failed with ret=-1 res=-1 opcode=80
> > > 2017/05/26 21:47:56.227892 [recoverd: 3720]: Async wait failed - fail_count=1
> > > 2017/05/26 21:47:56.227895 [recoverd: 3720]: server/ctdb_recoverd.c:345 Failed to read node capabilities.
> > > 2017/05/26 21:47:56.227900 [recoverd: 3720]: server/ctdb_recoverd.c:3685 Unable to update node capabilities.
> > > 2017/05/26 21:47:56.228857 [recoverd: 3720]: Recmaster node 0 is disconnected. Force reelection
> > > 2017/05/26 21:47:56.228930 [ 3529]: Freeze priority 1
> > > 2017/05/26 21:47:56.229955 [ 3529]: Freeze priority 2
> > > 2017/05/26 21:47:56.230859 [ 3529]: Freeze priority 3
> > > 2017/05/26 21:47:56.231524 [ 3529]: server/ctdb_recover.c:612 Recovery mode set to ACTIVE
> > > 2017/05/26 21:47:56.231828 [ 3529]: This node (1) is now the recovery master
> > > 2017/05/26 21:47:59.236415 [recoverd: 3720]: server/ctdb_recoverd.c:1139 Election timed out
> > > 2017/05/26 21:47:59.240023 [recoverd: 3720]: Node:1 was in recovery mode. Start recovery process
> > > 2017/05/26 21:47:59.240133 [recoverd: 3720]: server/ctdb_recoverd.c:1765 Starting do_recovery
> > > 2017/05/26 21:47:59.240161 [recoverd: 3720]: Taking out recovery lock from recovery daemon
> > > 2017/05/26 21:47:59.240182 [recoverd: 3720]: Take the recovery lock
> > > 2017/05/26 21:47:59.249344 [recoverd: 3720]: ctdb_recovery_lock: Failed to get recovery lock on '/data/lock1/lockfile'
> > > 2017/05/26 21:47:59.249486 [recoverd: 3720]: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
> > > 2017/05/26 21:47:59.249517 [recoverd: 3720]: Banning node 1 for 300 seconds
> > > 2017/05/26 21:47:59.249727 [ 3529]: Banning this node for 300 seconds
> > >
> > >
> > > I read on ctdb.samba.org:
> > >
> > > IP Takeover: When a node in a cluster fails, CTDB will arrange that a
> > > different node takes over the IP address of the failed node to ensure
> > > that the IP addresses for the services provided are always available.
> > >
> > > To speed up the process of IP takeover, and so that clients attached to
> > > a failed node recover as fast as possible, CTDB will automatically
> > > generate gratuitous ARP packets to inform all nodes of the changed MAC
> > > address for that IP. CTDB will also send "tickle ACK" packets to all
> > > attached clients to trigger the clients to immediately recognize that
> > > the TCP connection needs to be re-established and to shortcut any TCP
> > > retransmission timeouts that may be active in the clients.
> > >
> > > My understanding is that CTDB on File Server 02 should be the node
> > > sending the tickle ACK to the client, but in this situation File
> > > Server 01 is the one sending the tickle ACK when its eth0 goes down.
> > >
> > > Regarding your question ("Do you have any firewall on your Cisco
> > > router?"): we don't have any firewall between the two subnets. Thanks
> > > so much.
> > >
> > > Regards,
> > >
> > > Giang
> > >
> > >
> > >
> > >
> > > 2017-05-30 17:42 GMT+07:00 Amitay Isaacs <amitay at gmail.com>:
> > >
> > >>
> > >> On Tue, May 30, 2017 at 4:17 PM, GiangCoi Mr via samba-technical <
> > >> samba-technical at lists.samba.org> wrote:
> > >>
> >  [...]
> > >>
> > >> Can you paste the exact entry from CTDB's log?
> > >>
> > >> Also, set debug level to NOTICE in ctdb configuration.
> > >> CTDB_LOGLEVEL=NOTICE
> > >>
> > >>
> >  [...]
> > >>
> > >> Do you have any firewall on your Cisco router?
> > >>
> > >>
> >  [...]
> > >>
> > >> Amitay.
> > >>
> > >
> > >
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Diagram.jpg
Type: image/jpeg
Size: 54403 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20170531/873e9d92/Diagram.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CTDB Log.rar
Type: application/rar
Size: 1360 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20170531/873e9d92/CTDBLog.rar>

