Error in Setup File Server Cluster with Samba

Martin Schwenke martin at meltin.net
Wed May 31 11:15:12 UTC 2017


Hi Giang,

I can see 2 problems:

1. You don't seem to have a separate private/internal network for
   internal CTDB communications.

   In your original message you said that the CTDB nodes configuration
   was:

     File nodes
     vi /data/lock/nodes
     172.16.0.1
     172.16.0.2

   In your latest diagram I don't see a separate interface for this
   network.  It looks like traffic between these addresses is going
   via the default route, which uses eth0.  If you take eth0 down then
   CTDB cannot communicate between the 2 nodes.

2. "ifdown eth0" is not a valid test.

   This does not represent a real fault that is likely to occur.  CTDB
   tests for link on an interface (using ethtool), because link failure
   is the most likely fault.  In fact, CTDB forces the interface state
   up when monitoring an interface (using "ip link set up dev X"), so
   this does not restore the addresses removed by "ifdown X".
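
   For example (assuming your public interface is eth0), this shows the
   link status that CTDB's monitoring actually checks:

     # CTDB uses ethtool's link status when monitoring an interface
     ethtool eth0 | grep 'Link detected'

   To test a realistic failure, unplug the cable (or, for a virtual
   machine, disconnect the virtual NIC at the hypervisor or switch)
   instead of running "ifdown".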

I also find the overlap between the networks on eth0 and eth1
confusing: both interfaces have addresses in 10.1.21.0/24, so each node
has two routes to that subnet.  I guess that this might not cause
problems...
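
If you are able to renumber, something like this (hypothetical
addresses, sketched for File 01 only) would remove the overlap by
putting the SAN link on its own small subnet:

  # replace eth1's 10.1.21.x address with a /30 used only for the SAN
  ip addr flush dev eth1
  ip addr add 10.99.0.1/30 dev eth1    # SAN1 would become 10.99.0.2/30

That would leave a single route to 10.1.21.0/24, via eth0.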

I hope this helps...

peace & happiness,
martin

On Wed, 31 May 2017 15:21:08 +0700, GiangCoi Mr <ltrgiang86 at gmail.com>
wrote:

> Hi Martin Schwenke, Amitay Isaacs
> 
> This is my diagram
> 
> Route in File 01:
>  - eth1 (10.1.21.84) connects only to SAN1 eth0 (10.1.21.86)
>  - eth0 (10.1.21.83) connects to the clients
> ------------------------
> [root at file1 ~]# route -n
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 10.1.21.86      10.1.21.84      255.255.255.255 UGH   0      0        0 eth1
> 10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
> 10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth1
> 172.17.2.0      0.0.0.0         255.255.255.0   U     0      0        0 eth2
> 172.17.3.0      0.0.0.0         255.255.255.0   U     0      0        0 eth3
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 eth1
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1004   0        0 eth2
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1005   0        0 eth3
> 0.0.0.0         10.1.21.1       0.0.0.0         UG    0      0        0 eth0
> ------------------------------
> 
> Route in File 02:
>  - eth1 (10.1.21.82) connects only to SAN2 eth0 (10.1.21.87)
>  - eth0 (10.1.21.117) connects to the clients
> -------------------------
> [root at file2 ~]# route -n
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 10.1.21.87      10.1.21.82      255.255.255.255 UGH   0      0        0 eth1
> 10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
> 10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth1
> 172.17.2.0      0.0.0.0         255.255.255.0   U     0      0        0 eth2
> 172.17.3.0      0.0.0.0         255.255.255.0   U     0      0        0 eth3
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 eth1
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1004   0        0 eth2
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1005   0        0 eth3
> 0.0.0.0         10.1.21.1       0.0.0.0         UG    0      0        0 eth0
> -----------------------------------
> eth1 on File 01 and File 02 connects only to the SAN; clients cannot
> connect to eth1 on either server.
> 
> ********************************************
> In the ctdb configuration file I set:
> CTDB_LOGLEVEL=DEBUG
> 
> *********************************************
> Client01 (10.1.31.151) connects to the file server cluster and copies
> files through node 0 (File 01: 10.1.21.83). After I run "ifdown eth0"
> on File 01, Client01 is disconnected from the file server and can no
> longer copy files.
> This is the ctdb status on File 02; this node is BANNED:
> ------------------------------------
> Number of nodes:2
> pnn:0 10.1.21.83       DISCONNECTED|UNHEALTHY|INACTIVE
> pnn:1 10.1.21.117      BANNED|INACTIVE (THIS NODE)
> Generation:INVALID
> Size:2
> hash:0 lmaster:0
> hash:1 lmaster:1
> Recovery mode:RECOVERY (1)
> Recovery master:1
> ------------------------------------
> 
> *********************************************
> And a new issue: when I ping eth0 on File 01 (10.1.21.83) from my
> client, some packets time out --> it is not stable. I must restart the
> network.
> I attach the two log.ctdb files from both file servers. Please help me
> fix this. Thanks so much.
> 
> Regards,
> Giang
> 
> 
> 2017-05-31 8:48 GMT+07:00 Martin Schwenke <martin at meltin.net>:
> 
> > Hi Giang,
> >
> > Can you please let us know what CTDB version you're using?
> >
> > Can you please also run with a higher debug level (as Amitay
> > requested) so we get more context for what is happening?
> >
> > Initial comments:
> >
> > * It is strange that you're seeing:
> >  
> >   > 2017/05/26 21:47:56.227659 [ 3529]: dead count reached for node 0
> >   > 2017/05/26 21:47:56.227721 [ 3529]: 10.1.21.117:4379: node 10.1.21.83:4379 is dead: 0 connected
> >
> >   when disconnecting the client network.  This should only happen if
> >   the internal, private network is disconnected.  Is your diagram accurate?  Is
> >   eth1 really a different physical interface?
> >
> > * The following tells us that glusterfs still seems to be working
> >   across both nodes:
> >  
> >   > 2017/05/26 21:47:59.240133 [recoverd: 3720]: server/ctdb_recoverd.c:1765 Starting do_recovery
> >   > 2017/05/26 21:47:59.240161 [recoverd: 3720]: Taking out recovery lock from recovery daemon
> >   > 2017/05/26 21:47:59.240182 [recoverd: 3720]: Take the recovery lock
> >   > 2017/05/26 21:47:59.249344 [recoverd: 3720]: ctdb_recovery_lock: Failed to get recovery lock on '/data/lock1/lockfile'
> >   > 2017/05/26 21:47:59.249486 [recoverd: 3720]: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
> >   > 2017/05/26 21:47:59.249517 [recoverd: 3720]: Banning node 1 for 300 seconds
> >   > 2017/05/26 21:47:59.249727 [ 3529]: Banning this node for 300 seconds
> >
> > * Sending a TCP tickle ACK in the following context should only happen
> >   on the takeover node:
> >  
> >   > 2017/05/26 21:47:56.122063 [ 3942]: server/ctdb_takeover.c:345 Failed to send tcp tickle ack for 10.10.31.151
> >
> >   The IP address "fails back" due to the above ban.
> >
> > The real question is why CTDB thinks a node goes away when you
> > disconnect the public/client network.
> >
> > peace & happiness,
> > martin
> >
> > On Wed, 31 May 2017 08:19:37 +0700, GiangCoi Mr via samba-technical
> > <samba-technical at lists.samba.org> wrote:
> >  
> > > Hi Team
> > > Please help me to fix this issue.
> > >
> > > Regards,
> > > Giang
> > >
> > >
> > > 2017-05-30 18:22 GMT+07:00 GiangCoi Mr <ltrgiang86 at gmail.com>:
> > >  
> > >  [...]



