Error in Setup File Server Cluster with Samba

Wed May 31 13:12:25 UTC 2017

Dear Martin,

In my diagram I forgot to show eth2 for CTDB and eth3 for GlusterFS, but in
the real setup I use a private network (172.16.0.0/24) for CTDB.

Because I am using VMware Workstation to test the file cluster, the only way
I have to test is "ifdown eth0". I will test this situation on the real
network at my company. By the way, can you give me instructions for
integrating the Samba file cluster with Windows Active Directory for user
authentication? Thanks so much.

Regards,
Giang

2017-05-31 18:15 GMT+07:00 Martin Schwenke <martin at meltin.net>:

> Hi Giang,
>
> I can see 2 problems:
>
> 1. You don't seem to have a separate private/internal network for
>    internal CTDB communications.
>
>    In your original message you said that the CTDB nodes configuration
>    was:
>
>      File nodes
>      vi /data/lock/nodes
>      172.16.0.1
>      172.16.0.2
>
>    In your latest diagram I don't see any special interfaces for these
>    networks.  It looks like traffic between these addresses is going
>    via the default route, which uses eth0.  If you take eth0 down then
>    CTDB can not communicate between the 2 nodes.
>
> 2. "ifdown eth0" is not a valid test.
>
>    This does not represent a real fault that is likely to occur.  CTDB
>    tests for link on an interface (using ethtool), because link failure
>    is the most likely fault.  In fact, CTDB forces the interface state
>    up when monitoring an interface (using "ip link set up dev X", so
>    this does not configure addresses removed by "ifdown X").
>
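Both points above can be checked from the command line: for each address in the nodes file, which interface would the kernel use, and does that interface have link? The following is only a minimal sketch, assuming iproute2 and ethtool are installed. The helper names route_dev, link_state and check_ctdb_nodes are made up for illustration and are not part of CTDB.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only (not part of CTDB).

# Print the interface that follows "dev" in `ip route get` output (stdin).
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Print the "Link detected" value (yes/no) from `ethtool IFACE` output (stdin).
link_state() {
    awk -F': ' '/Link detected/ { print $2; exit }'
}

# For every address in a nodes file, report the outgoing interface and
# whether that interface currently has link.
check_ctdb_nodes() {
    nodes_file=$1
    while read -r addr; do
        [ -n "$addr" ] || continue
        dev=$(ip route get "$addr" | route_dev)
        link=$(ethtool "$dev" | link_state)
        echo "$addr via $dev (link: $link)"
    done < "$nodes_file"
}
```

Run as root with, for example, `check_ctdb_nodes /data/lock/nodes`. If a private address is reported via eth0, the internal CTDB traffic is following the client network. The node's own address is normally reported via lo.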
> I also find the overlap in the networks between eth0 and eth1 to be
> confusing.  I guess that this might not cause problems...
>
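The overlap mentioned above (the same network appearing on two interfaces) can be spotted mechanically. This is a rough sketch, assuming the classic `route -n` column layout with the interface in the eighth column. The function name overlapping_routes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list destination/netmask pairs that appear on more
# than one interface, reading `route -n` output on stdin.
overlapping_routes() {
    awk '$1 ~ /^[0-9]/ && NF >= 8 {
             key = $1 "/" $3                  # destination/genmask
             ifaces[key] = ifaces[key] " " $8 # remember each interface
             count[key]++
         }
         END { for (k in count) if (count[k] > 1) print k ":" ifaces[k] }'
}
```

Usage: `route -n | overlapping_routes`. Link-local 169.254.0.0 entries that are present on several interfaces are reported as well, and the order of the reported networks is not defined.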
> I hope this helps...
>
> peace & happiness,
> martin
>
> On Wed, 31 May 2017 15:21:08 +0700, GiangCoi Mr <ltrgiang86 at gmail.com>
> wrote:
>
> > Hi Martin Schwenke, Amitay Isaacs
> >
> > This is my diagram
> >
> > 
> > Route in File 01:
> >  - eth1 (10.1.21.84) connects only to SAN1 eth0 (10.1.21.86)
> >  - eth0 (10.1.21.83) connects to the clients
> > ------------------------
> > [root@file1 ~]# route -n
> > Kernel IP routing table
> > Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> > *10.1.21.86      10.1.21.84      255.255.255.255 UGH   0      0        0 eth1*
> > 10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
> > 10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth1
> > 172.17.2.0      0.0.0.0         255.255.255.0   U     0      0        0 eth2
> > 172.17.3.0      0.0.0.0         255.255.255.0   U     0      0        0 eth3
> > 169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
> > 169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 eth1
> > 169.254.0.0     0.0.0.0         255.255.0.0     U     1004   0        0 eth2
> > 169.254.0.0     0.0.0.0         255.255.0.0     U     1005   0        0 eth3
> > *0.0.0.0         10.1.21.1       0.0.0.0         UG    0      0        0 eth0*
> > ------------------------------
> >
> > Route in File 02:
> >  - eth1 (10.1.21.82) connects only to SAN2 eth0 (10.1.21.87)
> >  - eth0 (10.1.21.117) connects to the clients
> > -------------------------
> > [root@file2 ~]# route -n
> > Kernel IP routing table
> > Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> > *10.1.21.87      10.1.21.82      255.255.255.255 UGH   0      0        0 eth1*
> > 10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
> > 10.1.21.0       0.0.0.0         255.255.255.0   U     0      0        0 eth1
> > 172.17.2.0      0.0.0.0         255.255.255.0   U     0      0        0 eth2
> > 172.17.3.0      0.0.0.0         255.255.255.0   U     0      0        0 eth3
> > 169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
> > 169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 eth1
> > 169.254.0.0     0.0.0.0         255.255.0.0     U     1004   0        0 eth2
> > 169.254.0.0     0.0.0.0         255.255.0.0     U     1005   0        0 eth3
> > *0.0.0.0         10.1.21.1       0.0.0.0         UG    0      0        0 eth0*
> > -----------------------------------
> > eth1 on File 01 and File 02 connects only to the SAN; clients cannot
> > connect to eth1 on either server.
> >
> > ********************************************
> > In file ctdb, I configured
> > *CTDB_LOGLEVEL=DEBUG*
> >
> > *********************************************
> > Client01 (10.1.31.151) was connected to the file server cluster through
> > node 0 (File 01: 10.1.21.83) and was copying files. After I ran
> > "*ifdown eth0*" on File 01, Client01 was disconnected from the file
> > server and could no longer copy files.
> > This is the ctdb status on File 02; the node is BANNED:
> > ------------------------------------
> > Number of nodes:2
> > pnn:0 10.1.21.83       DISCONNECTED|UNHEALTHY|INACTIVE
> > pnn:1 10.1.21.117      *BANNED*|*INACTIVE (THIS NODE)*
> > Generation:INVALID
> > Size:2
> > hash:0 lmaster:0
> > hash:1 lmaster:1
> > Recovery mode:RECOVERY (1)
> > Recovery master:1
> > ------------------------------------
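A status like the one above can also be checked from a script. This is only a sketch, assuming the `ctdb status` line layout shown in this message ("pnn:N ADDRESS FLAGS"). The function name not_ok_nodes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the PNN and flags of every node whose state
# is not OK, reading `ctdb status` output on stdin.
not_ok_nodes() {
    awk '/^pnn:/ && $3 != "OK" { print $1, $3 }'
}
```

Usage: `ctdb status | not_ok_nodes`. It prints nothing when every node is OK.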
> >
> > *********************************************
> > And a new issue: when I ping eth0 on File 01 (10.1.21.83) from my
> > client, some packets time out, so the connection is not stable. I have
> > to restart the network.
> > I have attached the log.ctdb files from both file servers. Please help
> > me fix it.
> > Thanks so much
> >
> > Regards,
> > Giang
> >
> >
> > 2017-05-31 8:48 GMT+07:00 Martin Schwenke <martin at meltin.net>:
> >
> > > Hi Giang,
> > >
> > > Can you please let us know what CTDB version you're using?
> > >
> > > Can you please also run with a higher debug level (as Amitay
> > > requested) so we get more context for what is happening?
> > >
> > > Initial comments:
> > >
> > > * It is strange that you're seeing:
> > >
> > >   > 2017/05/26 21:47:56.227659 [ 3529]: dead count reached for node 0
> > >   > 2017/05/26 21:47:56.227721 [ 3529]: 10.1.21.117:4379: node 10.1.21.83:4379 is dead: 0 connected
> > >
> > >   when disconnecting the client network.  This should only happen if
> > >   the internal, private network is disconnected.  Is your diagram
> > >   accurate?  Is eth1 really a different physical interface?
> > >
> > > * The following tells us that glusterfs still seems to be working
> > >   across both nodes:
> > >
> > >   > 2017/05/26 21:47:59.240133 [recoverd: 3720]: server/ctdb_recoverd.c:1765 Starting do_recovery
> > >   > 2017/05/26 21:47:59.240161 [recoverd: 3720]: Taking out recovery lock from recovery daemon
> > >   > 2017/05/26 21:47:59.240182 [recoverd: 3720]: Take the recovery lock
> > >   > 2017/05/26 21:47:59.249344 [recoverd: 3720]: ctdb_recovery_lock: Failed to get recovery lock on '/data/lock1/lockfile'
> > >   > 2017/05/26 21:47:59.249486 [recoverd: 3720]: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
> > >   > 2017/05/26 21:47:59.249517 [recoverd: 3720]: Banning node 1 for 300 seconds
> > >   > 2017/05/26 21:47:59.249727 [ 3529]: Banning this node for 300 seconds
> > >
> > > * Sending a TCP tickle ACK in the following context should only happen
> > >   on the takeover node:
> > >
> > >   > *2017/05/26 21:47:56.122063 [ 3942]: server/ctdb_takeover.c:345 Failed to send tcp tickle ack for 10.10.31.151*
> > >
> > >   The IP addresses "fail back" due to the above ban.
> > >
> > > The real question is why CTDB thinks a node goes away when you
> > > disconnect the public/client network.
> > >
> > > peace & happiness,
> > > martin
> > >
> > > On Wed, 31 May 2017 08:19:37 +0700, GiangCoi Mr via samba-technical
> > > <samba-technical at lists.samba.org> wrote:
> > >
> > > > Hi Team
> > > > Please help me to fix this issue.
> > > >
> > > > Regards,
> > > > Giang
> > > >
> > > >
> > > > 2017-05-30 18:22 GMT+07:00 GiangCoi Mr <ltrgiang86 at gmail.com>:
> > > >
> > > > > [...]