Error in Setup File Server Cluster with Samba

Martin Schwenke martin at meltin.net
Wed May 31 01:48:23 UTC 2017


Hi Giang,

Can you please let us know what CTDB version you're using? 
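
For example, running this on either node should print it (assuming the
ctdb command-line tool is in your PATH):

  ctdb version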

Can you please also run with a higher debug level (as Amitay
requested) so we get more context for what is happening?
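
For example (the exact config file location varies between distros and
CTDB versions, so treat this as a sketch):

  # Persistently, in CTDB's sysconfig/defaults file
  # (e.g. /etc/sysconfig/ctdb or /etc/default/ctdb):
  CTDB_LOGLEVEL=NOTICE      # or INFO/DEBUG for even more detail

  # Or on a running node, without a restart:
  ctdb setdebug NOTICE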

Initial comments:

* It is strange that you're seeing:

  > 2017/05/26 21:47:56.227659 [ 3529]: dead count reached for node 0
  > 2017/05/26 21:47:56.227721 [ 3529]: 10.1.21.117:4379: node 10.1.21.83:4379
  > is dead: 0 connected

  when disconnecting the client network.  This should only happen if
  the internal, private network is disconnected.  Is your diagram
  accurate?  Is eth1 really a different physical interface?  (See the
  quick checks sketched after these comments.)

* The following tells us that glusterfs still seems to be working
  across both nodes (node 1 can still reach the lock file on the shared
  volume but can't take the lock, presumably because node 0 still holds
  it):

  > 2017/05/26 21:47:59.240133 [recoverd: 3720]: server/ctdb_recoverd.c:1765
  > Starting do_recovery
  > 2017/05/26 21:47:59.240161 [recoverd: 3720]: Taking out recovery lock from
  > recovery daemon
  > 2017/05/26 21:47:59.240182 [recoverd: 3720]: Take the recovery lock
  > 2017/05/26 21:47:59.249344 [recoverd: 3720]: ctdb_recovery_lock: Failed to
  > get recovery lock on '/data/lock1/lockfile'
  > 2017/05/26 21:47:59.249486 [recoverd: 3720]: Unable to get recovery lock -
  > aborting recovery and ban ourself for 300 seconds
  > 2017/05/26 21:47:59.249517 [recoverd: 3720]: Banning node 1 for 300 seconds
  > 2017/05/26 21:47:59.249727 [ 3529]: Banning this node for 300 seconds

* Sending a TCP tickle ACK in the following context should only happen
  on the takeover node:

  > *2017/05/26 21:47:56.122063 [ 3942]: server/ctdb_takeover.c:345 Failed to
  > send tcp tickle ack for 10.10.31.151*

  The IP addresses "fail back" due to the above ban.
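
A few quick checks would help confirm the above (a sketch only: the
/etc/ctdb paths are the usual defaults and the lockfile path is taken
from your logs, so adjust as needed):

  # Private (cluster-internal) addresses CTDB uses between the nodes:
  cat /etc/ctdb/nodes

  # Public (client-facing) addresses and the interfaces they are
  # assigned to:
  cat /etc/ctdb/public_addresses

  # Confirm which physical interface actually carries each address:
  ip addr show eth0
  ip addr show eth1

  # Where CTDB thinks the recovery lock lives (getreclock may not be
  # available on very old versions), and whether the glusterfs mount
  # holding it is still reachable from each node:
  ctdb getreclock
  ls -l /data/lock1/lockfile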

The real question is why CTDB thinks a node goes away when you
disconnect the public/client network.
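
One way to narrow that down (again only a sketch; substitute whatever
interface your private addresses are actually on) is to watch CTDB's
node-to-node traffic on the private interface while you disconnect the
public one:

  # CTDB inter-node traffic uses TCP port 4379, as in your logs:
  tcpdump -ni eth1 tcp port 4379

If that traffic stops when you pull the eth0 cable, then the private
network is not really isolated on eth1.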

peace & happiness,
martin

On Wed, 31 May 2017 08:19:37 +0700, GiangCoi Mr via samba-technical
<samba-technical at lists.samba.org> wrote:

> Hi Team
> Please help me to fix this issue.
> 
> Regards,
> Giang
> 
> 
> 2017-05-30 18:22 GMT+07:00 GiangCoi Mr <ltrgiang86 at gmail.com>:
> 
> > Hi Amitay Isaacs
> >
> > This is log.ctdb on File Server 01 when I disconnect eth0 on File Server
> > 01 while Client 10.1.31.151 (on another subnet) is copying files to the
> > File Server.
> >
> > ---------------------------
> > 2017/05/26 21:47:55.991662 [ 3942]: Ending traverse on DB brlock.tdb (id
> > 21785), records 0
> > 2017/05/26 21:47:56.121928 [ 3942]: common/system_linux.c:364 failed
> > sendto (Network is unreachable)
> > *2017/05/26 21:47:56.122063 [ 3942]: server/ctdb_takeover.c:345 Failed to
> > send tcp tickle ack for 10.10.31.151*
> > *2017/05/26 21:47:57.234419 [ 3942]: common/system_linux.c:364 failed
> > sendto (Network is unreachable)*
> > *2017/05/26 21:47:57.234542 [ 3942]: server/ctdb_takeover.c:345 Failed to
> > send tcp tickle ack for 10.10.31.151*
> > 2017/05/26 21:48:06.002174 [recoverd: 4129]: The rerecovery timeout has
> > elapsed. We now allow recoveries to trigger again.
> > 2017/05/26 21:48:20.332193 [ 3942]: Could not find idr:21511
> > 2017/05/26 21:48:20.332289 [ 3942]: pnn 0 Invalid reqid 21511 in
> > ctdb_reply_control
> > 2017/05/26 21:48:23.334985 [recoverd: 4129]: server/ctdb_recoverd.c:1139
> > Election timed out
> > 2017/05/26 21:48:24.788899 [ 3942]: 10.1.21.83:4379: connected to
> > 10.1.21.117:4379 - 1 connected
> > 2017/05/26 21:49:25.083127 [ 3942]: Recovery daemon ping timeout. Count : 0
> > 2017/05/26 21:49:25.083446 [recoverd: 4129]: ctdb_control error:
> > 'ctdb_control timed out'
> > 2017/05/26 21:49:25.083546 [recoverd: 4129]: ctdb_control error:
> > 'ctdb_control timed out'
> > 2017/05/26 21:49:25.083579 [recoverd: 4129]: Async operation failed with
> > ret=-1 res=-1 opcode=80
> > 2017/05/26 21:49:25.083596 [recoverd: 4129]: Async wait failed -
> > fail_count=1
> > 2017/05/26 21:49:25.083613 [recoverd: 4129]: server/ctdb_recoverd.c:345
> > Failed to read node capabilities.
> > 2017/05/26 21:49:25.083631 [recoverd: 4129]: server/ctdb_recoverd.c:3685
> > Unable to update node capabilities.
> > ------------------------------------------------------------
> > ---------------------
> >
> > And this is log.ctdb on File Server 02
> > ------------------------------------------
> > 2017/05/26 21:47:56.227659 [ 3529]: dead count reached for node 0
> > 2017/05/26 21:47:56.227721 [ 3529]: 10.1.21.117:4379: node 10.1.21.83:4379
> > is dead: 0 connected
> > 2017/05/26 21:47:56.227776 [ 3529]: Tearing down connection to dead node :0
> > 2017/05/26 21:47:56.227853 [recoverd: 3720]: ctdb_control error: 'node is
> > disconnected'
> > 2017/05/26 21:47:56.227870 [recoverd: 3720]: ctdb_control error: 'node is
> > disconnected'
> > 2017/05/26 21:47:56.227887 [recoverd: 3720]: Async operation failed with
> > ret=-1 res=-1 opcode=80
> > 2017/05/26 21:47:56.227892 [recoverd: 3720]: Async wait failed -
> > fail_count=1
> > 2017/05/26 21:47:56.227895 [recoverd: 3720]: server/ctdb_recoverd.c:345
> > Failed to read node capabilities.
> > 2017/05/26 21:47:56.227900 [recoverd: 3720]: server/ctdb_recoverd.c:3685
> > Unable to update node capabilities.
> > 2017/05/26 21:47:56.228857 [recoverd: 3720]: Recmaster node 0 is
> > disconnected. Force reelection
> > 2017/05/26 21:47:56.228930 [ 3529]: Freeze priority 1
> > 2017/05/26 21:47:56.229955 [ 3529]: Freeze priority 2
> > 2017/05/26 21:47:56.230859 [ 3529]: Freeze priority 3
> > 2017/05/26 21:47:56.231524 [ 3529]: server/ctdb_recover.c:612 Recovery
> > mode set to ACTIVE
> > 2017/05/26 21:47:56.231828 [ 3529]: This node (1) is now the recovery
> > master
> > 2017/05/26 21:47:59.236415 [recoverd: 3720]: server/ctdb_recoverd.c:1139
> > Election timed out
> > 2017/05/26 21:47:59.240023 [recoverd: 3720]: Node:1 was in recovery mode.
> > Start recovery process
> > 2017/05/26 21:47:59.240133 [recoverd: 3720]: server/ctdb_recoverd.c:1765
> > Starting do_recovery
> > 2017/05/26 21:47:59.240161 [recoverd: 3720]: Taking out recovery lock from
> > recovery daemon
> > 2017/05/26 21:47:59.240182 [recoverd: 3720]: Take the recovery lock
> > 2017/05/26 21:47:59.249344 [recoverd: 3720]: ctdb_recovery_lock: Failed to
> > get recovery lock on '/data/lock1/lockfile'
> > 2017/05/26 21:47:59.249486 [recoverd: 3720]: Unable to get recovery lock -
> > aborting recovery and ban ourself for 300 seconds
> > 2017/05/26 21:47:59.249517 [recoverd: 3720]: Banning node 1 for 300 seconds
> > 2017/05/26 21:47:59.249727 [ 3529]: Banning this node for 300 seconds
> >
> >
> > I read on ctdb.samba.org:
> >
> > IP Takeover: When a node in a cluster fails, CTDB will arrange that a
> > different node takes over the IP address of the failed node to ensure that
> > the IP addresses for the services provided are always available.
> >
> > To speed up the process of IP takeover, and so that clients attached to a
> > failed node recover as fast as possible, CTDB will automatically generate
> > gratuitous ARP packets to inform all nodes of the changed MAC address for
> > that IP. CTDB will also send "tickle ACK" packets to all attached clients
> > to trigger the clients to immediately recognize that the TCP connection
> > needs to be re-established and to shortcut any TCP retransmission timeouts
> > that may be active in the clients.
> >
> > I guess CTDB on File Server 02 should be the one sending the tickle ACK to
> > the client, but in this situation File Server 01 sends the tickle ACK when
> > eth0 on File Server 01 goes down.
> >
> > And your question: Do you have any firewall on your Cisco router?
> >
> > We don't have any firewall between the 2 subnets. Thanks so much.
> >
> > Regards,
> >
> > Giang
> >
> >
> >
> >
> > 2017-05-30 17:42 GMT+07:00 Amitay Isaacs <amitay at gmail.com>:
> >  
> >>
> >> On Tue, May 30, 2017 at 4:17 PM, GiangCoi Mr via samba-technical <
> >> samba-technical at lists.samba.org> wrote:
> >>  
>  [...]  
> >>
> >> Can you paste the exact entry from CTDB's log?
> >>
> >> Also, set debug level to NOTICE in ctdb configuration.
> >> CTDB_LOGLEVEL=NOTICE
> >>
> >>  
>  [...]  
> >>
> >> Do you have any firewall on your Cisco router?
> >>
> >>  
>  [...]  
> >>
> >> Amitay.
> >>  
> >
> >  



