Error in Setup File Server Cluster with Samba

Tue May 30 11:22:38 UTC 2017

Hi Amitay Isaacs

This is log.ctdb on File Server 01, captured when I disconnected eth0 on
File Server 01 while client 10.1.31.151 (in another subnet) was copying
files to the file server:

---------------------------
2017/05/26 21:47:55.991662 [ 3942]: Ending traverse on DB brlock.tdb (id
21785), records 0
2017/05/26 21:47:56.121928 [ 3942]: common/system_linux.c:364 failed sendto
(Network is unreachable)
*2017/05/26 21:47:56.122063 [ 3942]: server/ctdb_takeover.c:345 Failed to
send tcp tickle ack for 10.10.31.151*
*2017/05/26 21:47:57.234419 [ 3942]: common/system_linux.c:364 failed
sendto (Network is unreachable)*
*2017/05/26 21:47:57.234542 [ 3942]: server/ctdb_takeover.c:345 Failed to
send tcp tickle ack for 10.10.31.151*
2017/05/26 21:48:06.002174 [recoverd: 4129]: The rerecovery timeout has
elapsed. We now allow recoveries to trigger again.
2017/05/26 21:48:20.332193 [ 3942]: Could not find idr:21511
2017/05/26 21:48:20.332289 [ 3942]: pnn 0 Invalid reqid 21511 in
ctdb_reply_control
2017/05/26 21:48:23.334985 [recoverd: 4129]: server/ctdb_recoverd.c:1139
Election timed out
2017/05/26 21:48:24.788899 [ 3942]: 10.1.21.83:4379: connected to
10.1.21.117:4379 - 1 connected
2017/05/26 21:49:25.083127 [ 3942]: Recovery daemon ping timeout. Count : 0
2017/05/26 21:49:25.083446 [recoverd: 4129]: ctdb_control error:
'ctdb_control timed out'
2017/05/26 21:49:25.083546 [recoverd: 4129]: ctdb_control error:
'ctdb_control timed out'
2017/05/26 21:49:25.083579 [recoverd: 4129]: Async operation failed with
ret=-1 res=-1 opcode=80
2017/05/26 21:49:25.083596 [recoverd: 4129]: Async wait failed -
fail_count=1
2017/05/26 21:49:25.083613 [recoverd: 4129]: server/ctdb_recoverd.c:345
Failed to read node capabilities.
2017/05/26 21:49:25.083631 [recoverd: 4129]: server/ctdb_recoverd.c:3685
Unable to update node capabilities.
---------------------------------------------------------------------------------

And this is log.ctdb on File Server 02:
------------------------------------------
2017/05/26 21:47:56.227659 [ 3529]: dead count reached for node 0
2017/05/26 21:47:56.227721 [ 3529]: 10.1.21.117:4379: node 10.1.21.83:4379
is dead: 0 connected
2017/05/26 21:47:56.227776 [ 3529]: Tearing down connection to dead node :0
2017/05/26 21:47:56.227853 [recoverd: 3720]: ctdb_control error: 'node is
disconnected'
2017/05/26 21:47:56.227870 [recoverd: 3720]: ctdb_control error: 'node is
disconnected'
2017/05/26 21:47:56.227887 [recoverd: 3720]: Async operation failed with
ret=-1 res=-1 opcode=80
2017/05/26 21:47:56.227892 [recoverd: 3720]: Async wait failed -
fail_count=1
2017/05/26 21:47:56.227895 [recoverd: 3720]: server/ctdb_recoverd.c:345
Failed to read node capabilities.
2017/05/26 21:47:56.227900 [recoverd: 3720]: server/ctdb_recoverd.c:3685
Unable to update node capabilities.
2017/05/26 21:47:56.228857 [recoverd: 3720]: Recmaster node 0 is
disconnected. Force reelection
2017/05/26 21:47:56.228930 [ 3529]: Freeze priority 1
2017/05/26 21:47:56.229955 [ 3529]: Freeze priority 2
2017/05/26 21:47:56.230859 [ 3529]: Freeze priority 3
2017/05/26 21:47:56.231524 [ 3529]: server/ctdb_recover.c:612 Recovery mode
set to ACTIVE
2017/05/26 21:47:56.231828 [ 3529]: This node (1) is now the recovery master
2017/05/26 21:47:59.236415 [recoverd: 3720]: server/ctdb_recoverd.c:1139
Election timed out
2017/05/26 21:47:59.240023 [recoverd: 3720]: Node:1 was in recovery mode.
Start recovery process
2017/05/26 21:47:59.240133 [recoverd: 3720]: server/ctdb_recoverd.c:1765
Starting do_recovery
2017/05/26 21:47:59.240161 [recoverd: 3720]: Taking out recovery lock from
recovery daemon
2017/05/26 21:47:59.240182 [recoverd: 3720]: Take the recovery lock
2017/05/26 21:47:59.249344 [recoverd: 3720]: ctdb_recovery_lock: Failed to
get recovery lock on '/data/lock1/lockfile'
2017/05/26 21:47:59.249486 [recoverd: 3720]: Unable to get recovery lock -
aborting recovery and ban ourself for 300 seconds
2017/05/26 21:47:59.249517 [recoverd: 3720]: Banning node 1 for 300 seconds
2017/05/26 21:47:59.249727 [ 3529]: Banning this node for 300 seconds
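
Because the mail client wrapped the log lines above, here is a minimal Python sketch that rejoins wrapped log.ctdb entries and keeps only those that mention the tickle ACK, the recovery lock, or a ban. It assumes every entry starts with a timestamp in the form shown in the logs. The names `join_wrapped`, `select` and `SAMPLE` are only illustrative and are not part of CTDB.

```python
import re

# A ctdb log entry starts with a timestamp such as "2017/05/26 21:47:56.121928".
# Some lines in this mail carry a leading "*" from bold markup, so allow that too.
ENTRY_START = re.compile(r"^\*?\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+ ")


def join_wrapped(lines):
    """Merge continuation lines into the log entry they belong to."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    # Drop the "*" emphasis markers around highlighted entries.
    return [e.strip("*") for e in entries]


def select(entries, keywords):
    """Keep entries containing any keyword (case-insensitive)."""
    return [e for e in entries if any(k.lower() in e.lower() for k in keywords)]


# A few lines copied from the two logs above, wrapped as they appear in the mail.
SAMPLE = """\
2017/05/26 21:47:56.121928 [ 3942]: common/system_linux.c:364 failed sendto
(Network is unreachable)
*2017/05/26 21:47:56.122063 [ 3942]: server/ctdb_takeover.c:345 Failed to
send tcp tickle ack for 10.10.31.151*
2017/05/26 21:47:59.249344 [recoverd: 3720]: ctdb_recovery_lock: Failed to
get recovery lock on '/data/lock1/lockfile'
2017/05/26 21:47:59.249517 [recoverd: 3720]: Banning node 1 for 300 seconds
""".splitlines()

entries = join_wrapped(SAMPLE)
for entry in select(entries, ["tickle", "recovery lock", "banning"]):
    print(entry)
```

Running it against the full log.ctdb from each node (read with `open(path).read().splitlines()`) should make it easier to line up the events on both servers by timestamp.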

I read this on ctdb.samba.org:

IP Takeover

When a node in a cluster fails, CTDB will arrange that a
different node takes over the IP address of the failed node to ensure that
the IP addresses for the services provided are always available.

To speed up the process of IP takeover and when clients attached to a
failed node recovers as fast as possible, CTDB will automatically generate
gratuitous ARP packets to inform all nodes of the changed MAC address for
that IP. CTDB will also send "tickle ACK" packets to all attached clients
to trigger the clients to immediately recognize that the TCP connection
needs to be re-established and to shortcut any TCP retransmission timeouts
that may be active in the clients.

My guess is that CTDB on File Server 02 should send the tickle ACK to the
client, but in this situation File Server 01 tries to send the tickle ACK
while its own eth0 is down.
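
To illustrate why only the client in the other subnet might be affected, here is a small Python sketch. It uses the public address from my public_addresses file (10.1.1.202/24) and the two client addresses from my first mail. It only checks which client sits outside the public subnet and therefore has to be reached through the router. That this is the reason for the "Network is unreachable" error is my assumption, not something the log confirms. The names `PUBLIC`, `CLIENTS` and `needs_router` are just for illustration.

```python
import ipaddress

# Public address and prefix as listed in public_addresses.
PUBLIC = ipaddress.ip_interface("10.1.1.202/24")

# Client addresses from the test described in the first mail.
CLIENTS = {"Client 01": "10.1.15.200", "Client 03": "10.1.1.210"}


def needs_router(client_ip, public=PUBLIC):
    """Return True if the client is outside the public address's subnet."""
    return ipaddress.ip_address(client_ip) not in public.network


for name, ip in CLIENTS.items():
    path = "via the router" if needs_router(ip) else "directly on the local subnet"
    print(f"{name} ({ip}): reached {path}")
```

If the route to the other subnet goes through eth0, a packet for Client 01 has no usable route once eth0 is down, which would fit the failed sendto in the log of File Server 01.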

As for your question, "Do you have any firewall on your Cisco router?":

We don't have any firewall between the two subnets. Thanks so much.

Regards,

Giang

2017-05-30 17:42 GMT+07:00 Amitay Isaacs <amitay at gmail.com>:

>
> On Tue, May 30, 2017 at 4:17 PM, GiangCoi Mr via samba-technical <
> samba-technical at lists.samba.org> wrote:
>
>> Hi Team.
>> At the moment, I am setting up a Samba file server cluster as shown in the
>> following diagram
>>
>>
>>
>> File Server 01 and 02 connect to SAN 01 and SAN 02 through iSCSI. On both
>> file servers I installed and configured GlusterFS to provide the /data
>> folder for files shared with everyone.
>>
>> The ctdb configuration file is as follows:
>>
>> vi /data/lock/ctdb
>>
>> CTDB_RECOVERY_LOCK=/data/lock/lockfile
>> #CIFS only
>> CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
>> CTDB_MANAGES_SAMBA=yes
>> #CIFS only
>> CTDB_NODES=/etc/ctdb/nodes
>>
>>
>> File public_addresses
>> vi /data/lock/public_addresses
>> 10.1.1.202/24 eth0
>>
>> File nodes
>> vi /data/lock/nodes
>> 172.16.0.1
>> 172.16.0.2
>>
>>
>> File smb.conf
>>
>> cp /etc/samba/smb.conf /data/lock/smb.conf
>>
>> vi /data/lock/smb.conf
>>
>> clustering = yes
>> idmap backend = tdb2
>> private dir = /data/lock
>>
>>
>>
>> [share]
>>
>> comment = Gluster and CTDB based share
>> path = /data/share
>> read only = no
>> writable = yes
>> valid users = jon
>>
>> I created these soft links:
>> ln -s /data/lock/ctdb /etc/sysconfig/ctdb
>>
>> ln -s /data/lock/nodes /etc/ctdb/nodes
>> ln -s /data/lock/public_addresses /etc/ctdb/public_addresses
>>
>> ln -s /data/lock/smb.conf /etc/samba/
>>
>>
>> Everything I set up works. When I test a network failure with 2 clients by
>> disconnecting eth0 on File Server 01, this happens:
>>
>> - Client 01: 10.1.15.200 (in another subnet), which is copying data to File
>> Server 01, is interrupted and loses its connection to the file server. I
>> saw this in log.ctdb: "sendto failed, don't send tickle ACK to IP
>> 10.1.15.200"
>>
>
> Can you paste the exact entry from CTDB's log?
>
> Also, set debug level to NOTICE in ctdb configuration.
> CTDB_LOGLEVEL=NOTICE
>
>
>>  - Client 03: 10.1.1.210 (in the same subnet) keeps copying data to File
>> Server 01 normally.
>>
>>
>> I am using a Cisco Layer 3 device for inter-VLAN routing.
>>
>
> Do you have any firewall on your Cisco router?
>
>
>>
>>
>> So, how do I fix this issue?
>>
>>
>> Regards,
>>
>> Giang
>>
>
> Amitay.
>