Error in Setup File Server Cluster with Samba

GiangCoi Mr ltrgiang86 at gmail.com
Wed May 31 01:19:37 UTC 2017


Hi team,
Please help me fix this issue.

Regards,
Giang


2017-05-30 18:22 GMT+07:00 GiangCoi Mr <ltrgiang86 at gmail.com>:

> Hi Amitay Isaacs
>
> This is the log.ctdb from File Server 01, captured when I disconnect eth0 on
> File Server 01 while Client 10.1.31.151 (on another subnet) is copying files
> to the File Server:
>
> ---------------------------
> 2017/05/26 21:47:55.991662 [ 3942]: Ending traverse on DB brlock.tdb (id
> 21785), records 0
> 2017/05/26 21:47:56.121928 [ 3942]: common/system_linux.c:364 failed
> sendto (Network is unreachable)
> *2017/05/26 21:47:56.122063 [ 3942]: server/ctdb_takeover.c:345 Failed to
> send tcp tickle ack for 10.10.31.151*
> *2017/05/26 21:47:57.234419 [ 3942]: common/system_linux.c:364 failed
> sendto (Network is unreachable)*
> *2017/05/26 21:47:57.234542 [ 3942]: server/ctdb_takeover.c:345 Failed to
> send tcp tickle ack for 10.10.31.151*
> 2017/05/26 21:48:06.002174 [recoverd: 4129]: The rerecovery timeout has
> elapsed. We now allow recoveries to trigger again.
> 2017/05/26 21:48:20.332193 [ 3942]: Could not find idr:21511
> 2017/05/26 21:48:20.332289 [ 3942]: pnn 0 Invalid reqid 21511 in
> ctdb_reply_control
> 2017/05/26 21:48:23.334985 [recoverd: 4129]: server/ctdb_recoverd.c:1139
> Election timed out
> 2017/05/26 21:48:24.788899 [ 3942]: 10.1.21.83:4379: connected to
> 10.1.21.117:4379 - 1 connected
> 2017/05/26 21:49:25.083127 [ 3942]: Recovery daemon ping timeout. Count : 0
> 2017/05/26 21:49:25.083446 [recoverd: 4129]: ctdb_control error:
> 'ctdb_control timed out'
> 2017/05/26 21:49:25.083546 [recoverd: 4129]: ctdb_control error:
> 'ctdb_control timed out'
> 2017/05/26 21:49:25.083579 [recoverd: 4129]: Async operation failed with
> ret=-1 res=-1 opcode=80
> 2017/05/26 21:49:25.083596 [recoverd: 4129]: Async wait failed -
> fail_count=1
> 2017/05/26 21:49:25.083613 [recoverd: 4129]: server/ctdb_recoverd.c:345
> Failed to read node capabilities.
> 2017/05/26 21:49:25.083631 [recoverd: 4129]: server/ctdb_recoverd.c:3685
> Unable to update node capabilities.
> ------------------------------------------------------------------------------
>
> And this is the log.ctdb from File Server 02:
> ------------------------------------------
> 2017/05/26 21:47:56.227659 [ 3529]: dead count reached for node 0
> 2017/05/26 21:47:56.227721 [ 3529]: 10.1.21.117:4379: node 10.1.21.83:4379
> is dead: 0 connected
> 2017/05/26 21:47:56.227776 [ 3529]: Tearing down connection to dead node :0
> 2017/05/26 21:47:56.227853 [recoverd: 3720]: ctdb_control error: 'node is
> disconnected'
> 2017/05/26 21:47:56.227870 [recoverd: 3720]: ctdb_control error: 'node is
> disconnected'
> 2017/05/26 21:47:56.227887 [recoverd: 3720]: Async operation failed with
> ret=-1 res=-1 opcode=80
> 2017/05/26 21:47:56.227892 [recoverd: 3720]: Async wait failed -
> fail_count=1
> 2017/05/26 21:47:56.227895 [recoverd: 3720]: server/ctdb_recoverd.c:345
> Failed to read node capabilities.
> 2017/05/26 21:47:56.227900 [recoverd: 3720]: server/ctdb_recoverd.c:3685
> Unable to update node capabilities.
> 2017/05/26 21:47:56.228857 [recoverd: 3720]: Recmaster node 0 is
> disconnected. Force reelection
> 2017/05/26 21:47:56.228930 [ 3529]: Freeze priority 1
> 2017/05/26 21:47:56.229955 [ 3529]: Freeze priority 2
> 2017/05/26 21:47:56.230859 [ 3529]: Freeze priority 3
> 2017/05/26 21:47:56.231524 [ 3529]: server/ctdb_recover.c:612 Recovery
> mode set to ACTIVE
> 2017/05/26 21:47:56.231828 [ 3529]: This node (1) is now the recovery
> master
> 2017/05/26 21:47:59.236415 [recoverd: 3720]: server/ctdb_recoverd.c:1139
> Election timed out
> 2017/05/26 21:47:59.240023 [recoverd: 3720]: Node:1 was in recovery mode.
> Start recovery process
> 2017/05/26 21:47:59.240133 [recoverd: 3720]: server/ctdb_recoverd.c:1765
> Starting do_recovery
> 2017/05/26 21:47:59.240161 [recoverd: 3720]: Taking out recovery lock from
> recovery daemon
> 2017/05/26 21:47:59.240182 [recoverd: 3720]: Take the recovery lock
> 2017/05/26 21:47:59.249344 [recoverd: 3720]: ctdb_recovery_lock: Failed to
> get recovery lock on '/data/lock1/lockfile'
> 2017/05/26 21:47:59.249486 [recoverd: 3720]: Unable to get recovery lock -
> aborting recovery and ban ourself for 300 seconds
> 2017/05/26 21:47:59.249517 [recoverd: 3720]: Banning node 1 for 300 seconds
> 2017/05/26 21:47:59.249727 [ 3529]: Banning this node for 300 seconds
>
>
> I read on ctdb.samba.org:
>
> IP Takeover
>
> When a node in a cluster fails, CTDB will arrange that a different node
> takes over the IP address of the failed node to ensure that the IP
> addresses for the services provided are always available.
>
> To speed up the process of IP takeover, and so that clients attached to a
> failed node recover as fast as possible, CTDB will automatically generate
> gratuitous ARP packets to inform all nodes of the changed MAC address for
> that IP. CTDB will also send "tickle ACK" packets to all attached clients
> to trigger the clients to immediately recognize that the TCP connection
> needs to be re-established and to shortcut any TCP retransmission timeouts
> that may be active in the clients.
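The tickle mechanism can be observed from the ctdb command line. A rough sketch of the checks to run on a cluster node (the public address 10.1.1.202 is taken from the public_addresses file quoted later in this thread; adjust for the real cluster):

```shell
# List the TCP connections CTDB has recorded ("tickles") for a public IP.
# After a takeover, the new node replays an ACK for every entry shown here,
# which is what forces clients to reset and reconnect quickly.
ctdb gettickles 10.1.1.202

# Show which node currently hosts each public IP, before and after failover.
ctdb ip

# On a client, a capture like this should show the tickle ACK arriving from
# the takeover node shortly after eth0 is pulled on the failed node:
tcpdump -ni any host 10.1.1.202 and tcp
```

If `ctdb gettickles` shows no entry for the client's connection, the takeover node has nothing to tickle and the client sits in TCP retransmission until it times out.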
>
> My understanding is that CTDB on File Server 02 should send the tickle ACK
> to the client, but in this situation it is File Server 01 that sends the
> tickle ACK when its eth0 goes down.
>
> As for your question ("Do you have any firewall on your Cisco router?"):
> no, we don't have any firewall between the two subnets. Thanks so much.
>
> Regards,
>
> Giang
>
>
>
>
> 2017-05-30 17:42 GMT+07:00 Amitay Isaacs <amitay at gmail.com>:
>
>>
>> On Tue, May 30, 2017 at 4:17 PM, GiangCoi Mr via samba-technical <
>> samba-technical at lists.samba.org> wrote:
>>
>>> Hi team,
>>> At the moment I am setting up a Samba file server cluster following this
>>> diagram:
>>>
>>>
>>>
>>> File Server 01 and 02 connect to SAN 01 and SAN 02 through iSCSI. On both
>>> file servers I installed and configured GlusterFS, which exports the folder
>>> /data as the shared filesystem for everyone.
>>>
>>> The ctdb configuration file is as follows:
>>>
>>> vi /data/lock/ctdb
>>>
>>> CTDB_RECOVERY_LOCK=/data/lock/lockfile
>>> #CIFS only
>>> CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
>>> CTDB_MANAGES_SAMBA=yes
>>> #CIFS only
>>> CTDB_NODES=/etc/ctdb/nodes
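Since this file is read as a shell fragment by the init scripts, it can be sanity-checked by sourcing it before starting ctdbd. A minimal sketch (the here-document only mirrors the file above; on a real node, point CONF at /data/lock/ctdb):

```shell
# Source a ctdb sysconfig fragment and verify the settings the cluster
# depends on. The here-document stands in for the real /data/lock/ctdb.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
CTDB_RECOVERY_LOCK=/data/lock/lockfile
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_NODES=/etc/ctdb/nodes
EOF
. "$CONF"

# The recovery lock must live on the shared GlusterFS mount so that every
# node contends for the same file; otherwise CTDB cannot protect against
# split-brain during recovery.
case "$CTDB_RECOVERY_LOCK" in
  /data/*) echo "recovery lock on shared storage: $CTDB_RECOVERY_LOCK" ;;
  *)       echo "WARNING: recovery lock is not under /data" ;;
esac
```

Note that the File Server 02 log earlier in the thread reports the lock at '/data/lock1/lockfile' while this file sets /data/lock/lockfile; the path must be identical on every node.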
>>>
>>>
>>> The public_addresses file:
>>> vi /data/lock/public_addresses
>>> 10.1.1.202/24 eth0
>>>
>>> The nodes file:
>>> vi /data/lock/nodes
>>> 172.16.0.1
>>> 172.16.0.2
>>>
>>>
>>> The smb.conf file:
>>>
>>> cp /etc/samba/smb.conf /data/lock/smb.conf
>>>
>>> vi /data/lock/smb.conf
>>>
>>> clustering = yes
>>> idmap backend = tdb2
>>> private dir = /data/lock
>>>
>>>
>>>
>>> [share]
>>>
>>> comment = Gluster and CTDB based share
>>> path = /data/share
>>> read only = no
>>> writable = yes
>>> valid users = jon
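A sketch of checking the merged configuration with Samba's own parser before ctdb starts smbd (requires Samba on the node):

```shell
# testparm loads the file, reports unknown or invalid parameters, and dumps
# the effective configuration; "clustering = yes" should appear in the
# [global] section of its output.
testparm -s /data/lock/smb.conf
```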
>>>
>>> I created the soft links:
>>> ln -s /data/lock/ctdb /etc/sysconfig/ctdb
>>>
>>> ln -s /data/lock/nodes /etc/ctdb/nodes
>>> ln -s /data/lock/public_addresses /etc/ctdb/public_addresses
>>>
>>> ln -s /data/lock/smb.conf /etc/samba/
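The links can be created idempotently and then verified with readlink. A small sketch using a scratch directory in place of the real /data and /etc, so it is safe to try anywhere (on the actual nodes, drop the ROOT prefix):

```shell
# Create and verify the config symlinks. A scratch directory stands in for
# the real filesystem layout.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/data/lock" "$ROOT/etc/ctdb" "$ROOT/etc/sysconfig" "$ROOT/etc/samba"
touch "$ROOT/data/lock/ctdb" "$ROOT/data/lock/nodes" \
      "$ROOT/data/lock/public_addresses" "$ROOT/data/lock/smb.conf"

# -s: symbolic link, -f: overwrite a stale link from an earlier attempt.
ln -sf "$ROOT/data/lock/ctdb"             "$ROOT/etc/sysconfig/ctdb"
ln -sf "$ROOT/data/lock/nodes"            "$ROOT/etc/ctdb/nodes"
ln -sf "$ROOT/data/lock/public_addresses" "$ROOT/etc/ctdb/public_addresses"
ln -sf "$ROOT/data/lock/smb.conf"         "$ROOT/etc/samba/smb.conf"

# Every link should resolve to the shared copy under /data/lock:
for f in "$ROOT/etc/sysconfig/ctdb" "$ROOT/etc/ctdb/nodes" \
         "$ROOT/etc/ctdb/public_addresses" "$ROOT/etc/samba/smb.conf"; do
  readlink "$f"
done
```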
>>>
>>>
>>> Everything is set up and looks OK. But when I test a network outage with
>>> two clients by disconnecting eth0 on File Server 01:
>>>
>>> - Client 01: 10.1.15.200 (in another subnet) is copying data to File Server
>>> 01; the copy is interrupted and Client 01 loses its connection to the File
>>> Server. I see this in log.ctdb: "sendto failed, don't send tickle ACK to IP
>>> 10.1.15.200"
>>>
>>
>> Can you paste the exact entry from CTDB's log?
>>
>> Also, set debug level to NOTICE in ctdb configuration.
>> CTDB_LOGLEVEL=NOTICE
>>
>>
>>>  - Client 03: 10.1.1.210 (in the same subnet) keeps copying data to File
>>> Server 01 normally.
>>>
>>>
>>> I am using a Cisco Layer 3 switch for inter-VLAN routing.
>>>
>>
>> Do you have any firewall on your Cisco router?
>>
>>
>>>
>>>
>>> So, what should I do to fix this issue?
>>>
>>>
>>> Regards,
>>>
>>> Giang
>>>
>>
>> Amitay.
>>
>
>

