[Samba] ctdb_client.c control timed out - banning nodes

Daniel Metcalfe dmetcalfe at ocf.co.uk
Tue May 19 06:54:25 MDT 2015


Hello,

We are using CTDB / Samba to serve a number of Windows users, currently around 1,200, on a four-node CTDB setup.

CTDB version - ctdb-1.0.114.7-1
Samba Version - sernet-samba-4.1.16-10

In recent months we've seen a serious problem whenever one of the CTDB nodes is stopped or disconnected, whether manually or because of a fault. On some occasions all of the other nodes get banned when a node is stopped; sometimes it is only one or two of them, but at least one is always banned.

In the logs the problems always start with "control timed out". See below:

2015/05/18 14:24:23.375195 [ 6031]: client/ctdb_client.c:759 control timed out. reqid:129 opcode:43 dstnode:0
2015/05/18 14:24:23.375232 [ 6031]: client/ctdb_client.c:870 ctdb_control_recv failed
2015/05/18 14:24:23.375240 [ 6031]: Async operation failed with state 3, opcode:0
2015/05/18 14:24:23.375257 [ 6031]: Async wait failed - fail_count=1
2015/05/18 14:24:23.375263 [ 6031]: server/ctdb_takeover.c:1475 Async control CTDB_CONTROL_RELEASE_IP failed.

We saw a different pattern yesterday, when smb.conf was edited before the share path had been created and node 0 briefly became unhealthy.
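As a stopgap we're considering a pre-flight check so a share path can never be missing when the health check runs. A rough sketch, untested on the cluster: `testparm` is the standard Samba config parser, but the path extraction and the check itself are our own:

```shell
#!/bin/sh
# Rough sketch: verify that every "path =" directory in the parsed
# smb.conf actually exists before the new config goes live.
check_paths() {
    # reads one directory path per line on stdin; reports any that
    # are missing and returns non-zero if at least one is missing
    rc=0
    while IFS= read -r p; do
        [ -d "$p" ] || { echo "missing: $p"; rc=1; }
    done
    return $rc
}

# intended use (testparm ships with Samba):
#   testparm -s 2>/dev/null | sed -n 's/^[[:space:]]*path = //p' | check_paths
```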

Node 0:
2015/05/18 14:24:17.694058 [ 5972]: ERROR: samba directory /path/to/directory" not available
2015/05/18 14:24:17.694435 [ 5972]: Node became UNHEALTHY. Ask recovery master 0 to perform ip reallocation
2015/05/18 14:24:23.375195 [ 6031]: client/ctdb_client.c:759 control timed out. reqid:129 opcode:43 dstnode:0
2015/05/18 14:24:23.375232 [ 6031]: client/ctdb_client.c:870 ctdb_control_recv failed
2015/05/18 14:24:23.375240 [ 6031]: Async operation failed with state 3, opcode:0
2015/05/18 14:24:23.375257 [ 6031]: Async wait failed - fail_count=1
2015/05/18 14:24:23.375263 [ 6031]: server/ctdb_takeover.c:1475 Async control CTDB_CONTROL_RELEASE_IP failed
2015/05/18 14:24:23.375272 [ 6031]: Failed to read public ips from remote node 0
2015/05/18 14:24:23.376281 [ 6031]: Inconsistent ip allocation. Trigger reallocation.
2015/05/18 14:24:23.376291 [ 6031]: Node 1 has inconsistent public ip allocation and needs update.
2015/05/18 14:24:23.697069 [ 5972]: killed 4 TCP connections to released IP 155.198.63.147
2015/05/18 14:24:24.534101 [ 5972]: Timed out killing tcp connections for IP 155.198.63.147
2015/05/18 14:24:24.541018 [ 5972]: RTNETLINK answers: Cannot assign requested address
2015/05/18 14:24:24.541916 [ 5972]: Failed to del <ip address> on dev bond2:1
2015/05/18 14:24:24.543983 [ 6031]: Could not find idr:129
2015/05/18 14:24:24.544008 [ 6031]: client/ctdb_client.c:706 reqid 129 not found
2015/05/18 14:24:25.156512 [ 6031]: Trigger takeoverrun
2015/05/18 14:24:25.781969 [ 6031]: Async operation failed with ret=0 res=-1 opcode=0
2015/05/18 14:24:25.782002 [ 6031]: Async operation failed with ret=0 res=-1 opcode=0
2015/05/18 14:24:25.782009 [ 6031]: Async wait failed - fail_count=2
2015/05/18 14:24:25.782015 [ 6031]: server/ctdb_takeover.c:1523 Async control CTDB_CONTROL_TAKEOVER_IP failed

Node 1:
2015/05/18 14:24:24.919614 [29039]: common/system_linux.c:91 failed to open raw socket
2015/05/18 14:24:24.919650 [29039]: server/ctdb_takeover.c:2667 sending of gratious arp on iface 'bond2:1' failed (Too many open files)
2015/05/18 14:24:25.025629 [29039]: Failed to create pipe for vacuum child process.
2015/05/18 14:24:25.052609 [29039]: common/system_linux.c:91 failed to open raw socket
2015/05/18 14:24:25.052634 [29039]: server/ctdb_takeover.c:240 sending of arp failed on iface 'bond2:1' (Too many open files)
2015/05/18 14:24:25.052656 [29039]: common/system_linux.c:342 failed to open raw socket (Too many open files)
2015/05/18 14:24:25.052664 [29039]: server/ctdb_takeover.c:259 Failed to send tcp tickle ack for <ip address>
....repeats for every client
2015/05/18 14:24:38.389204 [29039]: INTERNAL ERROR: Signal 6 in ctdbd pid 29039
2015/05/18 14:24:38.389209 [29039]: Please read the file BUGS.txt in the distribution
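The "Too many open files" burst on node 1 suggests ctdbd hit its file descriptor limit during the takeover. A quick check we could run on each node (plain /proc reading on Linux; using "ctdbd" as the pgrep name is our assumption):

```shell
#!/bin/sh
# Sketch: compare a process's open file descriptors against its soft
# "Max open files" limit, read straight from /proc (Linux only).
fd_usage() {
    pid=$1
    used=$(ls "/proc/$pid/fd" | wc -l)
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid $pid: $used fds open, soft limit $soft"
}

# e.g. on a cluster node: fd_usage "$(pgrep -o ctdbd)"
```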

Node 2:
2015/05/18 14:24:38.389035 [12968]: Freeze priority 1
2015/05/18 14:24:39.235547 [13021]: ctdb_control error: 'node is disconnected'
2015/05/18 14:24:39.235572 [13021]: ctdb_control error: 'node is disconnected'
2015/05/18 14:24:39.235583 [13021]: Async operation failed with ret=-1 res=-1 opcode=80
2015/05/18 14:24:39.235590 [13021]: Async wait failed - fail_count=1
2015/05/18 14:24:39.235598 [13021]: server/ctdb_recoverd.c:251 Failed to read node capabilities.
2015/05/18 14:24:39.235605 [13021]: server/ctdb_recoverd.c:3025 Unable to update node capabilities.
2015/05/18 14:24:59.405500 [12968]: Freeze priority 1
2015/05/18 14:25:03.257579 [12968]: Could not add client IP 155.198.17.133. This is not a public address.
2015/05/18 14:25:20.427135 [12968]: Banning this node for 300 seconds
2015/05/18 14:25:20.427216 [12968]: Freeze priority 1
2015/05/18 14:25:20.427231 [12968]: Freeze priority 2
2015/05/18 14:25:20.429833 [12968]: Freeze priority 3
2015/05/18 14:25:20.565797 [12968]: Killing TCP connection <ip address>:59613 <ip address>:445

.....repeats killing all TCP connections .....

2015/05/18 14:25:21.331435 [13021]: Node is stopped or banned but recovery mode is not active. Activate recovery mode and lock databases
2015/05/18 14:25:21.331517 [12968]: Freeze priority 1
2015/05/18 14:25:29.422120 [12968]: Timed out killing tcp connections for IP 155.198.63.145
2015/05/18 14:25:29.617412 [12968]: lib/events/events_standard.c:593 ERROR Time jumped forward by 9 seconds
2015/05/18 14:26:29.712507 [12968]: server/ctdb_recover.c:582 Been in recovery mode for too long. Dropping all IPS
2015/05/18 14:30:20.428137 [12968]: Banning timedout

Node 3:
2015/05/18 14:24:38.389263 [31822]: Freeze priority 1
2015/05/18 14:24:39.245580 [32106]: ctdb_control error: 'node is disconnected'
2015/05/18 14:24:39.245616 [32106]: ctdb_control error: 'node is disconnected'
2015/05/18 14:24:39.245628 [32106]: Async operation failed with ret=-1 res=-1 opcode=80
2015/05/18 14:24:39.245636 [32106]: Async wait failed - fail_count=1
2015/05/18 14:24:39.245644 [32106]: server/ctdb_recoverd.c:251 Failed to read node capabilities.
2015/05/18 14:24:39.245651 [32106]: server/ctdb_recoverd.c:3025 Unable to update node capabilities.
2015/05/18 14:24:59.405728 [31822]: Freeze priority 1
2015/05/18 14:25:29.617999 [31822]: Banning this node for 300 seconds
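The 300-second figure in the bans matches ctdb's RecoveryBanPeriod tunable (default 300). To rule out someone having changed it, we can pull it out of `ctdb listvars` output; the tunable name is from the ctdb documentation, the parsing below is our own:

```shell
#!/bin/sh
# Sketch: extract the RecoveryBanPeriod tunable (seconds a misbehaving
# node stays banned) from "ctdb listvars" style output.
ban_period() {
    # expects lines like "RecoveryBanPeriod = 300" on stdin
    sed -n 's/^RecoveryBanPeriod[[:space:]]*=[[:space:]]*//p'
}

# on a cluster node: ctdb listvars | ban_period
```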

At this point the samba logs produce:

Node 1:
[2015/05/18 14:25:39.313788,  0] ../source3/lib/ctdbd_conn.c:668(ctdb_handle_message)
  Got cluster reconfigure message

Node 2:
...A selection of errors:
[2015/05/18 14:24:39.234057,  0] ../source3/lib/ctdbd_conn.c:401(ctdb_read_req)
  packet_fd_read failed: NT_STATUS_END_OF_FILE
[2015/05/18 14:24:39.234215,  0] ../source3/lib/ctdbd_conn.c:86(cluster_fatal)
  cluster fatal event: ctdbd died
[2015/05/18 14:24:39.234146,  0] ../source3/lib/ctdbd_conn.c:401(ctdb_read_req)
   - exiting immediately
[2015/05/18 14:24:39.234524,  0] ../source3/lib/ctdbd_conn.c:86(cluster_fatal)
   - exiting immediately
[2015/05/18 14:24:39.235395,  0] ../source3/lib/ctdbd_conn.c:725(ctdbd_socket_handler)
  packet_fd_read failed: NT_STATUS_END_OF_FILE
...Errors above repeat 100s of times

Node 3:
[2015/05/18 14:25:23.071086,  0] ../source3/lib/util.c:477(reinit_after_fork)
  messaging_reinit() failed: NT_STATUS_INTERNAL_DB_ERROR
[2015/05/18 14:30:32.345565,  0] ../source3/lib/ctdbd_conn.c:668(ctdb_handle_message)
  Got cluster reconfigure message

Node 4:
[2015/05/18 14:25:29.768514,  0] ../source3/lib/util.c:477(reinit_after_fork)
  messaging_reinit() failed: NT_STATUS_INTERNAL_DB_ERROR
[2015/05/18 14:30:59.052518,  0] ../source3/lib/ctdbd_conn.c:668(ctdb_handle_message)
  Got cluster reconfigure message

There are further entries in the Samba logs, though not directly at the time of the CTDB problem.

There are multiple entries similar to:

[2015/05/18 14:33:41.898756,  0] ../source3/lib/util.c:896(log_stack_trace)
  BACKTRACE: 27 stack frames:
   #0 /usr/lib64/samba/libsmbconf.so.0(log_stack_trace+0x1c) [0x7f404dc429b7]
   #1 /usr/lib64/samba/libsmbconf.so.0(smb_panic_s3+0x55) [0x7f404dc42ab9]
   #2 /usr/lib64/samba/libsamba-util.so.0(smb_panic+0x2d) [0x7f404f78abae]
   #3 /usr/lib64/samba/libtalloc.so.2(+0x23a1) [0x7f404ee673a1]
   #4 /usr/lib64/samba/libtalloc.so.2(+0x41c6) [0x7f404ee691c6]
   #5 /usr/lib64/samba/libtalloc.so.2(_talloc_get_type_abort+0x64) [0x7f404ee6922f]
   #6 /usr/lib64/samba/libsmbconf.so.0(+0x361c6) [0x7f404dc531c6]
   #7 /usr/lib64/samba/libdbwrap.so(dbwrap_parse_record+0x24) [0x7f404a5711d7]
   #8 /usr/lib64/samba/libsmbd_base.so(fetch_share_mode_unlocked+0xa5) [0x7f404f4003e6]
   #9 /usr/lib64/samba/libsmbd_base.so(get_file_infos+0x6e) [0x7f404f3f8765]
   #10 /usr/lib64/samba/libsmbd_base.so(smbd_dirptr_get_entry+0x40f) [0x7f404f327860]
   #11 /usr/lib64/samba/libsmbd_base.so(smbd_dirptr_lanman2_entry+0x1ef) [0x7f404f36176d]
   #12 /usr/lib64/samba/libsmbd_base.so(smbd_smb2_request_process_find+0xd56) [0x7f404f3b5e36]
   #13 /usr/lib64/samba/libsmbd_base.so(smbd_smb2_request_dispatch+0xf89) [0x7f404f3a4d57]
   #14 /usr/lib64/samba/libsmbd_base.so(+0x12b65a) [0x7f404f3a565a]
   #15 /usr/lib64/samba/libsmbd_base.so(+0x12c8af) [0x7f404f3a68af]
   #16 /usr/lib64/samba/libsmbconf.so.0(run_events_poll+0x2c2) [0x7f404dc5f815]
   #17 /usr/lib64/samba/libsmbconf.so.0(+0x42c5d) [0x7f404dc5fc5d]
   #18 /usr/lib64/samba/libtevent.so.0(_tevent_loop_once+0x92) [0x7f404f071407]
   #19 /usr/lib64/samba/libsmbd_base.so(smbd_process+0xf76) [0x7f404f390348]
   #20 /usr/sbin/smbd(+0x8860) [0x7f404fded860]
   #21 /usr/lib64/samba/libsmbconf.so.0(run_events_poll+0x2c2) [0x7f404dc5f815]
   #22 /usr/lib64/samba/libsmbconf.so.0(+0x42c5d) [0x7f404dc5fc5d]
   #23 /usr/lib64/samba/libtevent.so.0(_tevent_loop_once+0x92) [0x7f404f071407]
   #24 /usr/sbin/smbd(main+0x17c9) [0x7f404fdef2ee]
   #25 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f404c6dfd5d]
   #26 /usr/sbin/smbd(+0x58b9) [0x7f404fdea8b9]

There are also thousands of entries similar to:

[2015/05/18 14:09:45.561245,  0] ../source3/modules/vfs_gpfs.c:1799(vfs_gpfs_get_quotas)
  Get fset id failed path </path/to/directory/a/b/c>, dir </path/to/directory/a/b/c>, errno 13.

I'm wondering whether we have multiple problems going on at the same time, or whether they are all related to some degree. Could there be a bug in CTDB?

Any pointers greatly appreciated.


Regards,
Danny Metcalfe

