[Samba] CTDB problems

Alex Crow acrow at integrafin.co.uk
Wed Apr 19 11:55:45 UTC 2017


Hi,

This morning our CTDB-managed cluster took a nosedive. Member machines
had hung smbd tasks, which caused them to reboot, and the cluster did
not come back up consistently. We eventually got it more or less stable
with two of the three nodes, but we're still seeing worrying messages.
For example, we've just noticed:

2017/04/19 12:10:31.168891 [ 5417]: Vacuuming child process timed out for db brlock.tdb
2017/04/19 12:11:39.250169 [ 5417]: Unable to get RECORD lock on database brlock.tdb for 500 seconds
===== Start of debug locks PID=9372 =====
8084 /usr/sbin/smbd brlock.tdb.2 7044 7044
20931 /usr/libexec/ctdb/ctdb_lock_helper brlock.tdb.2 7044 7044 W
21665 /usr/sbin/smbd brlock.tdb.2 174200 174200
----- Stack trace for PID=21665 -----
2017/04/19 12:11:39.571097 [ 5417]: /etc/ctdb/debug_locks.sh: line 73: gstack: command not found
----- Stack trace for PID=8084 -----
2017/04/19 12:11:39.571346 [ 5417]: /etc/ctdb/debug_locks.sh: line 73: gstack: command not found
===== End of debug locks PID=9372 =====
2017/04/19 12:37:19.547636 [vacuum-locking.tdb: 3790]: tdb(/var/lib/ctdb/locking.tdb.2): tdb_oob len 541213780 beyond eof at 55386112
2017/04/19 12:37:19.547694 [vacuum-locking.tdb: 3790]: tdb(/var/lib/ctdb/locking.tdb.2): tdb_free: left offset read failed at 541213776
2017/04/19 12:37:19.547709 [vacuum-locking.tdb: 3790]: tdb(/var/lib/ctdb/locking.tdb.2): tdb_oob len 541213784 beyond eof at 55386112

Here are some logs from earlier, where we think we had a stuck smbd task:

28657 /usr/sbin/smbd locking.tdb.2 9848 9848 W
28687 /usr/sbin/smbd locking.tdb.2 186860 186860 W
28692 /usr/sbin/smbd locking.tdb.2 62164 62164 W
28695 /usr/sbin/smbd locking.tdb.2 22836 22836 W
28729 /usr/sbin/smbd locking.tdb.2 81228 81228 W
28855 /usr/sbin/smbd locking.tdb.2 170264 170264 W
28915 /usr/sbin/smbd locking.tdb.2 33040 33040 W
28916 /usr/sbin/smbd locking.tdb.2 99140 99140 W
28917 /usr/sbin/smbd locking.tdb.2 156412 156412 W
30493 /usr/sbin/smbd locking.tdb.2 186860 186860 W
30582 /usr/sbin/smbd locking.tdb.2 130424 130424 W
30637 /usr/sbin/smbd locking.tdb.2 214724 214724 W
12492 /usr/sbin/smbd locking.tdb.2 124060 124060 W
30645 /usr/sbin/smbd locking.tdb.2 128364 128364 W
30657 /usr/sbin/smbd locking.tdb.2 186828 186828 W
30697 /usr/sbin/smbd locking.tdb.2 11364 11364 W
30997 /usr/sbin/smbd locking.tdb.2 9848 9848 W
30999 /usr/sbin/smbd locking.tdb.2 128364 128364 W
31018 /usr/sbin/smbd locking.tdb.2 151204 151204 W
31021 /usr/sbin/smbd locking.tdb.2 186860 186860 W
31049 /usr/sbin/smbd locking.tdb.2 22836 22836 W
31051 /usr/sbin/smbd locking.tdb.2 33432 33432 W
10555 /usr/sbin/smbd brlock.tdb.2 58972 58972
18215 /usr/libexec/ctdb/ctdb_lock_helper brlock.tdb.2 58971 58973 W
17216 /usr/sbin/smbd locking.tdb.2 232120 232120
18215 /usr/libexec/ctdb/ctdb_lock_helper brlock.tdb.2 168 58970
17719 /usr/sbin/smbd locking.tdb.2 330636 330636
14579 /usr/sbin/smbd locking.tdb.2 216548 216548
18214 /usr/libexec/ctdb/ctdb_lock_helper locking.tdb.2 216548 216550 W
30945 /usr/sbin/smbd brlock.tdb.2.20170419.102626.697770650.corrupt 160288 160288
12448 /usr/sbin/smbd brlock.tdb.2 216548 216548
8990 /usr/sbin/smbd locking.tdb.2 281176 281176
17225 /usr/sbin/smbd locking.tdb.2 225860 225860
----- Stack trace for PID=10555 -----
2017/04/19 10:40:31.291747 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=10566 -----
2017/04/19 10:40:31.291982 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=12448 -----
2017/04/19 10:40:31.292204 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=14579 -----
2017/04/19 10:40:31.292428 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=17216 -----
2017/04/19 10:40:31.292648 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=17225 -----
2017/04/19 10:40:31.292865 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=17719 -----
2017/04/19 10:40:31.293086 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=18214 -----
2017/04/19 10:40:31.293308 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=18215 -----
2017/04/19 10:40:31.293528 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=30945 -----
----- Process in D state, printing kernel stack only
[<ffffffffa05b253d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[<ffffffffa05b26d2>] fuse_request_send+0x12/0x20 [fuse]
[<ffffffffa05bb66c>] fuse_setlk+0x16c/0x1a0 [fuse]
[<ffffffffa05bc40f>] fuse_file_lock+0x5f/0x210 [fuse]
[<ffffffff81253a73>] vfs_lock_file+0x23/0x40
[<ffffffff81255069>] fcntl_setlk+0x159/0x310
[<ffffffff81210fe1>] SyS_fcntl+0x3e1/0x610
[<ffffffff816968c9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
----- Stack trace for PID=8990 -----
2017/04/19 10:40:31.294250 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
===== End of debug locks PID=31055 =====

And a bit later:

18215 /usr/libexec/ctdb/ctdb_lock_helper brlock.tdb.2 168 58970
17719 /usr/sbin/smbd locking.tdb.2 330636 330636
14579 /usr/sbin/smbd locking.tdb.2 216548 216548
18214 /usr/libexec/ctdb/ctdb_lock_helper locking.tdb.2 216548 216550 W
12448 /usr/sbin/smbd brlock.tdb.2 216548 216548
8990 /usr/sbin/smbd locking.tdb.2 281176 281176
17225 /usr/sbin/smbd locking.tdb.2 225860 225860
----- Stack trace for PID=10555 -----
2017/04/19 10:42:11.363670 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=10566 -----
2017/04/19 10:42:11.363907 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=12448 -----
2017/04/19 10:42:11.364133 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=14579 -----
2017/04/19 10:42:11.364375 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=17216 -----
2017/04/19 10:42:11.364600 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=17225 -----
2017/04/19 10:42:11.364819 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=17719 -----
2017/04/19 10:42:11.365038 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=18214 -----
2017/04/19 10:42:11.365257 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=18215 -----
2017/04/19 10:42:11.365483 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
----- Stack trace for PID=8990 -----
2017/04/19 10:42:11.365705 [ 7423]: /etc/ctdb/debug_locks.sh: line 73: 
gstack: command not found
===== End of debug locks PID=1213 =====
2017/04/19 10:42:21.315316 [ 7423]: Recovery daemon ping timeout. Count : 0
2017/04/19 10:42:21.462781 [recoverd: 7632]: recovery: control FREEZE_DB failed for db locking.tdb on node 1, ret=110
2017/04/19 10:42:21.462835 [recoverd: 7632]: recovery: recover database 0x7a19d84d, attempt 3
2017/04/19 10:42:21.462855 [recoverd: 7632]: recovery: control FREEZE_DB failed for db brlock.tdb on node 1, ret=110
2017/04/19 10:42:21.462883 [recoverd: 7632]: recovery: recover database 0x4e66c2b2, attempt 3
2017/04/19 10:42:51.463173 [recoverd: 7632]: recovery: control FREEZE_DB failed for db locking.tdb on node 1, ret=110
2017/04/19 10:42:51.463218 [recoverd: 7632]: recovery: control FREEZE_DB failed for db brlock.tdb on node 1, ret=110
2017/04/19 10:42:51.463230 [recoverd: 7632]: recovery: 19 of 21 databases recovered
2017/04/19 10:42:51.463234 [recoverd: 7632]: recovery: database recovery failed, ret=5
2017/04/19 10:42:51.466924 [recoverd: 7632]: ../ctdb/server/ctdb_recoverd.c:2013 Starting do_recovery
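
To make the lock dumps above easier to read, here is a rough sketch of a
script that pairs each waiting process with the process holding an
overlapping lock. It is not part of CTDB. It assumes each lock line has
the columns PID, command, database file, start offset, end offset, and
an optional trailing "W" for a waiter, which is how the lines above
appear to be laid out. The names parse_lock_lines and blockers are made
up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout of one lock line (inferred from the dumps, not from CTDB docs):
#   PID  command  database-file  start-offset  end-offset  [W]
LOCK_LINE = re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)(?:\s+(W))?$")


def parse_lock_lines(text):
    """Return one dict per line that looks like a lock entry; skip the rest."""
    entries = []
    for line in text.splitlines():
        match = LOCK_LINE.match(line.strip())
        if not match:
            continue  # timestamps, stack traces, separators, wrapped lines
        pid, cmd, db, start, end, waiting = match.groups()
        entries.append({
            "pid": int(pid),
            "cmd": cmd,
            "db": db,
            "start": int(start),
            "end": int(end),
            "waiting": waiting == "W",
        })
    return entries


def blockers(entries):
    """Map (holder pid, db) to the PIDs waiting on an overlapping byte range."""
    holders = [e for e in entries if not e["waiting"]]
    waiters = [e for e in entries if e["waiting"]]
    result = defaultdict(list)
    for h in holders:
        for w in waiters:
            same_db = h["db"] == w["db"]
            overlap = h["start"] <= w["end"] and w["start"] <= h["end"]
            if same_db and overlap and h["pid"] != w["pid"]:
                result[(h["pid"], h["db"])].append(w["pid"])
    return dict(result)
```

Feeding it the text of a dump and printing blockers(parse_lock_lines(text))
gives a short list of holder PIDs to look at first. Entries that the mail
client wrapped across two lines will not match and need rejoining first.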

It does look like we have some database corruption.

What may have caused this, and is there any way to resolve it?

Our Samba and CTDB version is 4.4.9-SerNet-RedHat-37.el7, and I've
checked that the versions are consistent.

Any help would be most gratefully received.

Alex
