[Samba] CTDB problems

Thu Apr 20 12:45:57 UTC 2017

On 20/04/17 03:19, Martin Schwenke wrote:
> On Wed, 19 Apr 2017 12:55:45 +0100, Alex Crow via samba
> <samba at lists.samba.org> wrote:
>
>> This morning our CTDB-managed cluster took a nosedive. We had member
>> machines with hung smbd tasks, which caused them to reboot, and the
>> cluster did not come back up consistently. We eventually got it more or
>> less stable with two of the three nodes, but we're still seeing worrying
>> messages, e.g. we've just noticed:
>>
>> [...]
>> 2017/04/19 12:37:19.547636 [vacuum-locking.tdb: 3790]: tdb(/var/lib/ctdb/locking.tdb.2): tdb_oob len 541213780 beyond eof at 55386112
>> 2017/04/19 12:37:19.547694 [vacuum-locking.tdb: 3790]: tdb(/var/lib/ctdb/locking.tdb.2): tdb_free: left offset read failed at 541213776
>> 2017/04/19 12:37:19.547709 [vacuum-locking.tdb: 3790]: tdb(/var/lib/ctdb/locking.tdb.2): tdb_oob len 541213784 beyond eof at 55386112
> No solid guesses on this.  Those messages come from deep in TDB.
>
> Could the filesystem be full?

I don't think it was at this point; we'd cleared out /var/crash earlier
and saw plenty of space.
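
For anyone wanting to rule out a full filesystem quickly, here is a minimal sketch in Python. It only reports free space for the filesystem holding a given path. The default path /var/lib/ctdb is taken from the log lines above; the function name free_space_report is made up for illustration and is not part of CTDB or Samba.

```python
import os
import shutil


def free_space_report(path="/var/lib/ctdb"):
    """Return (free_bytes, free_fraction) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free, usage.free / usage.total


if __name__ == "__main__":
    # Fall back to "/" if the CTDB state directory is not present on this host.
    target = "/var/lib/ctdb" if os.path.exists("/var/lib/ctdb") else "/"
    free, fraction = free_space_report(target)
    print(f"{target}: {free / 2**20:.0f} MiB free ({fraction:.1%})")
```

This only looks at block usage at the moment it runs, so it cannot say whether the filesystem was full earlier, before /var/crash was cleared.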

>> [...]
>> Here are some logs from earlier, where we think we had a stuck smbd task:
>>
>> 28657 /usr/sbin/smbd locking.tdb.2 9848 9848 W
>> 28687 /usr/sbin/smbd locking.tdb.2 186860 186860 W
>> 18214 /usr/libexec/ctdb/ctdb_lock_helper locking.tdb.2 216548 216550 W
>> 30945 /usr/sbin/smbd brlock.tdb.2.20170419.102626.697770650.corrupt
>> [...]
>> ----- Stack trace for PID=30945 -----
>> ----- Process in D state, printing kernel stack only
>> [<ffffffffa05b253d>] __fuse_request_send+0x13d/0x2c0 [fuse]
>> [<ffffffffa05b26d2>] fuse_request_send+0x12/0x20 [fuse]
>> [<ffffffffa05bb66c>] fuse_setlk+0x16c/0x1a0 [fuse]
>> [<ffffffffa05bc40f>] fuse_file_lock+0x5f/0x210 [fuse]
>> [<ffffffff81253a73>] vfs_lock_file+0x23/0x40
>> [<ffffffff81255069>] fcntl_setlk+0x159/0x310
>> [<ffffffff81210fe1>] SyS_fcntl+0x3e1/0x610
>> [<ffffffff816968c9>] system_call_fastpath+0x16/0x1b
>> [<ffffffffffffffff>] 0xffffffffffffffff
> So this tells you that smbd was wedged in the cluster filesystem.

I've passed this on to the MooseFS devs to see if they know what it
might be.
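
In case it helps anyone chasing the same symptom, below is a rough Python sketch for spotting processes in D state (uninterruptible sleep) like the smbd above. It reads the standard Linux /proc/<pid>/stat files; the function names parse_stat and d_state_processes are my own and purely illustrative. It does not reproduce the kernel stack shown in the trace; on Linux that normally comes from /proc/<pid>/stack, which usually needs root.

```python
import os


def parse_stat(stat_line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition("(")
    return int(pid), comm, tail.split()[0]


def d_state_processes(proc="/proc"):
    """Yield (pid, comm) for processes currently in state D."""
    if not os.path.isdir(proc):
        return  # not a Linux-style /proc; nothing to scan
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable mid-scan
        if state == "D":
            yield pid, comm


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

A process that shows up here once may just be doing ordinary I/O; it is only suspicious if the same PID stays in D state across repeated runs.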

>> [...]
>> It does look like we have some database corruption.
>>
>> What may have caused this, and is there any way to resolve it?
> The good news is that you're only seeing it in vacuuming and you're
> not actually seeing TDB errors in smbd.
>
> Still, it isn't something we've seen.  If we figure out anything then
> we'll definitely let you know...
>
> peace & happiness,
> martin

Thanks very much Martin, very helpful of you.

Cheers,

Alex