[Samba] ctdb vacuum timeouts and record locks

Martin Schwenke martin at meltin.net
Wed Nov 15 07:47:39 UTC 2017


On Tue, 14 Nov 2017 22:48:57 -0800, Computerisms Corporation via samba
<samba at lists.samba.org> wrote:

> Well, it has been over a week since my last hung process, but I got 
> another one today...
> >> So, not sure how to determine if this is a gluster problem, an lxc
> >> problem, or a ctdb/smbd problem.  Thoughts/suggestions are welcome...  
> > 
> > You need a stack trace of the stuck smbd process.  If it is wedged in a
> > system call on the cluster filesystem then you can blame the cluster
> > filesystem.  debug_locks.sh is meant to be able to get you the relevant
> > stack trace via gstack.  In fact, even before you get the stack trace
> > you could check a process listing to see if the process is stuck in D
> > state.  
> 
> So, yes, I do have a process stuck in the D state.  It is an smbd 
> process.  Matching up the times in the logs, I see that the 
> "Vacuuming child process timed out for db locking.tdb" error in ctdb 
> lines up with the user who owns the smbd process accessing a file 
> that has been problematic before.  It is an xlsx file.
> 
> > gstack basically does:
> > 
> >    gdb -batch -ex "thread apply all bt" -p <pid>
> > 
> > For a single-threaded process it leaves out "thread apply all".
> > However, with recent GDB I'm not sure it makes a difference... it
> > seems to work for me on Linux.
> > 
> > Note that gstack/gdb will hang when run against a process in D state.  
> 
> Indeed, gdb, pstack, and strace all either hang or output no information.
> 
> I have been trying to find a way to get the actual gdb output, but all I 
> can seem to find is the contents of /proc/<pid>/stack:
> 
> [<ffffffffc05ed856>] request_wait_answer+0x166/0x1f0 [fuse]
> [<ffffffffa04b8d50>] prepare_to_wait_event+0xf0/0xf0
> [<ffffffffc05ed958>] __fuse_request_send+0x78/0x80 [fuse]
> [<ffffffffc05f0bdd>] fuse_simple_request+0xbd/0x190 [fuse]
> [<ffffffffc05f6c37>] fuse_setlk+0x177/0x190 [fuse]
> [<ffffffffa0659467>] SyS_flock+0x117/0x190
> [<ffffffffa0403b1c>] do_syscall_64+0x7c/0xf0
> [<ffffffffa0a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> I am still not too sure how to interpret this, but I think it is 
> pointing me to the gluster filesystem, so I will see what I can find 
> chasing that down...

Yes, that stack does look like it is stuck in the gluster filesystem:
the fuse_setlk() and SyS_flock() frames show the smbd blocked in a
flock() call that has gone down into the FUSE client (the GlusterFS
mount), which is also why gdb, pstack and strace hang on it.
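
If you want to check the other nodes the same way, here is a rough
sketch (assuming a Linux host with /proc mounted; <pid> is just a
placeholder):

    # List smbd processes in uninterruptible sleep ("D" state)
    ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/ && /smbd/'

    # Kernel-side stack of the stuck process (works even when gdb hangs)
    cat /proc/<pid>/stack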

Are you only accessing the filesystem via Samba or do you also have
something like NFS exports?  If you are only exporting via Samba then
you could try setting "posix locking = no" in your Samba
configuration.  However, please read the documentation for that option
in smb.conf(5) and be sure of your use-case before trying this on a
production system...
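
For example, the option can go in the relevant share definition (the
share name and path below are only placeholders), or in [global] to
apply it everywhere:

    [data]
        path = /glusterfs/data
        posix locking = no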

peace & happiness,
martin


