[Samba] ctdb vacuum timeouts and record locks
Computerisms Corporation
bob at computerisms.ca
Fri Oct 27 05:44:30 UTC 2017
Hi List,
I set up a ctdb cluster a couple of months back. Things seemed pretty
solid for the first 2-3 weeks, but then I started getting reports of
people not being able to access files, or sometimes directories. It
has taken me a while to figure some things out, but the common
denominator seems to be vacuuming timeouts for locking.tdb in the ctdb
log. These repeat every 2 minutes and 10 seconds for anywhere from an
hour to more than a day, and after a while the log also starts
reporting failures to get a RECORD lock on the same tdb file. Whenever
I get a report about inaccessible files I find this in the ctdb logs:
ctdbd[89]: Vacuuming child process timed out for db locking.tdb
ctdbd[89]: Vacuuming child process timed out for db locking.tdb
ctdbd[89]: Unable to get RECORD lock on database locking.tdb for 10 seconds
ctdbd[89]: Set lock debugging helper to
"/usr/local/samba/etc/ctdb/debug_locks.sh"
/usr/local/samba/etc/ctdb/debug_locks.sh: 142:
/usr/local/samba/etc/ctdb/debug_locks.sh: cannot create : Directory
nonexistent
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
cat: write error: Broken pipe
sh: echo: I/O error
ctdbd[89]: Unable to get RECORD lock on database locking.tdb for 20 seconds
/usr/local/samba/etc/ctdb/debug_locks.sh: 142:
/usr/local/samba/etc/ctdb/debug_locks.sh: cannot create : Directory
nonexistent
sh: echo: I/O error
sh: echo: I/O error
From googling, it is apparently okay for the vacuuming process to time
out; it should succeed on a later run, and if it doesn't, the only harm
is a bloated file. But in my case it never succeeds again after the
first time I see this message, and the locking.tdb file does not change
size, bigger or smaller.
I am not really clear on what the script cannot create, but I could
find no gstack package available for Debian, so I changed the script to
run pstack instead, and then ran it manually with set -x while the logs
were recording the problem. I think this is the trace output it is
trying to produce, but sadly it isn't meaningful to me (yet!):
cat /proc/30491/stack
[<ffffffff8197d00d>] inet_recvmsg+0x7d/0xb0
[<ffffffffc07c3856>] request_wait_answer+0x166/0x1f0 [fuse]
[<ffffffff814b8d50>] prepare_to_wait_event+0xf0/0xf0
[<ffffffffc07c3958>] __fuse_request_send+0x78/0x80 [fuse]
[<ffffffffc07c6bdd>] fuse_simple_request+0xbd/0x190 [fuse]
[<ffffffffc07ccc37>] fuse_setlk+0x177/0x190 [fuse]
[<ffffffff816592f7>] SyS_flock+0x117/0x190
[<ffffffff81403b1c>] do_syscall_64+0x7c/0xf0
[<ffffffff81a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
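If I am reading that right, the process holding the record lock is
itself stuck in a flock() call on a FUSE filesystem, which in my setup
can only be the GlusterFS mount, and it is waiting for the fuse daemon
to answer. Next time this happens I plan to test whether a plain flock
on the Gluster mount hangs on its own, with something like this (the
mount point is just a placeholder for mine):

# create a scratch file on the gluster mount and try to flock it; if
# this hangs past the 10 second timeout, the hang is below samba/ctdb,
# in the fuse/gluster layer
touch /mnt/gluster/.locktest
flock --timeout 10 /mnt/gluster/.locktest -c 'echo got and released the lock'
echo "flock exit status: $?"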
This might happen twice in a day or once in a week; it doesn't seem
consistent, and so far I haven't found any catalyst.
My setup is two servers. The OS is Debian, running Samba AD on
dedicated SSDs, and each server has a RAID array of HDDs for storage
with a mirrored GlusterFS volume on top. Each host runs an LXC
container holding the clustered member server, with the GlusterFS
volume mounted into the container. The tdb files live inside the
containers, not on the shared storage. I do not use ctdb to start
smbd/nmbd. I can't think what else about my setup is relevant to this
issue...
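In case it matters, the clustering-related settings boil down to
roughly the following (trimmed and illustrative, not a verbatim copy of
my config):

# /usr/local/samba/etc/ctdb/ctdbd.conf
CTDB_NODES=/usr/local/samba/etc/ctdb/nodes
CTDB_MANAGES_SAMBA=no        # I start smbd/nmbd myself

# smb.conf on the clustered member servers
[global]
        clustering = yes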
I can restore access to the files by stopping the ctdb process and
just letting the other cluster member run, but the only way I have
found so far to fix the locking.tdb file is to shut down the container;
sometimes I have to forcefully kill it from the host.
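(For completeness, by "shut down" and "forcefully kill" I mean roughly
this, where fileserver1 is a made-up container name:)

# normal shutdown of the container
lxc-stop -n fileserver1

# what I end up doing when that hangs: kill the container outright
lxc-stop -k -n fileserver1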
The errors are not confined to one member of the cluster; I have seen
them happen on both. Among the people reporting the problem, though, it
often seems to be the same files that cause it. Before I had figured
out to look at the ctdb logs, there were several cases where people
couldn't access a specific folder, and removing a specific file from
that folder fixed it.
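Next time it happens I want to try to tie the stuck record back to an
actual file before resorting to removing anything; my understanding is
that something along these lines should show it (untested on my side,
so treat it as a sketch):

# list the files samba currently holds locks on, and the pids holding them
smbstatus -L

# dump the contents of the clustered locking.tdb for comparison
ctdb catdb locking.tdb | head -n 50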
I have put many hours into googling this and nothing I have found has
turned the light bulb on in my brain. Maybe (hopefully, actually) I am
overlooking something obvious. Can anyone point me at the next step in
troubleshooting this?
--
Bob Miller
Cell: 867-334-7117
Office: 867-633-3760
www.computerisms.ca