Kernel mutexes, SegFaults, tdb?

Wed Aug 29 09:29:17 GMT 2001

Hiya all.

We're having various problems here, which I am in the middle of diagnosing,
but I'll drop in what I've found so far, because it's been a fascinatingly
educational problem for me, and maybe others will find it so too...

Server is a Sun E450, 2x450MHz, 2G, Solaris 8, Samba-2.2.1a with a minor local
patch (a few lines to allow users who are members of a file's group to
change the readonly bit, rather than just the owner). Gigabit interface,
100Mb clients.

On Tuesday/Wednesday this week, we had huge performance issues, with the
server showing regular performance drops. When I say performance drop,
I mean 150K/s transfers, 30 second response times from the Cyrus IMAP server
on this box, etc. Normal is 8MB/s transfers and instant response from
Cyrus.

I traced the problem to massive mutex contention in the kernel, I believe in
a function called "reclock". Reclock is called from fs_frlock, which is
called from ufs_frlock, which is called from fcntl.
I also found half a dozen Sig 11 (segfault) and panics in various samba log
files. I have yet to track down the cause, as I'm having a hard time
reproducing it. Surprise, surprise, the mutex problem started about the same
time as the samba problems. Restarting samba has returned the system to
normal.

While tracing this during the afternoon, I've noticed our server has
occasional spikes in mutex contention: it averages around 1-15 spins on a
mutex per second, then every now and again it'll spike to 2000-3000.

[speculation starts here]

I've had a cursory browse of the tdb source, and it uses fcntl for locking.
Is there some kind of situation that anyone can think of which would cause
all the samba processes to try to write to a tdb database at once? Something
along the lines of a broadcast of some type which causes all the workstations
to send [something] to their respective samba processes, thereby causing them
all to try to access the tdb at the same time?
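
For anyone who hasn't played with fcntl locking, here is a minimal sketch of
the kind of contention I mean. It is not tdb's code: the Python fcntl.lockf
wrapper, the temp file, the byte offsets and the helper names are all my own
assumptions for illustration. One process holds an exclusive lock on a single
byte; a second process is refused on that byte but can lock a different one.

```python
import fcntl
import os
import tempfile


def child_can_lock(path, offset):
    """Fork a child that tries a non-blocking exclusive lock on one byte.

    Returns True if the child got the lock, False if it was refused.
    (Hypothetical helper, for illustration only.)
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            # lockf() is implemented with fcntl() record locks.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, offset)
            status = 0
        except OSError:
            status = 1  # EAGAIN/EACCES: somebody else holds that byte
        finally:
            os._exit(status)
    _, wstatus = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wstatus) == 0


def contention_demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * 16)
        # Parent takes an exclusive lock on byte 0 only.
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0)
        same_byte = child_can_lock(path, 0)   # conflicts with the parent
        other_byte = child_can_lock(path, 8)  # different range, no conflict
        return same_byte, other_byte
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(contention_demo())
```

If the processes used blocking locks instead (LOCK_EX without LOCK_NB), each
waiter would sit inside the kernel's fcntl path until the holder let go, which
is roughly the pile-up I'm wondering about.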

It also seems to me that the mutex problems we had appear to be a result of
the segfaults in samba. What I don't understand is how samba dying could
leave mutexes locked in the kernel! More likely, I guess, is that samba leaves
a tdb file locked when it segfaults, causing the other smb processes to call
fcntl much more often as they try to obtain the lock?
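
One way to test that guess, as a sketch only and nothing taken from samba:
POSIX says a process's fcntl locks are removed when the process terminates, so
it should be possible to watch what happens to a lock when its holder dies
without unlocking. Below, a child takes a lock and is then killed with SIGKILL
(standing in for the segfault). The function names, the temp file and the pipe
handshake are my own assumptions, and I have not run this on Solaris 8.

```python
import fcntl
import os
import signal
import tempfile


def try_lock(fd):
    """Non-blocking attempt at an exclusive lock on byte 0."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0)
        return True
    except OSError:
        return False


def lock_after_crash_demo():
    fd, path = tempfile.mkstemp()
    os.write(fd, b"\0")
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: take the lock, tell the parent, then hang around holding it.
        try:
            cfd = os.open(path, os.O_RDWR)
            fcntl.lockf(cfd, fcntl.LOCK_EX, 1, 0)
            os.write(w, b"L")
            signal.pause()
        finally:
            os._exit(1)
    try:
        os.close(w)
        os.read(r, 1)                        # wait until the child has the lock
        blocked_while_alive = not try_lock(fd)
        os.kill(pid, signal.SIGKILL)         # abrupt death, no unlock call
        os.waitpid(pid, 0)
        acquired_after_death = try_lock(fd)  # did the kernel drop the lock?
        return blocked_while_alive, acquired_after_death
    finally:
        os.close(r)
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(lock_after_crash_demo())
```

If the second value comes back True, the dead process's lock did not outlive
it, and the stale-lock explanation would need another look.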

Anyone have any thoughts?

Stay cool...
T.