fcntl spinlock in Linux?

Jeff Layton jlayton at redhat.com
Wed Aug 14 09:35:47 MDT 2013


On Wed, 14 Aug 2013 10:52:43 -0400
Alex Korobkin <korobkin+smb at gmail.com> wrote:

> Hi.
> 
> 2013/8/14 Jeff Layton <jlayton at redhat.com>
> 
> > On Tue, 13 Aug 2013 17:15:43 -0400
> > Alex Korobkin <korobkin+smb at gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I found this discussion here, while troubleshooting an issue with kernel
> > > getting stuck in spin_lock() when a Samba 3.6-based printserver serves
> > > multiple Windows clients.
> > > https://lists.samba.org/archive/samba-technical/2013-January/090122.html
> > >
> > > The issue is hard to reproduce. All I can see is random printservers
> > > crashing once in several days, with kernel (v3.2.5) being stuck in the
> > > same spin_lock function.
> > >
> >
> > Hmmm...getting stuck on a spinlock is not generally something that
> > causes a crash. Are they actually crashing or just getting hung on that
> > lock? Do you know what spinlock it is? Have a stack trace maybe?
> >
> >
> Yes, sorry for bad wording. The machine was hung, not crashed.
> 
> Here is a trace from SysRq L:
> 
> [363278.604569] Call Trace:
> [363278.604578]  [<ffffffff8117e570>] lock_flocks+0x10/0x20
> [363278.604584]  [<ffffffff8117fbc1>] __posix_lock_file+0x41/0x5c0
> [363278.604590]  [<ffffffff8118033b>] vfs_lock_file+0x3b/0x40
> [363278.604596]  [<ffffffff8118064f>] fcntl_setlk+0x16f/0x320
> [363278.604603]  [<ffffffff811493b7>] sys_fcntl+0x167/0x5c0
> [363278.604609]  [<ffffffff8169e112>] system_call_fastpath+0x16/0x1b
> [363278.604613] Code: c3 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5
> f0 66 0f c1 07 89 c2 66 c1 ea 08 38 c2 74 11 0f 1f 84 00 00 00 00 00 f3 90
> <0f> b6 07 38 c2 75 f7 c9 c3 0f 1f 44 00 00 55 48 89 e5 ff 14 25
> 
> Machine code quoted there can seemingly be translated back into
> 
> void lock_flocks(void)
> {
>        spin_lock(&file_lock_lock);
> }
> 
> in the kernel.
> 
> 

Yep, that's the global file-locking spinlock (file_lock_lock) all right.
The big question is which task is holding that spinlock and why it hasn't
released it. Answering that will mean trawling through the stack traces
of all of the processes running on the box and tracking down which one
is holding it.
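
For what it's worth, the byte sequence in that Code: line (a lock xadd
followed by a pause loop) looks like a ticket spinlock. Here is a rough
userspace sketch of the idea using C11 atomics. The names (ticket_lock,
ticket_take, ticket_wait, ticket_unlock) are made up for illustration,
and this is not the kernel's implementation. In particular, the spin
limit is only there so the demo terminates; the real spin_lock() has no
such limit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a ticket spinlock, not kernel code. */
struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently allowed to hold the lock */
};

static void ticket_init(struct ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

/* Atomically draw a ticket (the role "lock xadd" plays in the real thing). */
static unsigned int ticket_take(struct ticket_lock *l)
{
    return atomic_fetch_add(&l->next, 1);
}

/*
 * Poll until ticket t is being served, giving up after max_spins extra
 * polls. Returns true once the caller owns the lock. The bound is an
 * assumption made for the demo; a real spinlock waiter never gives up.
 */
static bool ticket_wait(struct ticket_lock *l, unsigned int t,
                        unsigned long max_spins)
{
    for (unsigned long i = 0; i <= max_spins; i++) {
        if (atomic_load(&l->owner) == t)
            return true;
    }
    return false;
}

/* Release: serve the next ticket. Must only be called by the holder. */
static void ticket_unlock(struct ticket_lock *l)
{
    /* At least one ticket must be outstanding, else nobody holds the lock. */
    assert(atomic_load(&l->owner) != atomic_load(&l->next));
    atomic_fetch_add(&l->owner, 1);
}
```

If one task draws ticket 0 and never calls ticket_unlock(), a second
task holding ticket 1 just keeps polling, which is the symptom you see
with CPUs stuck in lock_flocks().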

I suspect that the changes I made won't really help you. They'll likely
just end up swapping that spinlock for the inode->i_lock.

> > > The discussion suggests this patch to try with the kernel:
> > > https://lists.samba.org/archive/samba-technical/2013-January/090224.html
> > >
> > > I'm not very confident about patching the kernel, and curious if there is
> > > anything I could try to mitigate it on Samba's side. What would you
> > > recommend?
> > >
> >
> > The 3.11 kernel will be getting a first round of patches that breaks up
> > the global file_lock_lock spinlock into a per-inode lock for the most
> > part, and makes some other scalability improvements. Without knowing
> > what specific problem you're having I can't really say whether those
> > changes will help you however.
> >
> > I'm also working on a set of patches to help address the thundering
> > herd problem when a lock is released. That was the main problem that
> > Volker saw. I have a scheme to address that too and a set of patches,
> > but it's 3.12 material at best (and probably more like 3.13).
> >
> > --
> > Jeff Layton <jlayton at redhat.com>
> >
> 
> I'm attaching a per-process stack trace as well for you to have a look.
> Both CPUs seem to be stalled by smbd processes, please notice this line in
> the logs:
> [363328.100002] INFO: rcu_sched detected stall on CPU 1 (t=1232040 jiffies)
> [363328.100002] Pid: 13879, comm: smbd Not tainted 3.2.5-xen #1
> 
> I noticed that 3.6.18 was released today with
> https://bugzilla.samba.org/show_bug.cgi?id=10064 fixed. I'm going to try it
> out and see if it's related at all to this issue.

I sort of doubt it. I don't think we hold the spinlock while waiting
for the lease to be returned. This sounds more like a kernel bug of
some sort. Maybe it's a lock_flocks() imbalance, or maybe something
preempted a task while it was holding that lock.
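
To illustrate what I mean by an imbalance: a code path that takes the
lock and then returns without releasing it. Here is a toy userspace
sketch. All of the names (toy_trylock, toy_unlock, buggy_update,
fixed_update) are hypothetical and nothing here is taken from
fs/locks.c. It uses a trylock so that the demo reports the leaked lock
instead of hanging the way a real spinlock waiter would.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock state: true while some caller holds the lock. */
static atomic_bool toy_held = false;

/* Try to take the lock once; returns true on success. */
static bool toy_trylock(void)
{
    return !atomic_exchange(&toy_held, true);
}

/* Release the lock; it is a bug to call this when the lock is not held. */
static void toy_unlock(void)
{
    assert(atomic_load(&toy_held));
    atomic_store(&toy_held, false);
}

/* Buggy: the error path returns with the lock still held. */
static int buggy_update(int value)
{
    if (!toy_trylock())
        return -2;          /* lock already held by someone else */
    if (value < 0)
        return -1;          /* BUG: missing toy_unlock() */
    toy_unlock();
    return 0;
}

/* Fixed: every path after a successful lock goes through the unlock. */
static int fixed_update(int value)
{
    int ret = 0;

    if (!toy_trylock())
        return -2;
    if (value < 0)
        ret = -1;
    toy_unlock();
    return ret;
}
```

After a single call to buggy_update() with a negative value, every
later caller sees the lock as held. With a real spinlock those callers
would spin indefinitely rather than get an error back.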

I see you're running Xen there and it can do all sorts of nefarious
things. PID 6942 looks like it might be stuck servicing an interrupt
while holding the lock, but I can't be certain from that stack trace.

-- 
Jeff Layton <jlayton at redhat.com>

