fcntl spinlock in Linux?

Jeff Layton jlayton at redhat.com
Wed Aug 14 10:56:14 MDT 2013


On Wed, 14 Aug 2013 12:39:43 -0400
Alex Korobkin <korobkin+smb at gmail.com> wrote:

> Thanks for confirming it.
> 
> 
> 2013/8/14 Jeff Layton <jlayton at redhat.com>
> 
> > On Wed, 14 Aug 2013 10:52:43 -0400
> > Alex Korobkin <korobkin+smb at gmail.com> wrote:
> >
> > > Hi.
> > >
> > > 2013/8/14 Jeff Layton <jlayton at redhat.com>
> > >
> > > > On Tue, 13 Aug 2013 17:15:43 -0400
> > > > Alex Korobkin <korobkin+smb at gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I found this discussion here, while troubleshooting an issue with
> > > > > the kernel getting stuck in spin_lock() when a Samba 3.6-based
> > > > > print server serves multiple Windows clients.
> > > > >
> > > > > https://lists.samba.org/archive/samba-technical/2013-January/090122.html
> > > > >
> > > > > The issue is hard to reproduce. All I can see is random print
> > > > > servers crashing once in several days, with the kernel (v3.2.5)
> > > > > stuck in the same spin_lock function.
> > > > >
> > > >
> > > > Hmmm...getting stuck on a spinlock is not generally something that
> > > > causes a crash. Are they actually crashing or just getting hung on that
> > > > lock? Do you know what spinlock it is? Have a stack trace maybe?
> > > >
> > > >
> > > Yes, sorry for the bad wording. The machine was hung, not crashed.
> > >
> > > Here is a trace from SysRq L:
> > >
> > > [363278.604569] Call Trace:
> > > [363278.604578]  [<ffffffff8117e570>] lock_flocks+0x10/0x20
> > > [363278.604584]  [<ffffffff8117fbc1>] __posix_lock_file+0x41/0x5c0
> > > [363278.604590]  [<ffffffff8118033b>] vfs_lock_file+0x3b/0x40
> > > [363278.604596]  [<ffffffff8118064f>] fcntl_setlk+0x16f/0x320
> > > [363278.604603]  [<ffffffff811493b7>] sys_fcntl+0x167/0x5c0
> > > [363278.604609]  [<ffffffff8169e112>] system_call_fastpath+0x16/0x1b
> > > [363278.604613] Code: c3 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89
> > > e5 f0 66 0f c1 07 89 c2 66 c1 ea 08 38 c2 74 11 0f 1f 84 00 00 00 00 00
> > > f3 90 <0f> b6 07 38 c2 75 f7 c9 c3 0f 1f 44 00 00 55 48 89 e5 ff 14 25
> > >
> > > The machine code quoted there can seemingly be translated back to
> > >
> > > void lock_flocks(void)
> > > {
> > >        spin_lock(&file_lock_lock);
> > > }
> > >
> > > in the kernel.
> > >
> > >
> >
> > Yep, that's the big file-locks spinlock alright. The big question is who
> > is holding that spinlock and why they haven't released it. Answering
> > that will mean trawling through all of the processes running on the box
> > and tracking down which one is holding it.
> >
> >
> I'm not entirely sure, but CPU trace shows that smbd PID 6329 is holding
> CPU 0, while smbd PID 13879 is holding CPU 1. Both of them are labelled R
> in the Runnable Tasks table. Could smbd be fighting with itself for a lock?
> 

Spinlocks are intended to be held only for short periods of time. If
you're hung on one, something is wrong. This is not a userland problem,
but rather a problem in your kernel.

> 
> > I suspect that the changes I made won't really help you. They'll likely
> > just end up swapping that spinlock for the inode->i_lock.
> >
> > > > > The discussion suggests this patch to try with the kernel:
> > > > >
> > > > > https://lists.samba.org/archive/samba-technical/2013-January/090224.html
> > > > >
> > > > > I'm not very confident about patching the kernel, and curious if
> > > > > there is anything I could try to mitigate it on Samba's side. What
> > > > > would you recommend?
> > > > >
> > > >
> > > > The 3.11 kernel will be getting a first round of patches that break
> > > > up the global file_lock_lock spinlock into a per-inode lock for the
> > > > most part, and make some other scalability improvements. Without
> > > > knowing what specific problem you're having, however, I can't really
> > > > say whether those changes will help you.
> > > >
> > > > I'm also working on a set of patches to help address the thundering
> > > > herd problem when a lock is released; that was the main problem that
> > > > Volker saw. I have a scheme and a set of patches to address it, but
> > > > they're 3.12 material at best (and probably more like 3.13).
> > > >
> > > > --
> > > > Jeff Layton <jlayton at redhat.com>
> > > >
> > >
> > > I'm attaching a per-process stack trace as well for you to have a look.
> > > Both CPUs seem to be stalled by smbd processes; please notice these
> > > lines in the logs:
> > >
> > > [363328.100002] INFO: rcu_sched detected stall on CPU 1 (t=1232040 jiffies)
> > > [363328.100002] Pid: 13879, comm: smbd Not tainted 3.2.5-xen #1
> > >
> > > I noticed that Samba 3.6.18 was released today with
> > > https://bugzilla.samba.org/show_bug.cgi?id=10064 fixed. I'm going to
> > > try it out and see if it's related at all to this issue.
> >
> > I sort of doubt it. I don't think we hold the spinlock while waiting
> > for the lease to be returned. This sounds more like a kernel bug of
> > some sort: maybe a lock_flocks() imbalance, or something preempting a
> > task while it was holding that lock.
> >
> > I see you're running Xen there, and it can do all sorts of nefarious
> > things. PID 6942 looks like it might be stuck servicing an interrupt
> > while holding the lock, but I can't be certain from that stack trace.
> >
> > --
> > Jeff Layton <jlayton at redhat.com>
> >


-- 
Jeff Layton <jlayton at redhat.com>

