fcntl spinlock in Linux?

Wed Aug 14 10:39:43 MDT 2013

Thanks for confirming it.

2013/8/14 Jeff Layton <jlayton at redhat.com>

> On Wed, 14 Aug 2013 10:52:43 -0400
> Alex Korobkin <korobkin+smb at gmail.com> wrote:
>
> > Hi.
> >
> > 2013/8/14 Jeff Layton <jlayton at redhat.com>
> >
> > > On Tue, 13 Aug 2013 17:15:43 -0400
> > > Alex Korobkin <korobkin+smb at gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I found this discussion here, while troubleshooting an issue with the
> > > > kernel getting stuck in spin_lock() when a Samba 3.6-based printserver
> > > > serves multiple Windows clients.
> > > >
> > > > https://lists.samba.org/archive/samba-technical/2013-January/090122.html
> > > >
> > > > The issue is hard to reproduce. All I can see is random printservers
> > > > crashing once in several days, with kernel (v3.2.5) being stuck in the
> > > > same spin_lock function.
> > > >
> > >
> > > Hmmm...getting stuck on a spinlock is not generally something that
> > > causes a crash. Are they actually crashing or just getting hung on that
> > > lock? Do you know what spinlock it is? Have a stack trace maybe?
> > >
> > >
> > Yes, sorry for bad wording. The machine was hung, not crashed.
> >
> > Here is a trace from SysRq L:
> >
> > [363278.604569] Call Trace:
> > [363278.604578]  [<ffffffff8117e570>] lock_flocks+0x10/0x20
> > [363278.604584]  [<ffffffff8117fbc1>] __posix_lock_file+0x41/0x5c0
> > [363278.604590]  [<ffffffff8118033b>] vfs_lock_file+0x3b/0x40
> > [363278.604596]  [<ffffffff8118064f>] fcntl_setlk+0x16f/0x320
> > [363278.604603]  [<ffffffff811493b7>] sys_fcntl+0x167/0x5c0
> > [363278.604609]  [<ffffffff8169e112>] system_call_fastpath+0x16/0x1b
> > [363278.604613] Code: c3 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 89 c2 66 c1 ea 08 38 c2 74 11 0f 1f 84 00 00 00 00 00 f3 90 <0f> b6 07 38 c2 75 f7 c9 c3 0f 1f 44 00 00 55 48 89 e5 ff 14 25
> >
> > The machine code quoted there seems to translate back into
> >
> > void lock_flocks(void)
> > {
> >        spin_lock(&file_lock_lock);
> > }
> >
> > in the kernel.
> >
> >
>
> Yep, that's the big locks spinlock alright. The big question is who is
> holding that spinlock and why they haven't released it. Doing that will
> mean trawling through all of the processes running on the box and
> tracking down which one is holding it.
>
>
I'm not entirely sure, but the CPU trace shows smbd PID 6329 holding CPU 0
and smbd PID 13879 holding CPU 1. Both of them are labelled R in the
Runnable Tasks table. Could smbd be fighting with itself for a lock?
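
To make the question concrete, here is a minimal sketch of two processes
contending for the same POSIX byte-range lock through fcntl(F_SETLK), which
is the call that enters the sys_fcntl -> fcntl_setlk path in the trace above.
It is an illustration only, not Samba code: the helper names
try_write_lock() and contend_for_lock() and the temporary file template are
made up for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: try to take a non-blocking POSIX write lock on
 * the first byte of fd. Returns 0 on success, or the errno value. */
int try_write_lock(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;

    if (fcntl(fd, F_SETLK, &fl) == -1)
        return errno;
    return 0;
}

/* Hypothetical helper: the parent locks a scratch file, then a forked
 * child asks for the same lock. Returns 1 if the child was refused,
 * 0 if it was not, and -1 if the setup failed. */
int contend_for_lock(void)
{
    char path[] = "/tmp/fcntl_demo_XXXXXX"; /* assumed scratch location */
    int fd = mkstemp(path);
    int status = 0;
    pid_t pid;

    if (fd < 0)
        return -1;
    if (try_write_lock(fd) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    if (pid == 0) {
        /* POSIX locks belong to a process and are not inherited across
         * fork(), so the child conflicts with the parent's lock. */
        int err = try_write_lock(fd);
        _exit((err == EAGAIN || err == EACCES) ? 0 : 1);
    }

    waitpid(pid, &status, 0);
    close(fd);
    unlink(path);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

With F_SETLK the losing process is refused immediately (EAGAIN or EACCES)
instead of waiting, so ordinary contention at this level should not by
itself leave a CPU spinning inside the kernel.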

> I suspect that the changes I made won't really help you. It'll likely just
> end up changing out that spinlock for the inode->i_lock.
>
> > > > The discussion suggests this patch to try with the kernel:
> > > >
> > > > https://lists.samba.org/archive/samba-technical/2013-January/090224.html
> > > >
> > > > I'm not very confident about patching the kernel, and curious if there
> > > > is anything I could try to mitigate it on Samba's side. What would you
> > > > recommend?
> > > >
> > >
> > > The 3.11 kernel will be getting a first round of patches that break up
> > > the global file_lock_lock spinlock into a per-inode lock for the most
> > > part, and makes some other scalability improvements. Without knowing
> > > what specific problem you're having I can't really say whether those
> > > changes will help you however.
> > >
> > > I'm also working on a set of patches to help address the thundering
> > > herd problem when a lock is released. That was the main problem that
> > > Volker saw. I have a scheme to address that too and a set of patches,
> > > but it's 3.12 material at best (and probably more like 3.13).
> > >
> > > --
> > > Jeff Layton <jlayton at redhat.com>
> > >
> >
> > I'm attaching a per-process stack trace as well for you to have a look.
> > Both CPUs seem to be stalled by smbd processes, please notice this line in
> > the logs:
> > [363328.100002] INFO: rcu_sched detected stall on CPU 1 (t=1232040 jiffies)
> > [363328.100002] Pid: 13879, comm: smbd Not tainted 3.2.5-xen #1
> >
> > I noticed that 3.6.18 was released today with
> > https://bugzilla.samba.org/show_bug.cgi?id=10064 fixed. I'm going to try it
> > out and see if it's related at all to this issue.
>
> I sort of doubt it. I don't think we hold the spinlock while waiting
> for the lease to be returned. This sounds more like a kernel bug of
> some sort. Maybe a lock_flocks() imbalance or something, or something
> preempted a task while it was holding that lock.
>
> I see you're running Xen there and it can do all sorts of nefarious
> things. PID 6942 looks like it might be stuck servicing an interrupt
> while holding the lock, but I can't be certain from that stack trace.
>
> --
> Jeff Layton <jlayton at redhat.com>
>