[PATCH] Fix bug #13121 - Non-smbd processes using kernel oplocks can hang smbd

Wed Nov 29 23:16:35 UTC 2017

On Thu, Nov 30, 2017 at 12:09:05AM +0100, Ralph Böhme wrote:
> On Wed, Nov 29, 2017 at 03:01:17PM -0800, Jeremy Allison wrote:
> > On Wed, Nov 29, 2017 at 11:53:31PM +0100, Ralph Böhme wrote:
> > > On Wed, Nov 29, 2017 at 02:13:34PM -0800, Jeremy Allison via samba-technical wrote:
> > > > On Thu, Nov 30, 2017 at 11:05:39AM +1300, Andrew Bartlett wrote:
> > > > > On Thu, 2017-11-30 at 06:23 +1300, Andrew Bartlett wrote:
> > > > > > On Wed, 2017-11-29 at 08:43 -0800, Jeremy Allison wrote:
> > > > > > > 
> > > > > > > Thanks for persevering with this. I'm OK with you
> > > > > > > marking it flakey now you can reproduce locally.
> > > > > > 
> > > > > > Good.
> > > > > 
> > > > > I've done the fixes required for the test, and I'll push it shortly.  
> > > > > 
> > > > > This is a 'real' flapping test, it also flaps on sn-devel if you run
> > > > > the loop for long enough.
> > > > 
> > > > Thanks a lot ! I'm very puzzled by the error 10 though - it
> > > > means a missing RT signal. I'll try and get some time to
> > > > investigate with a standalone program.
> > > 
> > > if you need an additional pair of eyes, let me know. I've been carefully going
> > > through the test looking for race conditions causing signal loss or similar, no
> > > luck so far, test seems correct. I was specifically worried about the while loop
> > > around tevent_loop_once, but with tevent there shouldn't be a race condition
> > > between signal delivery and waiting for signal. *scratches head*
> > 
> > Yeah, I simply can't see a place the signal loss can
> > occur unless it's the kernel dropping the ball.
> > 
> > Note that the signal loss occurs in the non-smbd/non-samba
> > client test code (that's the forked child from smbtorture
> > that opens the test file, gets the lease, and then waits
> > for the kernel to signal a lease break from the smbd).
> > 
> > That child is returning with an exit code of 10, meaning
> > the alarm(5) fired when we were in the pause() call instead
> > of getting the RT_SIGNAL_LEASE signal.
> 
> yeah.
> 
> One idea: can we run the test as root, enable corefile generation on the system
> and add a killall -KILL smbd to the child when the alarm fires because no
> RT_SIGNAL_LEASE was generated? We could then gdb the smbd session process
> corefile and check whether it was stuck somewhere unexpectedly.

Actually the other poster gave me an idea. We should be able
to add an extra debug message from the smbd when it gets -1,EWOULDBLOCK
on the open - and make sure that message gets logged.