[PATCH] Fix bug #13121 - Non-smbd processes using kernel oplocks can hang smbd

Ralph Böhme slow at samba.org
Wed Nov 29 23:09:05 UTC 2017


On Wed, Nov 29, 2017 at 03:01:17PM -0800, Jeremy Allison wrote:
> On Wed, Nov 29, 2017 at 11:53:31PM +0100, Ralph Böhme wrote:
> > On Wed, Nov 29, 2017 at 02:13:34PM -0800, Jeremy Allison via samba-technical wrote:
> > > On Thu, Nov 30, 2017 at 11:05:39AM +1300, Andrew Bartlett wrote:
> > > > On Thu, 2017-11-30 at 06:23 +1300, Andrew Bartlett wrote:
> > > > > On Wed, 2017-11-29 at 08:43 -0800, Jeremy Allison wrote:
> > > > > > 
> > > > > > Thanks for persevering with this. I'm OK with you
> > > > > > marking it flakey now you can reproduce locally.
> > > > > 
> > > > > Good.
> > > > 
> > > > I've done the fixes required for the test, and I'll push it shortly.  
> > > > 
> > > > This is a 'real' flapping test, it also flaps on sn-devel if you run
> > > > the loop for long enough.
> > > 
> > > Thanks a lot ! I'm very puzzled by the error 10 though - it
> > > means a missing RT signal. I'll try and get some time to
> > > investigate with a standalone program.
> > 
> > if you need an additional pair of eyes, let me know. I've been carefully going
> > through the test looking for race conditions causing signal loss or similar, no
> > luck so far, test seems correct. I was specifically worried about the while loop
> > around tevent_loop_once, but with tevent there shouldn't be a race condition
> > between signal delivery and waiting for signal. *scratches head*
> 
> Yeah, I simply can't see a place the signal loss can
> occur unless it's the kernel dropping the ball.
> 
> Note that the signal loss occurs in the non-smbd/non-samba
> client test code (that's the forked child from smbtorture
> that opens the test file, gets the lease, and then waits
> for the kernel to signal a lease break from the smbd).
> 
> That child is returning with an exit code of 10, meaning
> the alarm(5) fired when we were in the pause() call instead
> of getting the RT_SIGNAL_LEASE signal.

yeah.

One idea: can we run the test as root, enable corefile generation on the system
and add a killall KILL smbd to the child when the alarm fires because no
RT_SIGNAL_LEASE was generated? We could then gdb the smbd session process
corefile and check whether it was stuck somewhere unexpectedly.
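
A rough sketch of what the child side could look like, for illustration only and
not the actual smbtorture code. Assumptions: RT_SIGNAL_LEASE is SIGRTMIN+1, the
function name and the abort_pid parameter are hypothetical, and the caller
already knows the pid of the smbd session process. Two deliberate differences
from the description above: it waits with sigsuspend() on blocked signals
instead of a bare pause(), so the sketch itself has no check-then-sleep window,
and it sends SIGABRT rather than KILL, because SIGKILL's default action
terminates the process without writing a core.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks (asm-generic Linux values) in case the headers were already
 * included without _GNU_SOURCE. */
#ifndef F_SETSIG
#define F_SETSIG 10
#endif
#ifndef F_SETLEASE
#define F_SETLEASE 1024
#endif

/* Assumption: the signal number used for kernel lease breaks. */
#define RT_SIGNAL_LEASE (SIGRTMIN + 1)

static volatile sig_atomic_t got_break, got_alarm;

static void on_signal(int sig)
{
	if (sig == SIGALRM) {
		got_alarm = 1;
	} else {
		got_break = 1;
	}
}

/*
 * Take a write lease on path and wait for the kernel to break it.
 * Returns 0 if the lease break signal arrived, 10 if the alarm fired
 * first, 1 on setup failure. If the alarm fires and abort_pid > 0
 * (hypothetical: pid of the smbd session process), that process gets
 * SIGABRT so it can leave a core behind.
 */
int wait_for_lease_break(const char *path, unsigned int timeout_secs,
			 pid_t abort_pid)
{
	struct sigaction sa;
	sigset_t block, old, wait_mask;
	int fd, rc = 1;

	if (timeout_secs == 0) {
		return 1;
	}
	got_break = got_alarm = 0;

	/* Block both signals so they stay pending until we actually wait. */
	sigemptyset(&block);
	sigaddset(&block, RT_SIGNAL_LEASE);
	sigaddset(&block, SIGALRM);
	if (sigprocmask(SIG_BLOCK, &block, &old) != 0) {
		return 1;
	}
	wait_mask = old;
	sigdelset(&wait_mask, RT_SIGNAL_LEASE);
	sigdelset(&wait_mask, SIGALRM);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_signal;
	sigemptyset(&sa.sa_mask);
	sigaction(RT_SIGNAL_LEASE, &sa, NULL);
	sigaction(SIGALRM, &sa, NULL);

	fd = open(path, O_RDWR);
	if (fd != -1 &&
	    fcntl(fd, F_SETSIG, RT_SIGNAL_LEASE) == 0 &&
	    fcntl(fd, F_SETLEASE, F_WRLCK) == 0) {
		alarm(timeout_secs);
		while (!got_break && !got_alarm) {
			/* Atomically unblock the two signals and sleep. */
			sigsuspend(&wait_mask);
		}
		alarm(0);
		rc = got_break ? 0 : 10;
		if (rc == 10 && abort_pid > 0) {
			kill(abort_pid, SIGABRT);
		}
	}
	if (fd != -1) {
		close(fd);	/* closing the fd also drops the lease */
	}
	sigprocmask(SIG_SETMASK, &old, NULL);
	return rc;
}
```

The forked child would then do something like
exit(wait_for_lease_break(fname, 5, smbd_pid)), so an exit code of 10 still
means "alarm fired, no lease break seen". Getting a usable core additionally
depends on the core size limit and on smbd being dumpable.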

-slow

-- 
Ralph Boehme, Samba Team       https://samba.org/
Samba Developer, SerNet GmbH   https://sernet.de/en/samba/


