[Samba] hanging smbd(s) revisited

William Jojo jojowil at hvcc.edu
Wed Mar 1 16:20:07 GMT 2006



----- Original Message ----- 
From: "William Jojo" <jojowil at hvcc.edu>
To: "Jeremy Allison" <jra at samba.org>
Cc: <samba at lists.samba.org>; "Gerald (Jerry) Carter" <jerry at samba.org>;
"Andrew Tridgell" <tridge at samba.org>; "Jeremy Allison" <jra at samba.org>
Sent: Tuesday, February 28, 2006 4:33 PM
Subject: Re: [Samba] hanging smbd(s) revisited


>
> ----- Original Message ----- 
> From: "Jeremy Allison" <jra at samba.org>
> To: "William Jojo" <jojowil at hvcc.edu>
> Cc: <samba at lists.samba.org>; "Gerald (Jerry) Carter" <jerry at samba.org>;
> "Andrew Tridgell" <tridge at samba.org>; "Jeremy Allison" <jra at samba.org>
> Sent: Tuesday, February 28, 2006 3:25 PM
> Subject: Re: [Samba] hanging smbd(s) revisited
>
>
> > On Tue, Feb 28, 2006 at 01:30:40PM -0500, William Jojo wrote:
> > >
> > > > So we've gone back to 3.0.20 and we're stable again. I should
> > > > indicate that it's 3.0.20 with patches 9484, 9481 and 9456 to fix
> > > > Win98 dir loop, Excel shared workbook and ACLs (not necessarily in
> > > > that order).
> > > >
> > > > Since the problem manifests in the filesystem where our Samba
> > > > install is, and it appears to be a tdb (namely locking.tdb for
> > > > fd=15, but can't identify the fd=3 that spins unmercifully), I'm
> > > > wondering if *maybe* it could be the "Fix for tdb clear-if-first
> > > > race condition." or some other tdb change after 3.0.20 that traded
> > > > one bug for another? I'm guessing... :-)
> >
> > Identifying that fd would be really useful.
>
> Ok, dug it up. This is the IBM info.
>
>
> ----- Original Message ----- 
> From: Robert Elias
> To: jojowil at hvcc.edu
> Sent: Monday, February 27, 2006 12:30 PM
> Subject: Pmr#47402,180
>
>
> Bill,
>
> Thank you for your patience while I work through your questions. I ran
> this issue by our level 3 performance team and received the following
> input.
>
> The file in question is inode 12363 in /samba. Use
> 'find /samba -inum 12363' to determine the file name.
>
> I ran this by the Samba team members that work for IBM and they suggested
> the following:
>
> As a long shot, I suggest that you have him run tdbtorture (a file i/o
> testcase) from the samba source tree, as that simulates the locking that
> Samba does and may show whether we have a bug in AIX locking.
>
> Your comments or thoughts?
>
> Thanks,
>
> Robert Elias
> AIX Duty Manager
> IBM Integrated Technology Services
> 214-257-9292 - T/L 972
>
>
>
>
>
>
> [storage:/samba/3.0.21b] # find /samba -inum 12363
> /samba/3.0.21b/var/locks/locking.tdb
>
>
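
For anyone repeating the inode-to-path lookup quoted above without find(1)
to hand, it can be approximated in a few lines of Python. This is only a
minimal sketch, not something IBM or the Samba team supplied:
find_by_inode is a hypothetical helper name, and limiting matches to the
device of the starting directory is an assumption on my part (inode numbers
are only unique within one filesystem).

```python
import os


def find_by_inode(root, inode):
    """Return sorted paths under root whose inode number equals inode.

    Roughly what 'find ROOT -inum N' reports. Assumption: only entries on
    the same device as root are considered, because inode numbers repeat
    across filesystems.
    """
    root_dev = os.lstat(root).st_dev
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # Entry vanished or is unreadable; skip it.
                continue
            if st.st_ino == inode and st.st_dev == root_dev:
                matches.append(path)
    return sorted(matches)


# Example call (paths taken from the thread):
#   find_by_inode("/samba", 12363)
```

More than one path can come back if the file has hard links.
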
>
> > > > We are going to start moving to 20a, then 20b, then to 21, then
> > > > back to 21a where we started (21b did it too, haven't tried 21c
> > > > yet) after another day or two of 3.0.20 to make sure we're not
> > > > losing our minds.
> >
> > > I've looked over the logic for the acquiring/release of the lock
> > > for the locking.tdb in the 3.0.21c release code - I can't see any
> > > possible paths, error or otherwise, where the lock can be left live
> > > on a record. I'll keep looking though. When it's spinning, what is
> > > the errno that the fcntl call returns?
> >
>
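
One way to look at the errno question from outside smbd is to probe a byte
range of the file with a non-blocking fcntl lock from a separate process.
What follows is a minimal sketch, assuming Python is available on the
machine. probe_lock is a hypothetical helper, and the start/length defaults
are placeholders, since the offset of the contended record inside
locking.tdb is not known here. It only shows whether some other process
holds a conflicting lock; it does not reproduce what the spinning smbd
itself sees.

```python
import errno
import fcntl
import os


def probe_lock(path, start=0, length=1):
    """Try a non-blocking exclusive fcntl lock on path[start:start+length].

    Returns 0 if the lock was obtained (it is released again at once),
    otherwise the errno of the failed call, typically EAGAIN or EACCES
    when another process holds a conflicting lock.

    Run this from a separate process, never inside the server: closing
    any descriptor for a file drops every fcntl lock the calling process
    holds on that file.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB,
                        length, start, os.SEEK_SET)
        except OSError as exc:
            return exc.errno
        fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)
        return 0
    finally:
        os.close(fd)


# Example call (path taken from the thread):
#   rc = probe_lock("/samba/3.0.21b/var/locks/locking.tdb")
#   print(errno.errorcode.get(rc, "lock was free"))
```

Be aware that a successful probe briefly takes the lock itself, so on a
busy server it should be used sparingly.
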
> What appears to happen is pid 266946 is exiting (exited?) and some kind of
> deadlock has occurred, which shows the following in filemon.sum from the
> perfpmr that IBM had me run during the event.
>
>
> <snip>
> 9603204 hooks processed (incl. 2108 utility)
> 60.013 secs in measured interval
> Cpu utilization:  42.9%
>
> Most Active Files
> ------------------------------------------------------------------------
>   #MBs  #opns   #rds   #wrs  file                     volume:inode
> ------------------------------------------------------------------------
>  230.1      0  29492      0  pid=266946_fd=3
>   43.3      0   1588    129  pid=240270_fd=5
> </snip>
>
>
> My question to IBM was how can this happen? The above inode number is what
> was provided to me yesterday.
>
> Since the problem has subsided after moving to 3.0.20, I'm back here and
> not bugging IBM at the moment. :-|
>
> Whatever else I can get you, just say the word. :-)
>
> Do you agree with our plan to step to 20a, 20b ... ?
>
>

We've survived two days on 3.0.20, and our load is even higher than when we
started. We have over 1000 smbds running on this machine and it's not even
breaking a sweat.

Additionally, looking through source/locking/locking.c, I notice that the
diffs between 3.0.20, 20a and 20b show no changes. Then in 3.0.21 there's an
invasive change. (locking/posix.c remains unchanged through 21b.)
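
The per-file comparison across releases can be repeated mechanically. A
minimal sketch, assuming the release tarballs are unpacked side by side;
diff_between_trees is an illustrative helper rather than a Samba tool, and
the directory names in the usage comment are hypothetical.

```python
import difflib
from pathlib import Path


def diff_between_trees(old_tree, new_tree, relpath):
    """Return unified-diff lines for relpath between two source trees.

    An empty list means the file is identical in both trees.
    """
    old = Path(old_tree, relpath).read_text(errors="replace")
    new = Path(new_tree, relpath).read_text(errors="replace")
    return list(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"{old_tree}/{relpath}",
        tofile=f"{new_tree}/{relpath}",
    ))


# Example (hypothetical directory names for the unpacked releases):
#   for old, new in [("samba-3.0.20", "samba-3.0.20a"),
#                    ("samba-3.0.20b", "samba-3.0.21")]:
#       lines = diff_between_trees(old, new, "source/locking/locking.c")
#       print(old, "->", new, ":", len(lines), "diff lines")
```
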

I'm pretty certain that 20a and 20b will be fine for us based on what I see,
but I'm still learning (and comprehending :-) ) these changes while looking
for a smoking gun. Tomorrow I will put 20b (skipping 20a) in place on this
server.

I'm opening a bug because I think this one is real and load related.


Cheers,

Bill



> Cheers,
>
> Bill
>
>
> > Jeremy.
> > -- 
> > To unsubscribe from this list go to the following URL and read the
> > instructions:  https://lists.samba.org/mailman/listinfo/samba
> >
>


