[Samba] hanging smbd(s) revisited

William Jojo jojowil at hvcc.edu
Tue Feb 28 18:30:40 GMT 2006


----- Original Message ----- 
From: "William Jojo" <jojowil at hvcc.edu>
To: "Gerald (Jerry) Carter" <jerry at samba.org>
Cc: <samba at lists.samba.org>; "Andrew Tridgell" <tridge at samba.org>; "Jeremy
Allison" <jra at samba.org>
Sent: Saturday, February 25, 2006 11:38 AM
Subject: Re: [Samba] hanging smbd(s) revisited


>
>
> On Sat, 25 Feb 2006, Gerald (Jerry) Carter wrote:
>
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Matt Johnson wrote:
> >> Hi,
> >>
> >> Just to add -- our fcntl locking issue is on Linux, we've
> >> seen it on 2.6.9, 2.6.13.1 and 2.6.15.3, running Mandrake 10.2.
> >> locking.tdb is on a local disk. All smbd child processes are
> >> blocked on apparently the same fcntl when it happens.
> >
> > Hmmm...ok.  That ruins my theory.  I thought you were on AIX
> > as well.  And just to make sure, you are running Samba 3.0.21b
> > as well?
> >
> >
>
> Is it possible you're on the right track, but it manifests differently on
> our two systems? :-)
>
> Last semester we were running 3.0.20 on this machine. We've been toying
> with going back to that code base to see if it stabilizes. (Of course
> putting deadtime back to zero for the test.) That still has me confused as
> to why the non-zero deadtime seems to make the whole environment more
> stable. It's the *only* modification we've made that has had any impact at
> all.
>
> If 3.0.20 fails, then it's most likely a kernel bug. It's so hard to get
> IBM to move on this without *ahem* additional compensation. I even got the
> duty manager involved on the PMR since I got the brush-off.
>
> I would appreciate any more info you have on the fcntl bug you mentioned
> so I can run it by IBM. I think on Monday we'll try 3.0.20 and see what
> happens.
>

So we've gone back to 3.0.20 and we're stable again. I should indicate that
it's 3.0.20 with patches 9484, 9481 and 9456 to fix the Win98 dir loop, Excel
shared workbook and ACL problems (not necessarily in that order).

Since the problem manifests in the filesystem where our Samba install is,
and it appears to involve a tdb (namely locking.tdb on fd=15; I can't
identify the fd=3 that spins unmercifully), I'm wondering if *maybe* it could
be the "Fix for tdb clear-if-first race condition." or some other tdb change
after 3.0.20 that traded one bug for another? I'm guessing... :-)
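
For anyone who wants to check whether some process is sitting on an fcntl
lock in a file such as locking.tdb, here is a minimal sketch in Python. It is
not taken from Samba or tdb; the function name is_fcntl_locked is made up for
illustration, and it assumes a POSIX system where Python's fcntl.lockf() wraps
the fcntl() record-locking calls. Note that the probe briefly takes an
exclusive lock over the whole file itself when nobody else holds one, so it is
something to try on a test box rather than a busy production server.

```python
import errno
import fcntl
import os


def is_fcntl_locked(path):
    """Return True if another process holds a conflicting fcntl lock on path.

    Hypothetical helper: tries a non-blocking exclusive lock over the whole
    file and releases it immediately if the attempt succeeds.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # LOCK_NB makes the call fail at once instead of blocking.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            # EACCES or EAGAIN means some other process holds a lock.
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return True
            raise
        # Nobody else had it; drop our probe lock straight away.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Calling is_fcntl_locked() on the path to locking.tdb (wherever your install
keeps it) only says that a lock exists, not which pid owns it; identifying the
holder would need F_GETLK or an OS-level tool.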

We upgraded from 3.0.20 to 3.0.21a for production. It never showed up in
development for any version after 3.0.20 since we can't generate that kind
of random load, so of course we thought everything was cool.

Again, this only happens under heavy load. It happens daily and clears up
with a bounce of smbd. It seems to be related to a few hundred students
logging off and a few hundred more logging on (classes are switching). We
also noticed that several hundred, and in some cases a couple thousand,
cookie files are being transferred around in roaming profiles per student
(they were not redirected).

After another day or two on 3.0.20 to make sure we're not losing our minds,
we are going to step forward through 20a, 20b and 21, then back to 21a where
we started (21b did it too; we haven't tried 21c yet).

Environment: AIX 5.2 TL-08-1, Windows XP-SP2 clients. Storage is a CX-700 EMC
SAN (which rocks, btw).

Anything more I can provide, let me know. :-)


Cheers,

Bill



> cheers,
>
> Bill
>
>
> >
> >
> >
> >
> > cheers, jerry
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.2 (MingW32)
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >
> > iD8DBQFEAIVqIR7qMdg1EfYRAhY9AJsGDSjVGISuB7s5gXiN7SROGskv5wCcCj/C
> > vk+23YRv9n1CWpYkQRXO17o=
> > =dGU1
> > -----END PGP SIGNATURE-----
> > -- 
> > To unsubscribe from this list go to the following URL and read the
> > instructions:  https://lists.samba.org/mailman/listinfo/samba
> >


