[Samba] Samba 3.5.8 (and 3.5.5) shipped with Solaris 10 keeps crashing when smbd process count hits about 500-600

Thu Sep 1 04:52:28 MDT 2011

> -----Original Message-----
> From: Thomas Nau [mailto:Thomas.Nau at uni-ulm.de]
> Sent: 1 September 2011 13:36
> To: Matti Rintala
> Cc: samba at lists.samba.org
> Subject: Re: [Samba] Samba 3.5.8 (and 3.5.5) shipped with Solaris 10
> keeps crashing when smbd process count hits about 500-600
> 
> On 09/01/2011 10:32 AM, Matti Rintala wrote:
> > Hi,
> >
> > We are running Samba on a Solaris 10 cluster as an HA service. There are
> > two nodes in the cluster; one runs Samba 3.5.8 and the other 3.5.5. The
> > Samba build is the one that ships with Solaris 10. We use Sun (Oracle)
> > LDAP for user account data, so passwd and group database information is
> > retrieved from there. Authentication is done against Windows 2008 AD.
> >
> > This Samba service serves users' home directories. The same data is
> > also shared over NFS. We have over 11000 user accounts. During the
> > summer this new service worked nicely, but now that the user count has
> > increased we are experiencing severe problems. When the smbd process
> > count reaches about 500, Samba just stops responding and we have to
> > restart it. Usually Oracle Solaris Cluster attempts the restart, but it
> > fails because one smbd process won't die even with kill -9. Nothing
> > really crashes, and at least for some time the parent smbd keeps
> > forking new children, so the process count keeps increasing.
> 
> I'm not sure whether any of the p* commands or truss will be of help in
> that state. Nevertheless, you could check the call stack and open files
> using pstack and pfiles.
> 
> If those don't help, one idea that comes to mind is to use dtrace.

Thanks for the hints.
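
The pstack/pfiles check suggested above could be scripted so that a
snapshot of every smbd is captured in one go the next time the service
stops responding. What follows is only a rough sketch, not a tested
Solaris tool: the function names (run, smbd_pids, snapshot) and the
30-second timeout are made up for illustration, and it assumes pgrep,
pstack and pfiles are on the PATH. The timeout is there because a
process that ignores kill -9 may also block the tool inspecting it.

```python
import subprocess
from typing import Callable, Dict, List

# A runner takes an argv list and returns the command's text output.
Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run cmd and return stdout+stderr as text; never raises.

    The 30 s timeout is an arbitrary assumption, chosen so that one
    stuck process cannot stall the whole snapshot.
    """
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True, timeout=30)
        return proc.stdout
    except subprocess.TimeoutExpired:
        return "[{} failed: timed out after 30 s]\n".format(cmd[0])
    except OSError as exc:
        return "[{} failed: {}]\n".format(cmd[0], exc)


def smbd_pids(runner: Runner = run) -> List[int]:
    """Return the PIDs of all processes named exactly 'smbd'."""
    out = runner(["pgrep", "-x", "smbd"])
    # Keep only lines that are purely numeric, so an error message
    # from the runner is never mistaken for a PID.
    return [int(line) for line in out.splitlines() if line.strip().isdigit()]


def snapshot(runner: Runner = run) -> Dict[int, Dict[str, str]]:
    """Collect pstack and pfiles output for every smbd process."""
    report = {}
    for pid in smbd_pids(runner):
        report[pid] = {
            "pstack": runner(["pstack", str(pid)]),
            "pfiles": runner(["pfiles", str(pid)]),
        }
    return report


if __name__ == "__main__":
    result = snapshot()
    print("found {} smbd processes".format(len(result)))
    for pid, data in sorted(result.items()):
        print("=== smbd pid {} ===".format(pid))
        print(data["pstack"])
        print(data["pfiles"])
```

Redirecting the output to a file at the moment of the hang would give
something concrete to attach to the support case, and comparing the
stack of the unkillable smbd with the others may show where it is stuck.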

> 
> 
> > We have opened a support case with Oracle, and together with them we
> > have speculated that this issue might be caused by a naming service
> > and/or LDAP problem. So we disabled nscd, but that didn't have any
> > effect. We have also switched the hosts' ldap_cachemgr to use a more
> > efficient LDAP server, without success.
> 
> I doubt that, as those are not kernel related, and the "kill -9" issue
> points to some kernel "problem".

I'm currently installing the recommended patches on one of the cluster nodes to rule out kernel-related issues or other known bugs.

Matti

> 
> > Any ideas what could be wrong, or how to debug the problem, please?
> > We are still continuing investigations with Oracle too.
> 
> Thomas