100% CPU utilization on Solaris x86

Jeremy Allison jra at samba.org
Wed Nov 7 09:21:03 GMT 2001


On Tue, Nov 06, 2001 at 11:56:08PM -0500, Scott Moomaw wrote:
> I've been doing a little head-scratching on our problem and want to throw
> some more thoughts (confusion ;-) into the picture.  During a period this
> morning when things were going downhill, I grabbed info from various
> tools.  This evening, I looked at a truss file that I generated over a
> several minute time period and was amazed at the number of fcntl calls
> that were occurring.  The top five system calls during my truss period
> were as follows:
> 
> # calls function
> ------- --------
> 817585  fcntl
> 101686  read
> 89534   kill
> 5625    write
> 5304    getsockopt
> 
> As I noted earlier today, the processes that I looked at with pstack were
> all in a poll state; however, my truss file fills in the gaps about what's
> happening.  Given the extremely high number of fcntl calls, I'm suspicious
> of them.  Looking at truss, it appears that they're happening mostly due to
> tdb_traverse calls when processes are forked or shut down.  As the number
> of connections to the server goes up, so does the number of fcntl calls
> that must occur to traverse each tdb database.  The server that I'm seeing
> this problem on is our PDC.  Thus, it spawns a bunch of processes just to
> authenticate.  It's authenticating for four primary file servers and
> for our mail server via PAM_SMB.  I understand that the traversals
> are to clean up leftover data in the tdbs, but is there some way to reduce
> the number of fcntl calls that must occur?  I can't prove it, but I think
> that the number of fcntl calls is overwhelming on a busy system.  I don't
> know why this is exhibited on Solaris and hasn't been seen by others, but
> it's a current theory.
> 
> I don't see anything of real use in the log files.  As I noted in earlier
> messages, the "getpeername failed" errors are occurring with the
> corresponding ENOTCONN errors, but I don't see anything useful related to
> this.  I'm now thinking that this is a side-effect of requests from other
> servers simply being dropped because it's taking too long to process them
> while the system is at 100% kernel CPU utilization.
> 
> Is the theory plausible, or have I been up too long today? :-)  Could the
> volume of fcntl calls and the time required to process them be the
> problem?

No - this is a very plausible theory. In theory the cleanup on exit
is not needed, for the locking db at least, as it will be cleaned up
by other processes starting up.

Can you try this simple patch to see if it fixes the problem?

Also, from your earlier truss tests, can you tell me which tdb
database the exiting smbds are spending most of their time
cleaning? My feeling is that it will be the locking tdb.

Jeremy

Index: locking/locking.c
===================================================================
RCS file: /data/cvs/samba/source/locking/locking.c,v
retrieving revision 1.93.2.25
diff -u -r1.93.2.25 locking.c
--- locking/locking.c   20 Oct 2001 21:23:36 -0000      1.93.2.25
+++ locking/locking.c   7 Nov 2001 17:13:00 -0000
@@ -329,8 +329,10 @@
 
                /* delete any dead locks */
 
+#if 0 /* JRATEST */
                if (!open_read_only)
                        tdb_traverse(tdb, delete_fn, &check_self);
+#endif
 
                if (tdb_close(tdb) != 0)
                        return False;



