100% cpu utilization of Solaris x86

Tue Nov 6 20:56:02 GMT 2001

I've been doing a little head-scratching on our problem and want to throw
some more thoughts (confusion ;-) into the picture.  During a period this
morning when things were going downhill, I grabbed info from various
tools.  This evening, I looked at a truss file that I generated over a
period of several minutes and was amazed at the number of fcntl calls
that were occurring.  The five most frequent system calls during my truss
period were as follows:

# calls function
------- --------
817585  fcntl
101686  read
89534   kill
5625    write
5304    getsockopt
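
For anyone who wants to produce the same kind of tally from their own
truss output, here is a minimal sketch.  It assumes the usual truss line
shape of an optional "pid:" prefix followed by "name(args) = result"; the
function names (tally_syscalls, top_calls) are made up for illustration.
Running truss with -c would have the tool do the counting itself, so this
is only useful for a trace file that already exists.

```python
import re
from collections import Counter

# Optional "pid:" or "pid/lwp:" prefix (as with truss -f), then the
# system call name immediately followed by "(".
SYSCALL_RE = re.compile(r"^\s*(?:\d+(?:/\d+)?:\s+)?([A-Za-z_][A-Za-z0-9_]*)\(")


def tally_syscalls(lines):
    """Count how often each system call name appears in truss output lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # signal and fault lines do not match and are skipped
            counts[match.group(1)] += 1
    return counts


def top_calls(lines, n=5):
    """Return the n most frequent calls as (name, count) pairs."""
    return tally_syscalls(lines).most_common(n)
```

With a saved trace, something like top_calls(open("truss.out"), 5) would
give the list; "truss.out" is a placeholder file name.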

As I noted earlier today, the processes that I looked at with pstack were
all in a poll state; however, my truss file fills in the gaps about what's
happening.  Given the extremely high number of fcntl calls, I'm suspicious
of it.  Looking at truss, it appears that they're happening mostly due to
tdb_traverse calls when processes are forked or shut down.  As the number
of connections to the server goes up, so does the number of fcntl calls
that must occur to traverse each tdb database.  The server that I'm seeing
this problem on is our PDC.  Thus, it spawns a bunch of processes just to
authenticate.  It's authenticating for four primary file servers and
for our mail server via PAM_SMB there.  I understand that the traversals
are to clean up leftover data in the tdbs, but is there some way to reduce
the number of fcntl calls that must occur?  I can't prove it, but I think
that the number of fcntl calls is overwhelming on a busy system.  I don't
know why this shows up on Solaris and hasn't been seen by others, but
it's my current theory.
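
To make the scaling argument concrete, here is a back-of-the-envelope
model, not a measurement.  It assumes (my assumptions, not confirmed from
the tdb source) that each traversal costs a fixed number of fcntl calls
per record, that a tdb holds roughly one record per active connection,
and that every new process triggers one traversal.  The function name and
the default of 2 calls per record (one lock, one unlock) are hypothetical.

```python
def total_fcntl_calls(connections, calls_per_record=2):
    """Estimate fcntl calls issued while `connections` processes start up.

    Assumption: the k-th process to start traverses a tdb that already
    holds k records, paying `calls_per_record` fcntl calls for each one.
    """
    total = 0
    for records in range(1, connections + 1):
        total += records * calls_per_record
    return total
```

Under these assumptions the total grows roughly with the square of the
connection count, so doubling the number of connections about quadruples
the fcntl traffic.  That would fit a busy PDC being hit much harder than
a lightly loaded server.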

I don't see anything of real use in the log files.  As I noted in earlier
messages, the "getpeername failed" errors are occurring with the
corresponding ENOTCONN errors, but I don't see anything useful related to
this.  I'm now thinking that this is a side effect of requests from other
servers just being dropped because it's taking too long to process them
while the system is at 100% kernel cpu utilization.

Is the theory plausible, or have I been up too long today? :-)  Could the
volume of fcntl and the time that is required to process them be the
problem?

Scott

------------------------------------------------------------------------
 Scott Moomaw, Network Administrator              Scott at Bridgewater.edu
 Bridgewater College, IT Center
 Bridgewater, VA  22812
 Phone (540) 828 - 8000  x5437              FAX:  (540) 828 - 5493