Project: Samba Performance Monitoring

David Collier-Brown davec-b at
Mon Jul 10 14:06:29 GMT 2006

   I'll echo James' suggestion of looking at Performance Co-Pilot
(pcp), and toss in a specific suggestion about correlation.

On 7/10/06, jijjy81 at <jijjy81 at> wrote:
>> I'm a student at the University of Duisburg-Essen and am writing my
>> diploma thesis on "predefined rules for Samba performance monitoring".
>> My task now is to develop predefined rules to detect abnormalities in
>> real time.
>> First I have to understand the "normal" behaviour of the system, e.g.
>> %IDL (percentage of time the system spent in idle mode, averaged over all
>> available CPU instances), system calls, ..., and the behaviour of the
>> processes (smbd), e.g. %CPU (percentage of recent CPU time used by the
>> process), RSS, SIZE, IOC/s (input/output characters per second).
>> To get these values I repeated downloads of files of several different
>> sizes from the Samba server with a Windows client. Collecting and
>> analysing the results with statistical methods, I found that the values
>> of each parameter (%CPU, IOC/s, ...) were not constant. According to
>> correspondence with Mr. Lendecke, the cause of this is the Windows client.
>> Now my question: why do the abnormalities in the values depend on
>> the Windows client?
> Client-side caching is the most likely cause.

	There will be another abnormality in your data once you start to
encounter heavy load, past 100% utilization of the smbd process
in question.

   Basically you're writing an equation like
	1 TPS = X CPU + Y Mem + Z Disk I/Os + W Network I/Os + ....
in this initial case for a single transaction.
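   As a sketch of how you might estimate coefficients like X and Z from
measurements, you can regress each resource's usage against throughput;
the slope is the cost of one transaction in that resource. All the
numbers below are invented for illustration, not real Samba data:

```python
# Hypothetical sketch: estimate per-transaction resource costs (X, Z in
# the equation above) from per-interval measurements. Numbers are made up.
import numpy as np

tps  = np.array([100.0, 200.0, 300.0, 400.0])   # transactions/sec
cpu  = np.array([12.0, 23.0, 36.0, 47.0])       # %CPU used by smbd
dios = np.array([210.0, 395.0, 610.0, 805.0])   # disk I/Os per second

# Slope of usage vs. throughput = cost of one transaction in that resource
x_cpu  = np.polyfit(tps, cpu, 1)[0]    # CPU cost per transaction
z_disk = np.polyfit(tps, dios, 1)[0]   # disk I/Os per transaction

print("X (CPU per TPS)  = %.3f" % x_cpu)
print("Z (disk per TPS) = %.3f" % z_disk)
```

The fit only makes sense below saturation, which is the point of the
caveat further down.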

   To get good initial numbers, I'd recommend using a non-Windows
client like smbclient, or turning oplocks off on the Samba share,
to avoid client-side caching.
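   For reference, oplocks can be disabled per share in smb.conf
(a minimal sketch; the share name and path are made up):

```
[diploma-test]
    path = /export/test
    oplocks = no
    level2 oplocks = no
```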

   However, there is an inflection point in the throughput and response
time curves after 100% utilization, after which the equation is no
longer valid: the response time will grow without bound while
the resource usage stays the same.

   The curves look like the attached GIFs: throughput rises to a maximum
and then levels off, while response time starts off almost level and then
rises abruptly after the load at which throughput hits its maximum,
called N* in the diagrams.
   This happens because there is always a point at which the application
can't handle any more work, and bottlenecks on resources or code-path length.
Past that point, work sits in queue, waiting for the program to
get around to it, and that waiting is what inflates the total response
time.  That, in turn, causes the customers to throw bricks at you (;-))
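    A minimal way to see both shapes is a single-server queue (M/M/1)
approximation, assuming a fixed service time S per transaction:
response time R = S / (1 - rho), which is nearly flat at low
utilization rho and grows without bound as rho approaches 1, while
throughput can never exceed 1/S. The service time below is an assumed
number, not a Samba measurement:

```python
# Sketch of the two curves with an M/M/1 approximation. S is an assumed
# fixed service time per transaction; the model produces the shapes
# described above (flat, then a sharp knee near saturation).
S = 0.010  # seconds of service per transaction (assumed)

for offered in [10, 50, 90, 99]:      # offered load, transactions/sec
    rho = offered * S                 # utilization of the server
    resp = S / (1.0 - rho)            # response time blows up as rho -> 1
    print("load=%3d TPS  util=%.2f  response=%7.1f ms"
          % (offered, rho, resp * 1e3))
```

At 50% utilization the response time has only doubled; at 99% it is a
hundred times the bare service time, which is the knee at N*.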

    You can get a weak indication that the program is bottlenecking by
running a benchmark and recording the maximum CPU it uses. If in production
it uses that much again, you'll know it's overloaded.  Alas, that means
your benchmark has to be an almost perfect match for the load your
customers are generating.  You could easily bottleneck at 15% CPU on a
disk-intensive operation on a given server, yet be able to use 25% without
bottlenecking at all if most of the requests were not disk-I/O related.

   It's better to record response times on a per-operation basis,
using a popular operation like "read file", and then poll the
application every so often with a light-weight read to see if
it's waiting.  If you're collecting TPS and response times, drop me
a line and I'll see if I can help.
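   The polling idea can be sketched generically: time a cheap probe
operation every so often and flag when its response time departs far
from a measured baseline. Everything here (the probe, the threshold
factor) is a made-up placeholder, not Samba code:

```python
import time

def probe_slow(operation, baseline_s, factor=5.0):
    """Run a cheap probe operation and return True if its response time
    exceeds factor * baseline_s, i.e. the server is likely queueing."""
    start = time.monotonic()
    operation()                  # e.g. a light-weight read on the share
    elapsed = time.monotonic() - start
    return elapsed > factor * baseline_s

# Example with a stand-in operation (a sleep instead of a real read)
fast = lambda: time.sleep(0.001)
print(probe_slow(fast, baseline_s=0.001, factor=50.0))
```

The generous factor is deliberate: per the curves above, response time
near saturation is many multiples of the unloaded baseline, so a small
factor would just flag measurement jitter.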

David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb at           |                      -- Mark Twain
(416) 223-5943
-------------- next part --------------
[Attachments scrubbed: jack-5.gif (image/gif, 2227 bytes), jack-7.gif (image/gif, 2219 bytes)]

More information about the samba-technical mailing list