[PATCH] Add infrastructure for gathering statistics on SMB messages

Wed Feb 4 18:06:42 GMT 2009

Todd Stecher wrote:
> Isilon has a centralized performance monitoring system for all of its
> wire protocols, including CIFS, and requires entry points into the SMBD
> process for gathering interesting SMB message characteristics such as
> the per-command message size, command, sub-command, some specific
> ioctls, response latency, and the connected identity.  This patch adds a
> pluggable module system to gather that data. 
>
> 1. Perfcount module system
>
> Unfortunately, reusing VFS wasn't an option.  Many of the interesting
> statistics are gathered during pre-connect operations.   It was
> reasonable to assume that other folks might be interested in specific
> SMB message details without enabling full profiling.  
>
> This patch introduces a lightweight pluggable perfcounter system to
> allow for custom statistics gathering through samba modules.  Every
> perfcount module must support each of the monitoring interfaces
> (exposed through struct smb_perfcount_handlers in smb_perfcount.h).
> There can be only 1 perfcount module for a process - they cannot be
> chained.
>
> 2. Data gathering
>
> Each handler has a corresponding macro which checks for the presence of
> a perfcount module, and stores interesting state on the parent structure
> (typically a request structure, but also delayed / pending messages).
> The handler for each monitored item (command, subcommand, ioctl,
> lengths, identity) has its own macro - they are very granular, for
> example:
>
> #define SMB_PERFCOUNT_SET_OP(_pcd_,_op_) 
> #define SMB_PERFCOUNT_SET_MSGLEN_IN(_pcd_,_in_)
>
> The patch intersperses macros throughout the codebase to track the
> lifetime and characteristics of an individual request, ending when the
> request is sent to the client (srv_send_smb / sendfile()).
>
> 3. OneFS perfcount module
>
> The OneFS / Isilon implementation of a perfcount module is also attached
> as an example.
>
>
> Comments appreciated.
>
> Tx,
> Todd 
>
>
>   
Something that may be important to a storage-oriented company is
distinguishing between total response time and transfer time.

In general, the time between the request and the first response, the
latency, is the part that behaves like a queue: it increases slowly until
you near 100% utilization and then rises sharply when the
system "hits the wall".  Capacity planners pay a lot of attention to it
because it can be modeled fairly accurately with a queue.
It covers all the processing that has to happen before the real transfer starts.

The other part of the response time, transfer time, is interesting
when you're doing large transfers of data, such as when reading
a remote disk. It should start rising more or less with demand,
become quite linear and predictable at higher demand levels,
and then hit some internal limit, a different wall.

Mixing these up can cause your model to lie and your brain to
explode. Been there, done that (;-))

I recommend that for transfers you measure the start time (t0),
the time of the first response (t1) and the time of the last response (t2),
not just t0 and t2. From these you can compute
    latency       ::= t1 - t0
    transfer time ::= t2 - t1
    response time ::= latency + transfer time
    throughput    ::= bytes / transfer time
and a count of these operations per second is TPS, transactions per second.
From these you can calculate expected maximum TPS per CPU,
current load as a percentage of maximum, queue length,
and response time degradation under load.  Only the last
two involve queueing theory, and even they are just algebra.
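
To make the arithmetic concrete, here is a rough sketch in C of the
per-transfer calculation. None of this is from the patch: the struct and
function names (xfer_sample, xfer_metrics, xfer_compute, xfer_tps) are made
up for illustration, and I am assuming timestamps are kept as seconds in a
double, taken from whatever monotonic clock the module already uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one transfer; times are in seconds from any
 * fixed starting point, as long as all three use the same clock. */
struct xfer_sample {
    double t0;    /* request received */
    double t1;    /* first response sent */
    double t2;    /* last response sent */
    size_t bytes; /* payload bytes transferred */
};

/* Values derived from one sample. */
struct xfer_metrics {
    double latency;       /* t1 - t0 */
    double transfer_time; /* t2 - t1 */
    double response_time; /* latency + transfer_time */
    double throughput;    /* bytes per second, 0 if transfer_time is 0 */
};

static struct xfer_metrics xfer_compute(const struct xfer_sample *s)
{
    struct xfer_metrics m;

    /* The three timestamps must be in order. */
    assert(s->t0 <= s->t1 && s->t1 <= s->t2);

    m.latency = s->t1 - s->t0;
    m.transfer_time = s->t2 - s->t1;
    m.response_time = m.latency + m.transfer_time;
    /* A single-packet reply has t1 == t2; avoid dividing by zero. */
    m.throughput = (m.transfer_time > 0.0)
        ? (double)s->bytes / m.transfer_time
        : 0.0;
    return m;
}

/* TPS: operations completed during an observation interval. */
static double xfer_tps(size_t ops, double interval_seconds)
{
    return (interval_seconds > 0.0)
        ? (double)ops / interval_seconds
        : 0.0;
}
```

For a sample with t0 = 10.0, t1 = 10.25, t2 = 12.25 and 4096 bytes, this
gives a latency of 0.25 s, a transfer time of 2.0 s, a response time of
2.25 s and a throughput of 2048 bytes/s. Keeping latency and transfer time
as separate fields is the point: they get modeled separately.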

Drop me a line if you want any more information...

--dave (at work, but still reading samba-technical) c-b
-- 
David Collier-Brown                 | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
davecb at sun.com                      |                      -- Mark Twain
cell: (647) 833-9377