Samba performance

Ravi Wijayaratne ravi_wija at yahoo.com
Mon Mar 31 23:18:13 GMT 2003


Jeremy,

I apologise for the format hassle. Hope this works.

Cheers
Ravi
> Please resend with a mailer that doesn't wrap at 80 columns :-).
> 
> Jeremy.

Samba Performance testing 
==========================

1.0 Architecture:
----------------- 
Server:
CPU: Intel(R) Pentium(R) III CPU family  1266MHz
Memory: 1GB
Kernel: Linux 2.4.18
File System: xfs-1.1
Samba version: 3.0-alpha19
Network: 1 Gbit point-to-point link

Client: 
512 MB memory and a 1.6 GHz Pentium

1.1 Introduction:
-----------------

We have been measuring Samba performance. The following are our
observations.

1.2 Is it samba ?
-----------------
We wanted to find out for sure whether Samba itself was the
bottleneck, so we ran the following experiments.

1. dbench (to measure disk throughput, TP)
2. tbench (to measure TCP/IP TP)
3. dbench + tbench:
   In this experiment we wanted to find out whether the system,
   not Samba, was the limitation. For each client count, dbench
   and tbench were started simultaneously.
4. nbench with the clients_oplocks.txt trace (to measure Samba TP)

The results are as follows (throughput in MB/s):

Num   dbench   tbench   dbench    tbench    min(1,2)  nbench
clis  alone    alone    (simul    (simul
                        tbench)   dbench)
                          (1)       (2)
1    77.152   20.915    77.1373   19.7312  19.7312   11.5006
4   106.174   40.6007   71.2576   33.9155  33.9155   19.3349
8    93.378   56.4977   63.2581   43.745   43.745    19.8468
12   81.908   60.8616   59.0883   43.675   43.675    19.2888
16   56.834   63.6999   52.1449   41.525   41.525    19.3474
20   63.398   64.967    50.9493   41.776   41.776    19.1162
24   61.818   66.6186   50.223    41.8949  41.8949   18.9119
28   55.442   67.3411   49.1058   41.5549  41.5549   19.0702
32   54.318   69.2981   47.8511   41.9139  41.9139   18.8018
36   54.986   70.1524   45.6686   41.3715  41.3715   18.3617
40   46.994   70.8444   45.2621   41.459   41.459    18.2381
44   41.702   69.8389   42.6287   41.0206  41.0206   18.1785
48   45.988   69.8389   40.4743   40.3336  40.3336   18.1683

The nbench experiment measures Samba performance using the same
workload trace as the other experiments. As can be seen, the nbench
TP is much smaller than the minimum of (1) and (2), which implies
that Samba is the performance bottleneck. For example, at 16 clients
min(52.1, 41.5) = 41.5 MB/s, yet nbench achieves only 19.3 MB/s,
less than half of what the disk and the network can sustain
simultaneously. (The disk configuration for this experiment was an
11-drive RAID 5 with LVM.)

1.3 Where in Samba is the limitation, and what is it?
------------------------------------------------

We observe that our system is severely CPU limited.
Here is a summary of a "top -d 1" trace of CPU usage (in percent)
while 16 nbench clients were active (2-drive RAID 0 + LVM).

        User           System         Total
Mean    34.60447761    64.14477612    98.74925373
Median  35.2           63.7           99.9
Stdev   0.070189292    0.076303659    0.06342686

So it seems that most of the CPU time is spent in the system.
Is this consistent with what was seen in earlier Samba versions?
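
(For reference, the user/system split can also be sampled directly
from /proc/stat rather than by scraping top. The sketch below is a
minimal illustration assuming the four-field "cpu" line of a Linux
2.4 /proc/stat; it is not the tool used for the numbers above.)

/* cpusplit.c: sample the user/system CPU split from /proc/stat.
 * Minimal sketch assuming the Linux 2.4 "cpu user nice system idle"
 * line; not the tool used for the measurements above. */
#include <stdio.h>
#include <unistd.h>

static int read_cpu(unsigned long long v[4])
{
    FILE *f = fopen("/proc/stat", "r");
    int n;

    if (!f)
        return -1;
    n = fscanf(f, "cpu %llu %llu %llu %llu", &v[0], &v[1], &v[2], &v[3]);
    fclose(f);
    return (n == 4) ? 0 : -1;
}

int main(void)
{
    unsigned long long a[4], b[4];
    unsigned long long user, sys, idle, total;
    int i;

    if (read_cpu(a) < 0)
        return 1;
    for (;;) {
        sleep(1);
        if (read_cpu(b) < 0)
            return 1;
        user  = (b[0] + b[1]) - (a[0] + a[1]);   /* user + nice jiffies */
        sys   = b[2] - a[2];
        idle  = b[3] - a[3];
        total = user + sys + idle;
        if (total)
            printf("user %5.1f%%  system %5.1f%%\n",
                   100.0 * user / total, 100.0 * sys / total);
        for (i = 0; i < 4; i++)
            a[i] = b[i];
    }
    return 0;
}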

Then we used Samba's built-in profiling facility to get some
information about performance-intensive code paths.
We discovered that the time spent in stat calls was excessive:
more than the time spent in read or write calls!
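
(For context on the tables below: each profiling entry records a
call count, the accumulated wall-clock time, and the minimum and
maximum time per call. Conceptually a wrapped call updates its
counter along the lines of the following sketch; this is a
simplified illustration, not Samba's actual profiling macros.)

/* Simplified illustration of what one profiling entry records: a
 * call count, the total elapsed wall-clock time, and the fastest
 * and slowest call seen.  Samba's real profiling code is different;
 * this only shows what the columns in the tables below mean. */
#include <sys/time.h>
#include <sys/stat.h>

struct prof_counter {
    unsigned long count;          /* number of calls              */
    unsigned long long time_us;   /* accumulated elapsed time, us */
    unsigned long min_us;         /* fastest observed call        */
    unsigned long max_us;         /* slowest observed call        */
};

static struct prof_counter prof_syscall_stat = { 0, 0, ~0UL, 0 };

int profiled_stat(const char *path, struct stat *st)
{
    struct timeval t1, t2;
    long us;
    int ret;

    gettimeofday(&t1, NULL);
    ret = stat(path, st);
    gettimeofday(&t2, NULL);

    us = (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_usec - t1.tv_usec);
    prof_syscall_stat.count++;
    prof_syscall_stat.time_us += us;
    if ((unsigned long)us < prof_syscall_stat.min_us)
        prof_syscall_stat.min_us = us;
    if ((unsigned long)us > prof_syscall_stat.max_us)
        prof_syscall_stat.max_us = us;
    return ret;
}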

Here are the time-consuming system calls:
Name            num calls  time(us)    Min(us)  Max(us)
----            ---------  --------    -------  -------
syscall_opendir 189841  36913656        0       396806
syscall_readdir 2329741 40225042        0       312880
syscall_open    194256  150164226       0      1245872
syscall_close   133504  41983747        0       475361
syscall_read    320496  88093084        0       350440
syscall_write   149776  90665926        0       382059
syscall_stat    1335959 145079345       0       336839
syscall_unlink  33520   101113573       0      1132776

Here are the time-consuming Trans2 calls:

Trans2_findfirst        57184   201725472       0     430785
Trans2_qpathinfo        147536  255836025       0     412576

and the time-consuming SMB calls:
SMBntcreateX    175984  95263531        0       346844
SMBdskattr      27344   63275572        0       351798
SMBreadX        320496  90593419        0       350444
SMBwriteX       149776  92584721        0       382067
SMBunlink       33520   101522665       0      1132787
SMBclose        133696  66140491        0       475414

and cache statistics are


************************ Statcache *******************************
lookups:                        398768
misses:                         41
hits:                           398727
************************ Writecache ******************************
read_hits:                      0
abutted_writes:                 0
total_writes:                   149776
non_oplock_writes:              149776
direct_writes:                  149776
init_writes:                    0
flushed_writes[SEEK]:           0
flushed_writes[READ]:           0
flushed_writes[WRITE]:          0
flushed_writes[READRAW]:        0
flushed_writes[OPLOCK_RELEASE]: 0
flushed_writes[CLOSE]:          0
flushed_writes[SYNC]:           0
flushed_writes[SIZECHANGE]:     0
num_perfect_writes:             0
num_write_caches:               0
allocated_write_caches:         0

For the above experiment (16 nbench clients, 2-drive RAID 0 + LVM)
I am getting about 21 MBytes/s.

Then we removed the FIND_FIRST and QUERY_PATH_INFORMATION calls
from the clients_oplocks.txt file. Performance improves by about
6-8 MBytes/s for 16 clients.

Name            num calls  time(us)    Min(us)  Max(us)
----            ---------  --------    -------  -------
syscall_opendir 83009   18155570        0       306736
syscall_readdir 938078  15806346        0       314394
syscall_open    194256  163721233       0      1682098
syscall_close   133504  50548558        0       905587
syscall_read    320496  91373880        0       319341
syscall_write   149776  94024793        0       345850
syscall_stat    597492  69316075        0       312443
syscall_unlink  33520   101812395       0      1369880


As can be seen there is a substantial reduction in the stat, readdir
and opendir system call times. However, the CPU user/system time
distribution is almost identical to the previous case.

To dissect the impact of stat we measured the kernel dcache hit/miss
statistics. We see a very high hit rate in the dcache.
shrink_dcache_memory was not called, indicating that the kernel MM
did not run short of pages.

To analyze the FIND_FIRST operation we put further traces in the
call_trans2findfirst() call path. More than 60% of the time is spent
in the get_lanman2_dir_entry() call, and inside get_lanman2_dir_entry()
the majority of the time is spent in the vfs_stat call (~46%), with
~28% spent in the mask_match and exact_match calls; a simplified
sketch of that per-entry loop follows below.
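
(To make the shape of that loop concrete, here is a hypothetical
sketch of a FIND_FIRST-style directory scan. It is not Samba's
get_lanman2_dir_entry(), and it uses fnmatch() as a stand-in for
mask_match/exact_match, but it shows why every returned entry costs
a readdir, a pattern match and a stat.)

/* Simplified sketch of a FIND_FIRST-style directory scan.
 * Not Samba's get_lanman2_dir_entry(); it only illustrates why
 * the per-entry stat() dominates: every candidate name that
 * survives the wildcard match costs one stat() call. */
#include <stdio.h>
#include <dirent.h>
#include <fnmatch.h>
#include <sys/stat.h>

int scan_directory(const char *dirpath, const char *mask)
{
    DIR *dir = opendir(dirpath);
    struct dirent *de;
    struct stat st;
    char path[4096];
    int matched = 0;

    if (!dir)
        return -1;

    while ((de = readdir(dir)) != NULL) {
        /* cheap: wildcard match against the client-supplied mask */
        if (fnmatch(mask, de->d_name, 0) != 0)
            continue;

        /* expensive: one stat per matching entry to fill in the
         * size/time/attribute fields the SMB reply needs */
        snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
        if (stat(path, &st) == 0) {
            printf("%-30s %10ld bytes\n", de->d_name, (long)st.st_size);
            matched++;
        }
    }
    closedir(dir);
    return matched;
}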

We also profiled the kernel during a 60-client netbench run and
found that link_path_walk, d_lookup and kmem_cache_alloc are the
functions most often being executed when the timer interrupt fires,
all of them in the sys_stat call path.


Conclusion:
-----------
We think Samba needs better caching of stat results. Individual stat
calls (average = ~49us) are not the concern; the sheer number of
stat calls is. Significant bandwidth can also be gained by optimizing
the opendir and readdir calls (the directory stream). A rough sketch
of the kind of stat caching we have in mind follows.
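
(As a rough illustration of that idea, here is a hypothetical sketch
of a user-space stat cache keyed on pathname. This is not existing
Samba code, and invalidation on writes, renames and unlinks is
deliberately ignored to keep the idea visible.)

/* Hypothetical sketch of a user-space stat cache keyed on pathname.
 * Not existing Samba code; cache invalidation is ignored here. */
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define STAT_CACHE_BUCKETS 1024

struct stat_cache_entry {
    char *path;
    struct stat st;
    struct stat_cache_entry *next;
};

static struct stat_cache_entry *stat_cache[STAT_CACHE_BUCKETS];

static unsigned int hash_path(const char *p)
{
    unsigned int h = 5381;
    while (*p)
        h = h * 33 + (unsigned char)*p++;
    return h % STAT_CACHE_BUCKETS;
}

/* Same interface as stat(2), but only the first call for a given
 * path reaches the kernel; repeats are served from the cache. */
int cached_stat(const char *path, struct stat *st)
{
    unsigned int h = hash_path(path);
    struct stat_cache_entry *e;

    for (e = stat_cache[h]; e; e = e->next) {
        if (strcmp(e->path, path) == 0) {
            *st = e->st;                 /* cache hit: no syscall */
            return 0;
        }
    }

    if (stat(path, st) != 0)             /* cache miss: one real stat() */
        return -1;

    e = malloc(sizeof(*e));
    if (e) {
        e->path = strdup(path);
        if (e->path) {
            e->st = *st;
            e->next = stat_cache[h];
            stat_cache[h] = e;
        } else {
            free(e);
        }
    }
    return 0;
}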

Has anybody done this sort of profiling before?
Do these results look consistent / make sense?
Are there any ongoing attempts to cache stat information
in user or kernel space?

Any insights in this regard would be much appreciated.

I am hoping to track down why the open call is so
expensive in a future exercise.

 
Thank you
Ravi




=====
------------------------------
Ravi Wijayaratne




