Samba performance

Ravi Wijayaratne ravi_wija at yahoo.com
Mon Mar 31 22:34:33 GMT 2003


Samba Performance testing 
==========================

1.0 Architecture:
----------------- 
Server:
CPU: Intel(R) Pentium(R) III CPU family  1266MHz
Memory: 1GB
Kernel: Linux 2.4.18
File System: xfs-1.1
Samba version: 3.0-alpha19
Network: 1 Gbit point-to-point

Client:
Memory: 512 MB
CPU: 1.6 GHz Pentium

1.1 Introduction:
-----------------

We have been measuring Samba performance. The following
are our observations.

1.2 Is it Samba?
----------------
We wanted to find out for sure whether Samba itself was the
bottleneck, so we ran the following experiments.

1. dbench (to measure disk throughput)
2. tbench (to measure TCP/IP throughput)
3. dbench+tbench:
   In this experiment we wanted to find out whether the system,
   rather than Samba, was the limitation. For each number of
   clients, dbench and tbench were started simultaneously (see
   the sketch after this list).
4. nbench with the clients_oplocks.txt trace (to measure Samba
   throughput)
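
For reference, a rough sketch of how step 3 can be driven is below.
It is only an illustration (not the exact harness we used), and it
assumes dbench and tbench are on the PATH, that a tbench_srv is
already listening on the host named TBENCH_SERVER, and that the
installed versions accept the usual "dbench <nprocs>" and
"tbench <nprocs> <server>" invocations.

/*
 * run_both.c - start dbench and tbench concurrently for one client
 * count and wait for both, as in experiment 3 above.
 * Illustrative sketch only; see the assumptions stated above.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define TBENCH_SERVER "localhost"   /* assumed location of tbench_srv */

int main(int argc, char **argv)
{
    const char *nclients = (argc > 1) ? argv[1] : "16";
    pid_t dpid, tpid;
    int status;

    dpid = fork();
    if (dpid == 0) {                /* child 1: disk load */
        execlp("dbench", "dbench", nclients, (char *)NULL);
        perror("execlp dbench");
        _exit(127);
    }

    tpid = fork();
    if (tpid == 0) {                /* child 2: TCP/IP load */
        execlp("tbench", "tbench", nclients, TBENCH_SERVER, (char *)NULL);
        perror("execlp tbench");
        _exit(127);
    }

    /* wait for both benchmarks to finish before collecting their output */
    waitpid(dpid, &status, 0);
    waitpid(tpid, &status, 0);
    return 0;
}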

The results are as follows (throughput in MB/s):

Num      dbench    tbench    dbench    tbench    min(1,2)  nbench
clients  alone     alone     (simul    (simul
                             tbench)   dbench)
                               (1)       (2)
1         77.152   20.915    77.1373   19.7312   19.7312   11.5006
4        106.174   40.6007   71.2576   33.9155   33.9155   19.3349
8         93.378   56.4977   63.2581   43.745    43.745    19.8468
12        81.908   60.8616   59.0883   43.675    43.675    19.2888
16        56.834   63.6999   52.1449   41.5259   41.5259   19.3474
20        63.398   64.9676   50.9493   41.776    41.776    19.1162
24        61.818   66.6186   50.223    41.8949   41.8949   18.9119
28        55.442   67.3411   49.1058   41.5549   41.5549   19.0702
32        54.318   69.2981   47.8511   41.9139   41.9139   18.8018
36        54.986   70.1524   45.6686   41.3715   41.3715   18.3617
40        46.994   70.8444   45.2621   41.459    41.459    18.2381
44        41.702   69.8389   42.6287   41.0206   41.0206   18.1785
48        45.988   69.8389   40.4743   40.3336   40.3336   18.1683

The nbench experiment measures Samba performance with the same
workload trace used for the other experiments. As can be seen, the
nbench throughput is much smaller than the minimum of (1) and (2),
which implies that Samba is the performance bottleneck. (The disk
configuration for the above experiment was an 11-drive RAID 5 with
LVM.)

1.3 Where in Samba, and what is the limitation?
-----------------------------------------------

We observe that our system is severely CPU limited.
Here is a summary of a top -d 1 trace of CPU usage during the
period 16 nbench clients were active (2-drive RAID 0 + LVM):

         User           System         Total
Mean     34.60447761    64.14477612    98.74925373
Median   35.2           63.7           99.9
Stdev     0.070189292    0.076303659    0.06342686

So it seems that more CPU time is spent in system mode than in user
mode. Is this consistent with what was seen in earlier Samba
versions?
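
(As a cross-check on the top numbers, the same user/system split can
be read straight from /proc/stat. The small stand-alone program
below is only an illustration of that measurement; it is not part of
Samba.)

/*
 * cpusplit.c - report user/system/idle CPU percentages over an
 * interval, read from the aggregate "cpu" line of /proc/stat
 * (fields on Linux 2.4: user nice system idle). Illustrative only.
 */
#include <stdio.h>
#include <unistd.h>

struct cpu_sample { unsigned long long user, nice, sys, idle; };

static int read_cpu(struct cpu_sample *s)
{
    int n;
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return -1;
    n = fscanf(f, "cpu %llu %llu %llu %llu",
               &s->user, &s->nice, &s->sys, &s->idle);
    fclose(f);
    return (n == 4) ? 0 : -1;
}

int main(void)
{
    struct cpu_sample a, b;
    unsigned long long user, sys, idle, total;

    if (read_cpu(&a) < 0)
        return 1;
    sleep(1);                       /* sample interval, like top -d 1 */
    if (read_cpu(&b) < 0)
        return 1;

    user  = (b.user + b.nice) - (a.user + a.nice);
    sys   = b.sys  - a.sys;
    idle  = b.idle - a.idle;
    total = user + sys + idle;
    if (total == 0)
        return 1;

    printf("user %.1f%%  system %.1f%%  idle %.1f%%\n",
           100.0 * user / total, 100.0 * sys / total,
           100.0 * idle / total);
    return 0;
}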

Then we used Samba's built-in profiling facility to get some
information about performance-intensive code paths. We discovered
that the time spent on stat calls was excessive: it was more than
the time spent on read or write calls!
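
The tables below give, for each call, the number of calls and the
total/min/max wall-clock time. As an illustration of the kind of
accounting involved, here is a small stand-alone sketch that wraps
stat() the same way; it is hypothetical and is not Samba's actual
profiling code.

/*
 * profile_stat.c - accumulate count/total/min/max wall-clock time
 * for stat() calls, in the spirit of the per-call tables below.
 * Hypothetical sketch; Samba's real profiling uses its own macros.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>

struct prof { unsigned long count; long long total_us, min_us, max_us; };

static long long now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

static int profiled_stat(const char *path, struct stat *st, struct prof *p)
{
    long long start = now_us();
    int ret = stat(path, st);
    long long elapsed = now_us() - start;

    if (p->count == 0 || elapsed < p->min_us)
        p->min_us = elapsed;
    if (elapsed > p->max_us)
        p->max_us = elapsed;
    p->total_us += elapsed;
    p->count++;
    return ret;
}

int main(int argc, char **argv)
{
    struct prof p = {0, 0, 0, 0};
    struct stat st;
    int i;

    /* stat each path given on the command line and print the totals */
    for (i = 1; i < argc; i++)
        profiled_stat(argv[i], &st, &p);

    printf("syscall_stat  calls %lu  total %lld us  min %lld us  max %lld us\n",
           p.count, p.total_us, p.min_us, p.max_us);
    return 0;
}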

Here are the most time-consuming system calls:

Name             num calls   time(us)    Min(us)  Max(us)
-----            ---------   --------    -------  -------
syscall_opendir     189841   36913656          0   396806
syscall_readdir    2329741   40225042          0   312880
syscall_open        194256  150164226          0  1245872
syscall_close       133504   41983747          0   475361
syscall_read        320496   88093084          0   350440
syscall_write       149776   90665926          0   382059
syscall_stat       1335959  145079345          0   336839
syscall_unlink       33520  101113573          0  1132776

Here are the time-consuming Trans2 calls:

Name              num calls   time(us)    Min(us)  Max(us)
-----             ---------   --------    -------  -------
Trans2_findfirst      57184  201725472          0   430785
Trans2_qpathinfo     147536  255836025          0   412576

and the time-consuming SMB calls:

Name              num calls   time(us)    Min(us)  Max(us)
-----             ---------   --------    -------  -------
SMBntcreateX         175984   95263531          0   346844
SMBdskattr            27344   63275572          0   351798
SMBreadX             320496   90593419          0   350444
SMBwriteX            149776   92584721          0   382067
SMBunlink             33520  101522665          0  1132787
SMBclose             133696   66140491          0   475414

and cache statistics are


************************ Statcache *******************************
lookups:                        398768
misses:                         41
hits:                           398727
************************ Writecache ******************************
read_hits:                      0
abutted_writes:                 0
total_writes:                   149776
non_oplock_writes:              149776
direct_writes:                  149776
init_writes:                    0
flushed_writes[SEEK]:           0
flushed_writes[READ]:           0
flushed_writes[WRITE]:          0
flushed_writes[READRAW]:        0
flushed_writes[OPLOCK_RELEASE]: 0
flushed_writes[CLOSE]:          0
flushed_writes[SYNC]:           0
flushed_writes[SIZECHANGE]:     0
num_perfect_writes:             0
num_write_caches:               0
allocated_write_caches:         0

For the above experiment (16 nbench clients, 2-drive RAID 0 + LVM)
we are getting about 21 MBytes/s.

Then we removed the FIND_FIRST and QUERY_PATH_INFORMATION calls from
the clients_oplocks.txt file. With 16 clients, performance improves
by about 6-8 MBytes/s.

Name             num calls   time(us)    Min(us)  Max(us)
-----            ---------   --------    -------  -------
syscall_opendir      83009   18155570          0   306736
syscall_readdir     938078   15806346          0   314394
syscall_open        194256  163721233          0  1682098
syscall_close       133504   50548558          0   905587
syscall_read        320496   91373880          0   319341
syscall_write       149776   94024793          0   345850
syscall_stat        597492   69316075          0   312443
syscall_unlink       33520  101812395          0  1369880


As can be seen, there is a substantial reduction in the stat,
readdir and opendir system call times. However, the split between
CPU user and system time is identical to the previous case.

To dissect the impact of stat we measured the kernel dcache
hit/miss statistics. We see a very high hit rate in the dcache, and
shrink_dcache_memory was not called, indicating that the kernel
memory manager did not run short of pages.

To analyze the FIND_FIRST operation we added further traces to the
call_trans2findfirst() call path. More than 60% of the time is
spent in get_lanman2_dir_entry(), and within get_lanman2_dir_entry()
the majority of the time is spent in vfs_stat() (~46%), with a
further ~28% spent in mask_match() and exact_match().
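
To make that cost structure concrete, the stand-alone sketch below
shows the general shape of such a wildcard directory scan (it is our
own illustration, not the get_lanman2_dir_entry() code): every
candidate entry is matched against the client's mask and then
stat()ed, so one FIND_FIRST over a large directory turns into one
stat per entry.

/*
 * dirscan.c - illustrative shape of a wildcard directory scan: one
 * readdir + match + stat per candidate entry. Not Samba code.
 */
#include <stdio.h>
#include <dirent.h>
#include <fnmatch.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *dir  = (argc > 1) ? argv[1] : ".";
    const char *mask = (argc > 2) ? argv[2] : "*";
    char path[4096];
    struct dirent *de;
    struct stat st;
    unsigned long stats = 0;
    DIR *d = opendir(dir);

    if (!d) {
        perror("opendir");
        return 1;
    }

    while ((de = readdir(d)) != NULL) {
        /* mask_match/exact_match analogue: filter on the client's mask */
        if (fnmatch(mask, de->d_name, 0) != 0)
            continue;
        /* vfs_stat analogue: each matching entry is stat()ed to fill in
         * the size/time/attribute fields returned to the client */
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (stat(path, &st) == 0)
            stats++;
    }
    closedir(d);

    printf("%lu entries matched \"%s\" and were stat()ed\n", stats, mask);
    return 0;
}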

We did kernel profiling of a 60-client netbench run and found that
link_path_walk, d_lookup and kmem_cache_alloc are the functions most
often seen at the timer-interrupt samples, all of them in the
sys_stat call path.


Conclusion:
-----------
We think Samba needs to optimize caching of stat calls. Individual
stat calls (average = 49us) are not the concern; the sheer number of
stat calls is. Significant bandwidth can also be gained by
optimizing the opendir and readdir calls (the directory stream).
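
A minimal sketch of the kind of cache we have in mind is below. It
is hypothetical and deliberately ignores invalidation on writes,
renames and deletes, as well as case-insensitive name handling, all
of which a real change inside Samba would have to deal with: the
idea is simply that repeated stats of the same pathname within a
request become hash lookups instead of system calls.

/*
 * statcache_sketch.c - hypothetical cache of stat results keyed by
 * pathname. Ignores invalidation and case-insensitivity, which a
 * real implementation inside Samba would have to handle.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

#define NBUCKETS 1024

struct entry {
    char *path;
    struct stat st;
    struct entry *next;
};

static struct entry *buckets[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* stat() through the cache: hit -> copy cached result, miss -> real stat */
static int cached_stat(const char *path, struct stat *st)
{
    unsigned h = hash(path);
    struct entry *e;

    for (e = buckets[h]; e; e = e->next) {
        if (strcmp(e->path, path) == 0) {
            *st = e->st;            /* cache hit, no syscall */
            return 0;
        }
    }
    if (stat(path, st) != 0)
        return -1;                  /* do not cache failures here */

    e = malloc(sizeof(*e));
    if (e != NULL && (e->path = strdup(path)) != NULL) {
        e->st = *st;
        e->next = buckets[h];
        buckets[h] = e;
    } else {
        free(e);                    /* allocation failed; skip caching */
    }
    return 0;
}

int main(int argc, char **argv)
{
    struct stat st;
    int i, pass;

    /* stat the same arguments twice; the second pass is served from cache */
    for (pass = 0; pass < 2; pass++)
        for (i = 1; i < argc; i++)
            if (cached_stat(argv[i], &st) == 0)
                printf("pass %d: %s size %lld\n",
                       pass, argv[i], (long long)st.st_size);
    return 0;
}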

Has anybody done this sort of profiling before?
Are these results consistent with what others have seen?
Are there any ongoing attempts to cache stat information?

Some insight in this regard would be much appreciated.

In a future exercise I am hoping to track down why the open call is
so expensive.

 
Thank you
Ravi

