SMB2 performance is worse than SMB1 while iometer 512byte transfer

Jones jones.kstw at gmail.com
Wed Sep 11 15:43:37 CEST 2013


Hi list,

With Iometer 512byte sequential read and write to 8GB file,
compare the IOps and found that SMB2 is slower than SMB1.
Im trying to figure out the cause with profiling tools,
this is samba-3.6.18,
here is detailed steps:

A. I/O per second and CPU usage
==============================
Althouhg both SMB1 and SMB2 got HIGH CPU load,
SMB2 spent more computing cycles on user-space than kernel-space.

Protocol R/W    IOps    %CPU    %us    %sy   %id   %si
-------------------------------------------------------
SMB1     Read  82231     98%  40.0%  51.0%  2.0%  7.0%
SMB1     Write 77184    100%  40.0%  54.0%  0.0%  6.0%
-------------------------------------------------------
SMB2     Read  23826    100%  72.3%    22.8%  0.0%  5.0%
SMB2     Write  20366    100%   70.3%   24.8%  0.0%  5.0%

B. perf top without debugging symbols loaded
==============================
With SMB2, there are 4 user-space items on the top of kernel-space
ia32_syscall.
With SMB1, kernel-space ia32_syscall is on the top.

"perf top -p <smbd_pid>" SMB2 enabled shows:
    15.50%  libc-2.6.1.so   [.] 0x00055f45
     8.01%  smbd            [.] 0x001eff27
     5.68%  libtalloc.so.2  [.] 0x00002d59
     3.90%  libc-2.6.1.so   [.] vfprintf
     2.69%  [kernel]        [k] ia32_syscall
     1.98%  libc-2.6.1.so   [.] memset
     1.88%  libc-2.6.1.so   [.] strcmp
     1.83%  libc-2.6.1.so   [.] malloc
     1.68%  libc-2.6.1.so   [.] _IO_default_xsputn
     1.56%  smbd            [.] event_add_to_poll_args
     1.31%  libtalloc.so.2  [.] _talloc_zero
     1.29%  libtevent.so.0  [.] 0x0000228a
     1.11%  libc-2.6.1.so   [.] free
     1.09%  libtalloc.so.2  [.] _talloc_free
     1.07%  libtalloc.so.2  [.] talloc_get_name
     0.99%  [e1000e]        [k] e1000_xmit_frame
 ... ...

"perf top -p <smbd_pid>" SMB1 enabled shows:
     9.56%  [kernel]             [k] ia32_syscall
     7.80%  smbd                 [.] 0x0012b633
     3.53%  libc-2.6.1.so        [.] __gettimeofday
     2.28%  libpthread-2.6.1.so  [.] read
     2.23%  libtalloc.so.2       [.] 0x00001913
     2.19%  [kernel]             [k] __ticket_spin_lock
     1.86%  [e1000e]             [k] e1000_xmit_frame
     1.46%  smbd                 [.] reply_read_and_X
     1.37%  libc-2.6.1.so        [.] __poll
     1.34%  [kernel]             [k] tcp_sendmsg
     1.34%  [kernel]             [k] copy_user_generic_string
     1.30%  libc-2.6.1.so        [.] sendfile64
     1.24%  [kernel]             [k] fget_light
     1.21%  libc-2.6.1.so        [.] __fxstat64
     1.09%  libpthread-2.6.1.so  [.] send
     1.04%  libc-2.6.1.so        [.] strcmp
     1.01%  smbd                 [.] event_add_to_poll_args
     0.96%  [kernel]             [k] tcp_recvmsg
 ... ...

C. perf top with debugging symbols loaded
==============================
OK, all debugging symbols to smbd and glibc are loaded.

"perf top -p <smbd_pid>" SMB2 enabled shows:
     6.14%  libc-2.6.1.so   [.] _int_malloc
     4.38%  libc-2.6.1.so   [.] vfprintf
     3.66%  libtalloc.so.2  [.] _talloc_free_internal
     3.07%  libc-2.6.1.so   [.] _int_free
     2.84%  [kernel]        [k] ia32_syscall
     2.05%  libc-2.6.1.so   [.] __find_specmb
     2.02%  libc-2.6.1.so   [.] memset
     1.96%  libc-2.6.1.so   [.] strcmp
     1.95%  libc-2.6.1.so   [.] _IO_default_xsputn
     1.81%  libc-2.6.1.so   [.] malloc
     1.75%  smbd            [.] event_add_to_poll_args
     1.74%  libc-2.6.1.so   [.] malloc_consolidate
     1.47%  libtalloc.so.2  [.] _talloc_zero
     1.22%  libc-2.6.1.so   [.] free
     1.18%  libtalloc.so.2  [.] _talloc_free
     1.06%  libc-2.6.1.so   [.] __gettimeofday
     1.05%  libtalloc.so.2  [.] talloc_get_name
     1.03%  libc-2.6.1.so   [.] mempcpy
     1.01%  libtalloc.so.2  [.] _talloc_get_type_abort
     1.00%  libtalloc.so.2  [.] talloc_alloc_pool
     0.97%  libc-2.6.1.so   [.] _int_realloc
     0.93%  libc-2.6.1.so   [.] do_readv
     0.93%  [e1000e]        [k] e1000_xmit_frame
 ... ...

"perf top -p <smbd_pid>" SMB1 enabled shows:
    10.28%  [kernel]             [k] ia32_syscall
     3.32%  libc-2.6.1.so        [.] __gettimeofday
     2.31%  libpthread-2.6.1.so  [.] __read_nocancel
     2.20%  [kernel]             [k] __ticket_spin_lock
     1.72%  [e1000e]             [k] e1000_xmit_frame
     1.52%  libc-2.6.1.so        [.] __fxstat64@@GLIBC_2.2
     1.39%  smbd                 [.] reply_read_and_X
     1.29%  [kernel]             [k] tcp_sendmsg
     1.27%  libtalloc.so.2       [.] _talloc_free_internal
     1.27%  libpthread-2.6.1.so  [.] __libc_send
     1.24%  libc-2.6.1.so        [.] sendfile64
     1.19%  [kernel]             [k] copy_user_generic_string
     1.18%  libc-2.6.1.so        [.] __poll
     1.13%  smbd                 [.] run_events_poll
     1.03%  libc-2.6.1.so        [.] strcmp
     0.95%  [kernel]             [k] tcp_recvmsg
 ... ...

D. Modified:
==============================
1. lib/tevent/tevent_immediate.ctevent_immediate.c:
   Comment tevent_debug in tevent_common_loop_immediate and
tevent_common_schedule_immediate,
   this ease the pain from vfprintf.
2. source3/smbd/process.c
   Move following 2 lines outside the while(True) loop.
   frame = talloc_stackframe_pool(8192)
   TALLOC_FREE(frame);

E. perf top to compare modified before and after
==============================
After modified,
1. IOps raised up 7000~10000 IOps.
2. %us droppped 5%~7%
3. %sy raised up 8%~10%

Before Modified:
Protocol R/W    IOps    %CPU    %us    %sy   %id   %si
-------------------------------------------------------
SMB2     Read  23826    100%  72.3%  22.8%  0.0%  5.0%
SMB2     Write 20366    100%  70.3% 24.8%  0.0%  5.0%

After Modified:
Protocol R/W    IOps    %CPU    %us    %sy   %id   %si
-------------------------------------------------------
SMB2     Read  33007    100%  66.0% 30.0%  0.0%  4.0%
SMB2     Write 27720    100%  65.0%  34.0%  0.0%  1.0%

F. Question
==============================
The attached file is the output as "perf top -G -p <smbd_pid>",
the user-space call stacks shows many tevent+talloc arhchitecture,
it seems that spent much more computing cycles on _int_malloc and glibc
symbols.
Is it possible to tune this asynchronous architecture for SMB2? timer etc ?
Or this guess put me wrong direction to tune the performance,
any suggestion is appreciated,
thanks.


G. Other lists
==============================
https://lists.samba.org/archive/samba-technical/2013-June/092986.html
I have already tried samba-4.0.5 but still with no luck,
the read and write IOps keeping around 20000 IOps.

https://lists.samba.org/archive/samba/2013-May/173320.html
Since my "perf top" does not show SHA256_Update,
and the singing bits in packet are all zeros (thru wireshark),
so never hit this condition.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: perf_top_g.log
Type: application/octet-stream
Size: 73728 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20130911/c94213f3/attachment-0001.obj>


More information about the samba-technical mailing list