[Samba] Samba4 consumes more CPU

Thiago Fernandes Crepaldi tognado at gmail.com
Wed Oct 2 11:50:38 MDT 2013


Googling for copy_user_generic_unrolled(), the kernel-space function
seen in my previous smbd profile, I found what might be a clue to the
performance drop. It is the comment on line #31 (see below), which says:

31 /*
32 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
33 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
34 * Otherwise, use copy_user_generic_unrolled.
35 */

This makes me guess that on my Atom D2701 (
http://ark.intel.com/products/59683/Intel-Atom-Processor-D2700-1M-Cache-2_13-GHz)
the kernel is not detecting REP_GOOD or ERMS, since it ends up in
copy_user_generic_unrolled. It is not clear to me whether the processor
supports those features, but apparently it does (judging by the
/proc/cpuinfo output from another user's NAS -
http://www.foxnetwork.ru/index.php/en/component/content/article/121-thecus-n4800eco.html
)
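
To check this on the box itself rather than on someone else's NAS,
something like the following could be used. It is only a minimal sketch:
the helper names line_has_flag() and cpuinfo_has_flag() are made up for
illustration, and it assumes the usual x86 /proc/cpuinfo layout, where a
"flags" line lists space-separated feature names such as rep_good and erms.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if `flag` appears in `line` as a whole
 * whitespace-delimited word, 0 otherwise (so "terms" does not match "erms"). */
static int line_has_flag(const char *line, const char *flag)
{
    assert(line != NULL && flag != NULL);
    size_t n = strlen(flag);
    if (n == 0)
        return 0;
    for (const char *p = strstr(line, flag); p != NULL; p = strstr(p + 1, flag)) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';
        if (starts && ends)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: scan a cpuinfo-style file for `flag` on the first
 * "flags" line. Returns 1 if found, 0 if not, -1 if the file cannot be
 * opened. Assumes the flags line fits in the 8 KiB buffer. */
static int cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[8192];
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int found = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            const char *values = strchr(line, ':');
            if (values != NULL)
                found = line_has_flag(values + 1, flag);
            break; /* every core normally reports the same flags */
        }
    }
    fclose(f);
    return found;
}
```

Calling cpuinfo_has_flag("/proc/cpuinfo", "rep_good") and
cpuinfo_has_flag("/proc/cpuinfo", "erms") from a small main() would show
whether the running kernel reports either feature.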

__________________________________________________________________________________________

linux/arch/x86/include/asm/uaccess_64.h

1 #ifndef _ASM_X86_UACCESS_64_H
2 #define _ASM_X86_UACCESS_64_H
3
4 /*
5 * User space memory access functions
6 */
7 #include <linux/compiler.h>
8 #include <linux/errno.h>
9 #include <linux/lockdep.h>
10 #include <asm/alternative.h>
11 #include <asm/cpufeature.h>
12 #include <asm/page.h>
13
14 /*
15 * Copy To/From Userspace
16 */
17
18 /* Handles exceptions in both to and from, but doesn't do access_ok */
19 __must_check unsigned long
20 copy_user_enhanced_fast_string(void *to, const void *from, unsigned len);
21 __must_check unsigned long
22 copy_user_generic_string(void *to, const void *from, unsigned len);
23 __must_check unsigned long
24 copy_user_generic_unrolled(void *to, const void *from, unsigned len);
25
26 static __always_inline __must_check unsigned long
27 copy_user_generic(void *to, const void *from, unsigned len)
28 {
29 unsigned ret;
30
31 /*
32 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
33 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
34 * Otherwise, use copy_user_generic_unrolled.
35 */
36 alternative_call_2(copy_user_generic_unrolled,
37 copy_user_generic_string,
38 X86_FEATURE_REP_GOOD,
39 copy_user_enhanced_fast_string,
40 X86_FEATURE_ERMS,
41 ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
42 "=d" (len)),
43 "1" (to), "2" (from), "3" (len)
44 : "memory", "rcx", "r8", "r9", "r10", "r11");
45 return ret;
46 }
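
Read as plain C, the priority described in that comment amounts to the
sketch below. This is only an illustration: pick_copy_routine() and the
enum are hypothetical names, not kernel code, and the real kernel does not
branch like this on every call. As far as I understand it,
alternative_call_2() has the call site patched once at boot according to
the CPU feature bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the three copy routines declared in the header. */
enum copy_routine {
    COPY_UNROLLED,    /* copy_user_generic_unrolled */
    COPY_REP_STRING,  /* copy_user_generic_string */
    COPY_ERMS_STRING  /* copy_user_enhanced_fast_string */
};

/* Mirror the comment's priority: ERMS first, then REP_GOOD, else unrolled. */
static enum copy_routine pick_copy_routine(int has_rep_good, int has_erms)
{
    if (has_erms)
        return COPY_ERMS_STRING;
    if (has_rep_good)
        return COPY_REP_STRING;
    return COPY_UNROLLED;
}

/* Map the label back to the symbol name that perf would show. */
static const char *copy_routine_name(enum copy_routine r)
{
    switch (r) {
    case COPY_ERMS_STRING: return "copy_user_enhanced_fast_string";
    case COPY_REP_STRING:  return "copy_user_generic_string";
    case COPY_UNROLLED:    return "copy_user_generic_unrolled";
    }
    assert(!"unreachable: unknown copy_routine value");
    return NULL;
}
```

With both features absent, pick_copy_routine(0, 0) yields COPY_UNROLLED,
which matches the symbol showing up in my smbd profile.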


On Tue, Oct 1, 2013 at 6:04 PM, Thiago Fernandes Crepaldi <tognado at gmail.com
> wrote:

> That is funny. Now that I have replaced samba 4 and libc-2.13.so with
> builds that include debug symbols, the perf profile seems to have changed
> a bit after the same tests!
>
> Events: 54K cycles
> -   3.06%  smbd  [kernel.kallsyms]         [k] copy_user_generic_unrolled
>    - copy_user_generic_unrolled
>         52.63% __read_nocancel
>         36.20% __write_nocancel
>         2.70% __getdents64
>         2.44% __libc_readv
>       + 2.00% do_fcntl
>         0.87% __GI___libc_read
>       + 0.77% __fxstat64
> -   2.02%  smbd  libc-2.13.so              [.] _int_malloc
>    + _int_malloc
> -   1.62%  smbd  [kernel.kallsyms]         [k] kmem_cache_alloc
>    + kmem_cache_alloc
> -   1.22%  smbd  libtalloc.so.2.0.7        [.] _talloc_free
>    + _talloc_free
> -   0.99%  smbd  libtalloc.so.2.0.7        [.] _talloc_free_children_internal.isra.4
>    + _talloc_free_children_internal.isra.4
> -   0.86%  smbd  libc-2.13.so              [.] __memcpy_ssse3
>    + __memcpy_ssse3
> +   0.81%  smbd  [kernel.kallsyms]         [k] kmem_cache_free
> +   0.81%  smbd  libc-2.13.so              [.] _int_free
> +   0.79%  smbd  [kernel.kallsyms]         [k] __kmalloc
> +   0.66%  smbd  libtalloc.so.2.0.7        [.] _talloc_zero
> +   0.63%  smbd  [kernel.kallsyms]         [k] link_path_walk
> +   0.63%  smbd  [kernel.kallsyms]         [k] ext4_htree_store_dirent
> +   0.55%  smbd  libtalloc.so.2.0.7        [.] talloc_alloc_pool
> +   0.55%  smbd  libc-2.13.so              [.] __memset_sse2
> +   0.53%  smbd  libc-2.13.so              [.] malloc
> +   0.53%  smbd  [kernel.kallsyms]         [k] fcntl_setlk
> +   0.52%  smbd  [kernel.kallsyms]         [k] get_page_from_freelist
> +   0.50%  smbd  libtalloc.so.2.0.7        [.] talloc_get_name
> +   0.50%  smbd  [kernel.kallsyms]         [k] tg3_start_xmit
> +   0.48%  smbd  [kernel.kallsyms]         [k] memset
> +   0.47%  smbd  libc-2.13.so              [.] free
> +   0.47%  smbd  [kernel.kallsyms]         [k] _raw_spin_lock
> +   0.45%  smbd  [kernel.kallsyms]         [k] __d_lookup_rcu
> +   0.45%  smbd  libc-2.13.so              [.] __GI___strcmp_ssse3
> +   0.44%  smbd  libtalloc.so.2.0.7        [.] _talloc_get_type_abort
> +   0.43%  smbd  [kernel.kallsyms]         [k] system_call_after_swapgs
> +   0.43%  smbd  [kernel.kallsyms]         [k] ext4_mark_iloc_dirty
> +   0.42%  smbd  libtalloc.so.2.0.7        [.] talloc_is_parent
> +   0.41%  smbd  [kernel.kallsyms]         [k] __alloc_skb
> +   0.41%  smbd  [kernel.kallsyms]         [k] __posix_lock_file
> +   0.40%  smbd  [kernel.kallsyms]         [k] __ext4_get_inode_loc
> +   0.39%  smbd  libc-2.13.so              [.] __strlen_sse2
> +   0.39%  smbd  [kernel.kallsyms]         [k] kfree
> +   0.39%  smbd  [kernel.kallsyms]         [k] tcp_recvmsg
> +   0.38%  smbd  libtalloc.so.2.0.7        [.] talloc_named_const
> +   0.37%  smbd  libtalloc.so.2.0.7        [.] _talloc_array
>
>
> On Mon, Sep 30, 2013 at 6:19 PM, Thiago Fernandes Crepaldi <
> tognado at gmail.com> wrote:
>
>> Agreed. For some strange reason I thought perf would "follow" the newly
>> forked smbd processes and account for their data too =)
>>
>> Unfortunately, I don't have the libc symbols (at least for today) to see
>> what is going on there, but here is what I got from the child smbd process
>> on the server side. The client side is a Windows 7 virtual machine running
>> NASPT.
>>
>> Could this result mean that most of the performance drop I am
>> experiencing is due to libc?
>> I've never worked with perf before, but I will still try to resolve those
>> crazy addresses.
>>
>> Events: 45K cycles
>> -   7.37%  smbd  libc-2.13.so              [.] 0x11e465
>>    - 0x7ffab9f2043c
>>         41.73% 0
>>         5.32% 0x1b3fbe0
>>         5.29% 0x2c4dab0
>>         3.60% 0x1b0b130
>>         3.37% 0x1b0b2a0
>>         2.94% 0x1b5af80
>>         2.70% 0x1b0d850
>>         2.64% 0x2825fb0
>>         1.86% 0x28e06d0
>>         1.83% 0x2afcc80
>>         1.71% 0x1b2ccb0
>>         1.64% 0x2a4deb0
>>         1.63% 0x1b56e00
>>         1.51% 0x1b6bd00
>>         1.16% 0x1b49eb0
>>         1.15% 0x1b506e0
>>         1.13% 0x1b4da00
>>         1.07% 0x1b35100
>>         0.93% 0x1af9050
>>         0.92% 0x2b03680
>>         0.91% 0x2ae21f0
>>         0.90% 0x1b21210
>>         0.89% 0x1b5de80
>>         0.89% 0x1b5aa80
>>         0.89% 0x1b2e0e0
>>         0.88% 0x1b59be0
>>         0.87% 0x1b4c600
>>         0.86% 0x1b2aa20
>>         0.85% 0x1b4a940
>>         0.85% 0x1b45f50
>>         0.84% 0x1b4a6d0
>>         0.84% 0x1b23940
>>         0.82% 0x1b37210
>>         0.82% 0x1b2cf30
>>         0.82% 0x1b33320
>>         0.77% 0x2c96d50
>>         0.76% 0x202f380
>>         0.75% 0x2bd0bd0
>>         0.66% 0x1b5e1d0
>>    - 0x7ffab9f27e10
>>         37.72% 0x2f62696c2f3365
>>       + 23.78% 0
>>       + 11.24% 0x7fffc9f76d40
>>       + 6.25% set_unix_security_ctx
>>         3.13% 0x645f6e656b6f74
>>         2.46% 0x1000900000000
>>       + 2.17% 0x11b9f22aac
>>         2.16% 0x1b53000
>>       + 2.12% 0x2a29850
>>         2.08% 0xbe70f000004c4c
>>         2.01% 0x1b0af00
>>         1.94% 0x1b07390
>>         1.51% 0x1b49b00
>>         1.41% 0x2010
>>    - 0x7ffab9fc6c10
>>       + 18.08% 0
>>       + 13.63% 0x2c5fc20
>>       + 11.62% 0x2be7b10
>>       + 7.90% 0x2be8560
>>       + 6.61% 0x2a29850
>>       + 6.30% 0x2b3d6c0
>>         5.67% 0x4e6f5479706f43
>>       + 5.64% 0x29d7110
>>       + 5.54% 0x2467130
>>       + 5.53% 0x2b3d5e0
>>       + 5.31% 0x28c81a0
>>       + 4.20% 0x2c5fa30
>>       + 3.98% 0x2a98990
>>    + 0x7ffab9f20438
>>    + 0x7ffab9f2045c
>>      0x7ffab9fc8e03
>>    + 0x7ffab9fc425e
>>    + 0x7ffab9f2a715
>>    + 0x7ffab9f2a6d0
>>      0x7ffab9f1f851
>>      0x7ffab9f1f2ac
>>    + 0x7ffab9f27e25
>>    + 0x7ffab9f2a648
>>    + 0x7ffab9fc4240
>>      0x7ffab9fc8654
>>      0x7ffab9f206bf
>>    + 0x7ffab9f20548
>>    + 0x7ffab9f20bc2
>>    + 0x7ffab9f1f130
>>    + 0x7ffab9f26310
>>    + 0x7ffab9f20422
>>      0x7ffab9f1e0db
>>      0x7ffab9f1f179
>>    + 0x7ffab9f2a6f2
>>    + 0x7ffab9f20572
>>    + 0x7ffab9f2054c
>>    + 0x7ffab9fc42c5
>> -   1.72%  smbd  [kernel.kallsyms]         [k] kmem_cache_alloc
>>    + kmem_cache_alloc
>> -   1.30%  smbd  libtalloc.so.2.0.7        [.] _talloc_free
>>    + _talloc_free
>> -   1.10%  smbd  libtalloc.so.2.0.7        [.] _talloc_free_children_internal.i
>>    + _talloc_free_children_internal.isra.4
>> -   1.07%  smbd  [kernel.kallsyms]         [k] copy_user_generic_unrolled
>>    + copy_user_generic_unrolled
>> -   0.95%  smbd  [kernel.kallsyms]         [k] __kmalloc
>>    + __kmalloc
>> -   0.78%  smbd  [kernel.kallsyms]         [k] ext4_htree_store_dirent
>>    + ext4_htree_store_dirent
>>    + 0x7ffab9f4f2f5
>> -   0.73%  smbd  [kernel.kallsyms]         [k] kmem_cache_free
>>    + kmem_cache_free
>> -   0.73%  smbd  [kernel.kallsyms]         [k] link_path_walk
>>    + link_path_walk
>> -   0.69%  smbd  libc-2.13.so              [.] malloc
>>    + malloc
>> -   0.69%  smbd  libtalloc.so.2.0.7        [.] _talloc_zero
>>    + _talloc_zero
>> -   0.62%  smbd  [kernel.kallsyms]         [k] fcntl_setlk
>>    + fcntl_setlk
>>    + 0x7ffabcf93238
>> -   0.59%  smbd  [kernel.kallsyms]         [k] __d_lookup_rcu
>>    + __d_lookup_rcu
>> -   0.57%  smbd  libtalloc.so.2.0.7        [.] talloc_alloc_pool
>>    + talloc_alloc_pool
>> -   0.55%  smbd  libtalloc.so.2.0.7        [.] talloc_get_name
>>    + talloc_get_name
>> -   0.55%  smbd  [kernel.kallsyms]         [k] __posix_lock_file
>>    + __posix_lock_file
>>    + 0x7ffabcf93238
>> -   0.50%  smbd  [kernel.kallsyms]         [k] _raw_spin_lock
>>    + _raw_spin_lock
>> +   0.49%  smbd  [kernel.kallsyms]         [k] tg3_start_xmit
>> +   0.48%  smbd  [kernel.kallsyms]         [k] system_call_after_swapgs
>> +   0.46%  smbd  libtalloc.so.2.0.7        [.] talloc_named_const
>> +   0.46%  smbd  [kernel.kallsyms]         [k] memset
>> +   0.46%  smbd  libtalloc.so.2.0.7        [.] _talloc_get_type_abort
>> +   0.45%  smbd  [kernel.kallsyms]         [k] str2hashbuf_signed
>> +   0.45%  smbd  [kernel.kallsyms]         [k] kfree
>> +   0.45%  smbd  libc-2.13.so              [.] free
>> +   0.44%  smbd  [kernel.kallsyms]         [k] __alloc_skb
>> +   0.42%  smbd  libtalloc.so.2.0.7        [.] talloc_is_parent
>> +   0.41%  smbd  libtalloc.so.2.0.7        [.] _talloc_array
>>
>>
>>
>>
>>
>> On Mon, Sep 30, 2013 at 5:39 PM, Jeremy Allison <jra at samba.org> wrote:
>>
>>> On Mon, Sep 30, 2013 at 05:21:44PM -0300, Thiago Fernandes Crepaldi
>>> wrote:
>>> > Andrew, in my company we are also seeing higher CPU usage with
>>> > Samba 4 (smbd) compared to Samba 3.
>>> >
>>> > In fact, it almost reaches 100% of CPU and uses all the memory during
>>> > *dir copies* (individual file copy is as good as Samba 3's). I strongly
>>> > believe that this CPU usage is responsible for Samba 4's worse
>>> > throughput compared to Samba 3 in our tests.
>>> >
>>> > Given that, I would like to contribute to this investigation and
>>> > share my perf profiling data for smbd (parent process):
>>> >
>>> > Events: 7  cycles
>>> > -  90.01%  smbd  [kernel.kallsyms]  [k] copy_pte_range
>>> >      copy_pte_range
>>> >      __libc_fork
>>> >      smbd_accept_connection
>>> > -   9.77%  smbd  [kernel.kallsyms]  [k] handle_edge_irq
>>> >      handle_edge_irq
>>> >      smbd_accept_connection
>>> > -   0.22%  smbd  [kernel.kallsyms]  [k] perf_pmu_rotate_start.isra.57
>>> >      perf_pmu_rotate_start.isra.57
>>> >      __poll
>>> > -   0.00%  smbd  [kernel.kallsyms]  [k] native_write_msr_safe
>>> >      native_write_msr_safe
>>> >      __poll
>>>
>>> It's the client process that should have the interesting
>>> profile data, the parent is just going to sit there doing
>>> accept().
>>>
>>> Jeremy.
>>>
>>
>>
>>
>> --
>> Thiago Crepaldi
>>
>
>
>
> --
> Thiago Crepaldi
>



-- 
Thiago Crepaldi

