[Samba] Samba4 consumes more CPU
Thiago Fernandes Crepaldi
tognado at gmail.com
Wed Oct 2 11:50:38 MDT 2013
Googling around copy_user_generic_unrolled() - a kernel space function -
seen in my previous smbd profiling, I found what might be a clue for the
performance drop. It is a comment on line #31 (see below) that says:
31 /*
32 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
33 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
34 * Otherwise, use copy_user_generic_unrolled.
35 */
Which makes me guess that my Atom D2701 (
http://ark.intel.com/products/59683/Intel-Atom-Processor-D2700-1M-Cache-2_13-GHz)
is not compiled with REP_GOOD nor ERMS. It is not clear to me if the
processor does support those features, but apparently it does (looking at
/proc/cpuinfo from another user's NAS -
http://www.foxnetwork.ru/index.php/en/component/content/article/121-thecus-n4800eco.html
)
__________________________________________________________________________________________
linux/arch/x86/include/asm/uaccess_64.h
Toggle line number - Style:
1 #ifndef _ASM_X86_UACCESS_64_H
2 #define _ASM_X86_UACCESS_64_H
3
4 /*
5 * User space memory access functions
6 */
7 #include <linux/compiler.h>
8 #include <linux/errno.h>
9 #include <linux/lockdep.h>
10 #include <asm/alternative.h>
11 #include <asm/cpufeature.h>
12 #include <asm/page.h>
13
14 /*
15 * Copy To/From Userspace
16 */
17
18 /* Handles exceptions in both to and from, but doesn't do access_ok */
19 __must_check unsigned long
20 copy_user_enhanced_fast_string(void *to, const void *from, unsigned len);
21 __must_check unsigned long
22 copy_user_generic_string(void *to, const void *from, unsigned len);
23 __must_check unsigned long
24 copy_user_generic_unrolled(void *to, const void *from, unsigned len);
25
26 static __always_inline __must_check unsigned long
27 copy_user_generic(void *to, const void *from, unsigned len)
28 {
29 unsigned ret;
30
31 /*
32 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
33 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
34 * Otherwise, use copy_user_generic_unrolled.
35 */
36 alternative_call_2(copy_user_generic_unrolled,
37 copy_user_generic_string,
38 X86_FEATURE_REP_GOOD,
39 copy_user_enhanced_fast_string,
40 X86_FEATURE_ERMS,
41 ASM_OUTPUT2(""=a"" (ret), ""=D"" (to), ""=S"" (from),
42 ""=d"" (len)),
43 ""1"" (to), ""2"" (from), ""3"" (len)
44 : ""memory"", ""rcx"", ""r8"", ""r9"", ""r10"", ""r11"");
45 return ret;
46 }
On Tue, Oct 1, 2013 at 6:04 PM, Thiago Fernandes Crepaldi <tognado at gmail.com
> wrote:
> That is funny. Now that I replaced samba 4 and libc-2.13.so with debug
> symbols, the perf profile seems to be have changed a bit after the same
> tests !
>
> Events: 54K cycles
> - 3.06% smbd [kernel.kallsyms] [k] copy_user_generic_unrolled
> - copy_user_generic_unrolled
> 52.63% __read_nocancel
> 36.20% __write_nocancel
> 2.70% __getdents64
> 2.44% __libc_readv
> + 2.00% do_fcntl
> 0.87% __GI___libc_read
> + 0.77% __fxstat64
> - 2.02% smbd libc-2.13.so [.] _int_malloc
> + _int_malloc
> - 1.62% smbd [kernel.kallsyms] [k] kmem_cache_alloc
> + kmem_cache_alloc
> - 1.22% smbd libtalloc.so.2.0.7 [.] _talloc_free
> + _talloc_free
> - 0.99% smbd libtalloc.so.2.0.7 [.]
> _talloc_free_children_internal.isra.4
> + _talloc_free_children_internal.isra.4
> - 0.86% smbd libc-2.13.so [.] __memcpy_ssse3
> + __memcpy_ssse3
> + 0.81% smbd [kernel.kallsyms] [k] kmem_cache_free
> + 0.81% smbd libc-2.13.so [.] _int_free
> + 0.79% smbd [kernel.kallsyms] [k] __kmalloc
> + 0.66% smbd libtalloc.so.2.0.7 [.] _talloc_zero
> + 0.63% smbd [kernel.kallsyms] [k] link_path_walk
> + 0.63% smbd [kernel.kallsyms] [k] ext4_htree_store_dirent
> + 0.55% smbd libtalloc.so.2.0.7 [.] talloc_alloc_pool
> + 0.55% smbd libc-2.13.so [.] __memset_sse2
> + 0.53% smbd libc-2.13.so [.] malloc
> + 0.53% smbd [kernel.kallsyms] [k] fcntl_setlk
> + 0.52% smbd [kernel.kallsyms] [k] get_page_from_freelist
> + 0.50% smbd libtalloc.so.2.0.7 [.] talloc_get_name
> + 0.50% smbd [kernel.kallsyms] [k] tg3_start_xmit
> + 0.48% smbd [kernel.kallsyms] [k] memset
> + 0.47% smbd libc-2.13.so [.] free
> + 0.47% smbd [kernel.kallsyms] [k] _raw_spin_lock
> + 0.45% smbd [kernel.kallsyms] [k] __d_lookup_rcu
> + 0.45% smbd libc-2.13.so [.] __GI___strcmp_ssse3
> + 0.44% smbd libtalloc.so.2.0.7 [.] _talloc_get_type_abort
> + 0.43% smbd [kernel.kallsyms] [k] system_call_after_swapgs
> + 0.43% smbd [kernel.kallsyms] [k] ext4_mark_iloc_dirty
> + 0.42% smbd libtalloc.so.2.0.7 [.] talloc_is_parent
> + 0.41% smbd [kernel.kallsyms] [k] __alloc_skb
> + 0.41% smbd [kernel.kallsyms] [k] __posix_lock_file
> + 0.40% smbd [kernel.kallsyms] [k] __ext4_get_inode_loc
> + 0.39% smbd libc-2.13.so [.] __strlen_sse2
> + 0.39% smbd [kernel.kallsyms] [k] kfree
> + 0.39% smbd [kernel.kallsyms] [k] tcp_recvmsg
> + 0.38% smbd libtalloc.so.2.0.7 [.] talloc_named_const
> + 0.37% smbd libtalloc.so.2.0.7 [.] _talloc_array
>
>
> On Mon, Sep 30, 2013 at 6:19 PM, Thiago Fernandes Crepaldi <
> tognado at gmail.com> wrote:
>
>> Agreed. For some strange reason I though perf would "follow" the new smbd
>> forked and account their data too =)
>>
>> Unfortunately, I don't have the libc symbols (at least for today) to see
>> what is going on there, but here is what I got in the child smbd process on
>> the server side. The client side is a Windows 7 Virtual machine running
>> NASPT
>>
>> Could this result mean that most of the time the performance drop I am
>> experiencing is due to libc ?
>> I've never worked with perf before, but I will still try to resolve those
>> crazy addresses
>>
>> Events: 45K cycles
>> - 7.37% smbd libc-2.13.so [.] 0x11e465
>> - 0x7ffab9f2043c
>> 41.73% 0
>> 5.32% 0x1b3fbe0
>> 5.29% 0x2c4dab0
>> 3.60% 0x1b0b130
>> 3.37% 0x1b0b2a0
>> 2.94% 0x1b5af80
>> 2.70% 0x1b0d850
>> 2.64% 0x2825fb0
>> 1.86% 0x28e06d0
>> 1.83% 0x2afcc80
>> 1.71% 0x1b2ccb0
>> 1.64% 0x2a4deb0
>> 1.63% 0x1b56e00
>> 1.51% 0x1b6bd00
>> 1.16% 0x1b49eb0
>> 1.15% 0x1b506e0
>> 1.13% 0x1b4da00
>> 1.07% 0x1b35100
>> 0.93% 0x1af9050
>> 0.92% 0x2b03680
>> 0.91% 0x2ae21f0
>> 0.90% 0x1b21210
>> 0.89% 0x1b5de80
>> 0.89% 0x1b5aa80
>> 0.89% 0x1b2e0e0
>> 0.88% 0x1b59be0
>> 0.87% 0x1b4c600
>> 0.86% 0x1b2aa20
>> 0.85% 0x1b4a940
>> 0.85% 0x1b45f50
>> 0.84% 0x1b4a6d0
>> 0.84% 0x1b23940
>> 0.82% 0x1b37210
>> 0.82% 0x1b2cf30
>> 0.82% 0x1b33320
>> 0.77% 0x2c96d50
>> 0.76% 0x202f380
>> 0.75% 0x2bd0bd0
>> 0.66% 0x1b5e1d0
>> - 0x7ffab9f27e10
>> 37.72% 0x2f62696c2f3365
>> + 23.78% 0
>> + 11.24% 0x7fffc9f76d40
>> + 6.25% set_unix_security_ctx
>> 3.13% 0x645f6e656b6f74
>> 2.46% 0x1000900000000
>> + 2.17% 0x11b9f22aac
>> 2.16% 0x1b53000
>> + 2.12% 0x2a29850
>> 2.08% 0xbe70f000004c4c
>> 2.01% 0x1b0af00
>> 1.94% 0x1b07390
>> 1.51% 0x1b49b00
>> 1.41% 0x2010
>> - 0x7ffab9fc6c10
>> + 18.08% 0
>> + 13.63% 0x2c5fc20
>> + 11.62% 0x2be7b10
>> + 7.90% 0x2be8560
>> + 6.61% 0x2a29850
>> + 6.30% 0x2b3d6c0
>> 5.67% 0x4e6f5479706f43
>> + 5.64% 0x29d7110
>> + 5.54% 0x2467130
>> + 5.53% 0x2b3d5e0
>> + 5.31% 0x28c81a0
>> + 4.20% 0x2c5fa30
>> + 3.98% 0x2a98990
>> + 0x7ffab9f20438
>> + 0x7ffab9f2045c
>> 0x7ffab9fc8e03
>> + 0x7ffab9fc425e
>> + 0x7ffab9f2a715
>> + 0x7ffab9f2a6d0
>> 0x7ffab9f1f851
>> 0x7ffab9f1f2ac
>> + 0x7ffab9f27e25
>> + 0x7ffab9f2a648
>> + 0x7ffab9fc4240
>> 0x7ffab9fc8654
>> 0x7ffab9f206bf
>> + 0x7ffab9f20548
>> + 0x7ffab9f20bc2
>> + 0x7ffab9f1f130
>> + 0x7ffab9f26310
>> + 0x7ffab9f20422
>> 0x7ffab9f1e0db
>> 0x7ffab9f1f179
>> + 0x7ffab9f2a6f2
>> + 0x7ffab9f20572
>> + 0x7ffab9f2054c
>> + 0x7ffab9fc42c5
>> - 1.72% smbd [kernel.kallsyms] [k] kmem_cache_alloc
>> + kmem_cache_alloc
>> - 1.30% smbd libtalloc.so.2.0.7 [.] _talloc_free
>> + _talloc_free
>> - 1.10% smbd libtalloc.so.2.0.7 [.]
>> _talloc_free_children_internal.i
>> + _talloc_free_children_internal.isra.4
>> - 1.07% smbd [kernel.kallsyms] [k] copy_user_generic_unrolled
>> + copy_user_generic_unrolled
>> - 0.95% smbd [kernel.kallsyms] [k] __kmalloc
>> + __kmalloc
>> - 0.78% smbd [kernel.kallsyms] [k] ext4_htree_store_dirent
>> + ext4_htree_store_dirent
>> + 0x7ffab9f4f2f5
>> - 0.73% smbd [kernel.kallsyms] [k] kmem_cache_free
>> + kmem_cache_free
>> - 0.73% smbd [kernel.kallsyms] [k] link_path_walk
>> + link_path_walk
>> - 0.69% smbd libc-2.13.so [.] malloc
>> + malloc
>> - 0.69% smbd libtalloc.so.2.0.7 [.] _talloc_zero
>> + _talloc_zero
>> - 0.62% smbd [kernel.kallsyms] [k] fcntl_setlk
>> + fcntl_setlk
>> + 0x7ffabcf93238
>> - 0.59% smbd [kernel.kallsyms] [k] __d_lookup_rcu
>> + __d_lookup_rcu
>> - 0.57% smbd libtalloc.so.2.0.7 [.] talloc_alloc_pool
>> + talloc_alloc_pool
>> - 0.55% smbd libtalloc.so.2.0.7 [.] talloc_get_name
>> + talloc_get_name
>> - 0.55% smbd [kernel.kallsyms] [k] __posix_lock_file
>> + __posix_lock_file
>> + 0x7ffabcf93238
>> - 0.50% smbd [kernel.kallsyms] [k] _raw_spin_lock
>> + _raw_spin_lock
>> + 0.49% smbd [kernel.kallsyms] [k] tg3_start_xmit
>> + 0.48% smbd [kernel.kallsyms] [k] system_call_after_swapgs
>> + 0.46% smbd libtalloc.so.2.0.7 [.] talloc_named_const
>> + 0.46% smbd [kernel.kallsyms] [k] memset
>> + 0.46% smbd libtalloc.so.2.0.7 [.] _talloc_get_type_abort
>> + 0.45% smbd [kernel.kallsyms] [k] str2hashbuf_signed
>> + 0.45% smbd [kernel.kallsyms] [k] kfree
>> + 0.45% smbd libc-2.13.so [.] free
>> + 0.44% smbd [kernel.kallsyms] [k] __alloc_skb
>> + 0.42% smbd libtalloc.so.2.0.7 [.] talloc_is_parent
>> + 0.41% smbd libtalloc.so.2.0.7 [.] _talloc_array
>>
>>
>>
>>
>>
>> On Mon, Sep 30, 2013 at 5:39 PM, Jeremy Allison <jra at samba.org> wrote:
>>
>>> On Mon, Sep 30, 2013 at 05:21:44PM -0300, Thiago Fernandes Crepaldi
>>> wrote:
>>> > Andrew, in my company we are also experiencing a higher CPU usage of
>>> Samba
>>> > 4 (smbd) if compared to Samba 3.
>>> >
>>> > In fact, it almost reaches 100% of CPU and uses all the memory during
>>> *dir
>>> > copies* (individual file copy is as good as samba 3's). I strongly
>>> believe
>>> > that this CPU usage is the responsible for a worse samba 4's
>>> throughput if
>>> > compared to Samba 3 tests.
>>> >
>>> > Giving that, I would like to contribute with this investigation and
>>> share
>>> > my data regarding perf profiling on smbd (parent process)
>>> >
>>> > Events: 7 cycles
>>> > - 90.01% smbd [kernel.kallsyms] [k] copy_pte_range
>>> > copy_pte_range
>>> > __libc_fork
>>> > smbd_accept_connection
>>> > - 9.77% smbd [kernel.kallsyms] [k] handle_edge_irq
>>> > handle_edge_irq
>>> > smbd_accept_connection
>>> > - 0.22% smbd [kernel.kallsyms] [k] perf_pmu_rotate_start.isra.57
>>> > perf_pmu_rotate_start.isra.57
>>> > __poll
>>> > - 0.00% smbd [kernel.kallsyms] [k] native_write_msr_safe
>>> > native_write_msr_safe
>>> > __poll
>>>
>>> It's the client process that should have the interesting
>>> profile data, the parent is just going to sit there doing
>>> accept().
>>>
>>> Jeremy.
>>>
>>
>>
>>
>> --
>> Thiago Crepaldi
>>
>
>
>
> --
> Thiago Crepaldi
>
--
Thiago Crepaldi
More information about the samba
mailing list