svn commit: samba r21278 - in branches/SAMBA_3_0/source/smbd: .

Sat Feb 17 12:18:49 GMT 2007

On Sun, Feb 11, 2007 at 05:59:53PM -0800, Jeremy Allison wrote:
> Can you run cachegrind on both methods to see why the
> memcpy ends up taking less time ?

Ok, without the patch for 10000 runs cachegrind says:

==6739== I   refs:      863,858,854
==6739== I1  misses:        399,846
==6739== L2i misses:          9,002
==6739== I1  miss rate:        0.04%
==6739== L2i miss rate:        0.00%
==6739== 
==6739== D   refs:      813,144,141  (50,736,568 rd + 762,407,573 wr)
==6739== D1  misses:     47,305,786  ( 1,361,403 rd +  45,944,383 wr)
==6739== L2d misses:         36,191  (    21,991 rd +      14,200 wr)
==6739== D1  miss rate:         5.8% (       2.6%   +         6.0%  )
==6739== L2d miss rate:         0.0% (       0.0%   +         0.0%  )
==6739== 
==6739== L2 refs:        47,705,632  ( 1,761,249 rd +  45,944,383 wr)
==6739== L2 misses:          45,193  (    30,993 rd +      14,200 wr)
==6739== L2 miss rate:          0.0% (       0.0%   +         0.0%  )

With current code (patch in) I get:

==6516== I   refs:      830,673,905
==6516== I1  misses:        398,240
==6516== L2i misses:          8,982
==6516== I1  miss rate:        0.04%
==6516== L2i miss rate:        0.00%
==6516== 
==6516== D   refs:      812,838,899  (83,644,778 rd + 729,194,121 wr)
==6516== D1  misses:     47,302,250  ( 3,448,001 rd +  43,854,249 wr)
==6516== L2d misses:         32,801  (    21,771 rd +      11,030 wr)
==6516== D1  miss rate:         5.8% (       4.1%   +         6.0%  )
==6516== L2d miss rate:         0.0% (       0.0%   +         0.0%  )
==6516== 
==6516== L2 refs:        47,700,490  ( 3,846,241 rd +  43,854,249 wr)
==6516== L2 misses:          41,783  (    30,753 rd +      11,030 wr)
==6516== L2 miss rate:          0.0% (       0.0%   +         0.0%  )

To my eyes the main difference is in the L2d misses: 36,191
without the patch vs 32,801 with it, and nearly all of that
difference is in the writes (14,200 vs 11,030). That's 10%
more L2d misses in the slower case. That case is also almost
exactly 10% slower when measured without valgrind.
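
For anyone who wants to check the arithmetic, here is a minimal
sketch that recomputes the comparison from the L2d miss counts
pasted above. The variable and function names are only
illustrative, not anything from the Samba tree, and the numbers
are copied by hand from the two cachegrind summaries.

```python
# L2d miss counts copied from the two cachegrind summaries above.
# Names are illustrative only; nothing here comes from the Samba source.
without_patch = {"rd": 21_991, "wr": 14_200}
with_patch = {"rd": 21_771, "wr": 11_030}


def total(run):
    """Total L2d misses (reads plus writes) for one run."""
    return run["rd"] + run["wr"]


def percent_more(slow, fast):
    """How many percent larger `slow` is than `fast`."""
    return 100.0 * (slow - fast) / fast


extra_total = total(without_patch) - total(with_patch)
extra_reads = without_patch["rd"] - with_patch["rd"]
extra_writes = without_patch["wr"] - with_patch["wr"]

print(f"L2d misses: {total(without_patch)} vs {total(with_patch)}")
print(f"extra misses without the patch: {extra_total} "
      f"({percent_more(total(without_patch), total(with_patch)):.1f}% more)")
print(f"of which reads: {extra_reads}, writes: {extra_writes}")
```

The extra misses come out at roughly 10% of the patched total,
with writes accounting for almost all of them.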

I'm not sure this holds for all machines, but whenever I
read about machine architecture I get the impression that
on modern machines the main memory bus is increasingly
the bottleneck. So I'd vote for this to go in.

Jeremy, can you merge r21278 across if you like it? Or
should I?

Volker