[linux-cifs-client] RE: [RFC]

Santos, Jose Renato G joserenato.santos at hp.com
Mon Aug 20 04:47:02 GMT 2007


 

> -----Original Message-----
> From: oprofile-list-bounces at lists.sourceforge.net 
> [mailto:oprofile-list-bounces at lists.sourceforge.net] On 
> Behalf Of Kirill Kuvaldin
> Sent: Sunday, August 19, 2007 1:59 PM
> To: linux-cifs-client at lists.samba.org
> Cc: linux-fsdevel at vger.kernel.org; oprofile-list at lists.sourceforge.net
> Subject: [RFC]
> 
> Hi,
> 
> I tried running oprofile to identify performance bottlenecks 
> of the linux cifs client. I was doing a large read from the 
> samba server to the client machine over a gigabit ethernet 
> connection. The oprofile results indicated that nearly 70% of 
> the samples were attributed to the cifs_readpages() function:
> 
> Counted CPU_CLK_UNHALTED events (Unhalted core cycles) with a unit mask 
> of 0x00 (Unhalted core cycles) count 60000
> warning: could not check that the binary file /cifs has not 
> been modified since the profile was taken. Results may be inaccurate.
> samples  %        symbol name
> 17690    68.5951  cifs_readpages
> 1111      4.3080  cifs_demultiplex_thread
> 850       3.2960  cifs_writepages
> 768       2.9780  is_valid_oplock_break
> 747       2.8966  cifs_closedir
> 464       1.7992  SendReceive2
> 338       1.3106  sesInfoFree
> 255       0.9888  DeleteMidQEntry
> 255       0.9888  allocate_mid
> 237       0.9190  decode_negTokenInit
> 212       0.8221  SendReceive
> 212       0.8221  wait_for_response
> 184       0.7135  cifs_fsync
> 180       0.6980  header_assemble
> 168       0.6514  CIFSSMBRead
> 161       0.6243  cifs_close
> 139       0.5390  cifs_NTtimeToUnix
> ...
> 
> Looking further into "opannotate --assembly" results, I 
> noticed that virtually all sample hits were attributed to the 
> "rep movsl %ds:(%esi),%es:(%edi)" instruction
> 
>                :   129d9:       mov    0x40(%esp),%esi
>  17187 66.6447 :   129dd:       rep movsl %ds:(%esi),%es:(%edi)
>      4  0.0155 :   129df:       subl   $0x1000,0x44(%esp)
> 
> 
> which corresponds to the "memcpy(target, data, 
> PAGE_CACHE_SIZE);" line of the cifs_copy_cache_pages() 
> function, which was inlined by gcc.
> 
> This seemed to mean that most of the cifs running time is 
> spent in memcpy, copying the data from the temporary buffer 
> allocated by cifs_demultiplex_thread into the page cache.
> 
> My first thought was that if we managed to avoid this 
> unnecessary copy and read directly from the socket into the 
> page cache, there would be a performance boost on reads. I 
> tried modifying the cifs source code to eliminate that memcpy 
> -- in fact, I only commented out the memcpy line of 
> cifs_copy_cache_pages() just to get an idea of how big the 
> performance win could be if we eliminated the unnecessary 
> copy, without actually updating the pages... Surprisingly 
> enough, I didn't notice a big difference between the 
> unmodified and modified versions (the latter was about 3-5% 
> faster) when copying a big file on the same setup.
> 
> Are the oprofile results inaccurate in some way, or am I 
> missing something important?
> A single instruction (the memcpy) accounting for most of the 
> cifs running time seems really odd...
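
For context, the copy the profile is pointing at is the per-page
memcpy() loop in cifs_copy_cache_pages(). Stripped of error handling,
it has roughly the following shape (a simplified sketch reconstructed
from the description above, with an illustrative function name -- not
the exact fs/cifs/file.c code):

/*
 * Simplified sketch of the copy path described above -- NOT the exact
 * fs/cifs/file.c source; names and details are approximations.  The
 * SMB read response has already been received into a temporary buffer
 * by cifs_demultiplex_thread(); this loop then copies it into the
 * page cache one page at a time.
 */
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/string.h>

static void copy_into_page_cache_sketch(struct list_head *pages,
                                        char *data, int bytes_read)
{
        struct page *page;
        char *target;

        while (bytes_read > 0 && !list_empty(pages)) {
                page = list_entry(pages->prev, struct page, lru);
                list_del(&page->lru);

                target = kmap(page);
                /* the memcpy() that shows up above as the hot "rep movsl" */
                memcpy(target, data, PAGE_CACHE_SIZE);
                kunmap(page);

                flush_dcache_page(page);
                SetPageUptodate(page);
                unlock_page(page);

                data += PAGE_CACHE_SIZE;
                bytes_read -= PAGE_CACHE_SIZE;
        }
}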

A memory access instruction can take a long time if it causes an L3
cache miss, or be much faster if the data is already in the L1 cache;
the time varies depending on which cache level the data is found in.
The high cost of your original memcpy() is probably due to cache
misses (most likely L3 misses). Removing the memcpy() probably just
moves those cache misses to some later point in the code that accesses
the same data, a point that used to be fast because the data was
already in the cache (I am not sure where exactly, as I am not
familiar with the cifs code). This would explain why removing the
memcpy() does not provide big savings.
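
To illustrate the effect in userspace (this is not CIFS code; the
buffer sizes, the helper and file names, and the build command are
assumptions), the same memcpy() can be timed once with a cold source
and once with a warm one; the copy instruction is identical, only the
cache residency changes:

/*
 * Userspace illustration only -- not CIFS code.  HOT_SIZE is assumed to
 * fit in the CPU cache and EVICT_SIZE to be much larger than it; adjust
 * for your machine.  Build with something like:
 *     gcc -O2 cache_cost_demo.c -lrt
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define HOT_SIZE   (1UL  * 1024 * 1024)   /* assumed to fit in cache       */
#define EVICT_SIZE (64UL * 1024 * 1024)   /* assumed larger than any cache */

static double now_sec(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        char *src = malloc(HOT_SIZE), *dst = malloc(HOT_SIZE);
        char *evict = malloc(EVICT_SIZE);
        volatile char sink;
        double t0, t1, t2;
        size_t i;

        if (!src || !dst || !evict)
                return 1;

        memset(src, 1, HOT_SIZE);      /* fault in and warm src */
        memset(dst, 0, HOT_SIZE);      /* fault in dst so page faults do not skew timing */
        for (i = 0; i < EVICT_SIZE; i++)
                evict[i] = (char)i;    /* walk a large buffer to push src/dst out of cache */

        t0 = now_sec();
        memcpy(dst, src, HOT_SIZE);    /* cold data: this copy pays the cache misses */
        sink = dst[HOT_SIZE - 1];      /* keep the first copy observable to the compiler */
        t1 = now_sec();
        memcpy(dst, src, HOT_SIZE);    /* warm data: same instruction, much cheaper */
        t2 = now_sec();

        printf("cold copy %.4f s, warm copy %.4f s (%d)\n",
               t1 - t0, t2 - t1, sink + dst[0] + evict[0]);
        free(src); free(dst); free(evict);
        return 0;
}

On a typical machine the first copy is many times slower than the
second even though the instructions executed are identical; whichever
code touches the cold data first pays for bringing it in from memory,
so removing the copy mostly just hands that cost to the next consumer
of the data.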
You can verify this hypothesis by profiling cache misses and
confirming where they occur in the modified and unmodified code.
I hope this helps.

Regards

Renato


