[Samba] "file not found" under high-contention

starlight.2012q2 at binnacle.cx
Wed May 9 11:54:06 MDT 2012


For several years I've been experiencing
an intermittent Samba error when running
a very intense, highly parallel build/compile

A file is reported as "not found" even though
it most certainly exists and re-running the
compile jobset always succeeds.

Samba version is 3.6.4 running on CentOS 5.8
with 64-bit kernel 2.6.18-308.4.1.el5.  Windows
side is 64-bit Windows Server 2008 (NT 6.1) with
latest updates.  Used to see same problem
with W2K3 64-bit and CentOS 4 on similar

Windows machine is attached via Infiniband
directly to CentOS machine where the files
are hosted.  Other Linux systems access
the compile directory with NFSv3 over
gigabit ethernet.

"kernel oplocks = no" is set due to troublesome
behavior where open files have their modify
time temporarily set to the present as seen
from NFSv3.  This causes 'make' to rebuild
objects unnecessarily.  Since the compile
jobs never attempt to write the same file
turning off kernel oplocks appears to have
no downsides.
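
For reference, the setting sits in smb.conf
roughly as follows.  Placing it under [global]
is an assumption about this setup, and the
rest of the file is omitted:

    [global]
        # Do not hand oplock handling to the
        # Linux kernel; avoids the modify-time
        # jump seen from the NFSv3 clients.
        kernel oplocks = no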

The new version of Samba performs better,
but that now makes the failure
happen more often, to the point where
it's becoming rather annoying.  Have been
hoping someone else would figure this out
and fix it, but waiting four years hasn't
done it.

I hate figuring out these sorts of problems,
but am becoming resigned to attempting it.
Can anyone suggest an efficient approach
for narrowing down and identifying where
the problem is?  I don't see writing a 
test case as possible with the information
available at present.


