Fixed: OpLocks caused the corruptions/slowness (Was: How Samba let us down)
Chris de Vidal
cdevidal at yahoo.com
Wed Oct 23 22:07:00 GMT 2002
My first post, for reference:
When the new NT server's hard drive died, we decided
to keep hobbling along on Samba. Meanwhile, my
supervisor was searching around on OpLock issues on
Google and he saw other people that were having
similar problems. We disabled all OpLocks (kernel,
level I and level II; kernel at the global level, levels
I and II at the share) early this morning, and since then things
have been fine! Yesterday and the day before, the
problem appeared quickly, so (knock on wood), I think
we did fix it. Time will tell.
Yes, disabling OpLocks was the ONLY change. See the
bottom for what I think the problem was.
I got so many emails on this thread, I decided to sum
up the answers to some of the questions:
* I doubt adding a WINS server would have fixed the
problem, because the random slowness was ONLY
happening on the new server AFTER connecting (and the
client cached the IP in the NetBIOS cache). ALL other
servers were just fine, all of the time. But I would
like to add a WINS server soon, anyway.
* We are not using any Samba print facilities, only
print queues on NT (explained in the first email, but
it was buried in there). Lpr isn't even installed.
* We are using RedHat 7.3 (no ACLs included) but
created a custom kernel (2.4.19) with ext3 ACL support
and installed all of the userland ACL tools.
* Nothing but Samba on Linux is accessing the files -
no NFS, no file copies, scans, etc.
* The corruption was missing records. It would
interrupt the print process and the Opus analysis
indicated hundreds of records were missing. It would
happen in random places in print files (hundreds of
megs to gigs in size), and only seldom did it not happen.
I have since learned that the print preprocessing
server (Elixir's Opus) works with large flat database
files (glorified spreadsheets) and uses several
processes spread across multiple servers,* to apply
the data to laser printer templates. The Opus server
ONLY accessed our server using Samba; no other Linux
software had been installed, like nfs or lpr.
* I think. It might be one server with many
processes. Here is the Opus website:
This scenario sounds like the corruption one might
experience with Access (which ALSO is a flat,
glorified spreadsheet database often accessed by
multiple processes/users) and OpLocks. As I mentioned
above, my supervisor found other people with similar
problems. I also got confirmation from a friend and
technical author (he contributed to one of the more
notable Samba books). If it is _officially_
recognized by the developers as a caveat, it ought to
be put into the docs/manpages. I apologize if it IS
there but I missed it.
Anyway, it appears to have been fixed. I don't yet
know what kind of performance hit we will see, but so
far, so good.
So if *you* see similar problems, first try disabling
ALL OpLocks (kernel at the global level, the other 2
at the share level). We might re-enable kernel, then
regular, then level2 oplocks later to see if it was
just one particular type.
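A minimal smb.conf sketch of that first step, for anyone who wants to try it. The share name and path are made up for illustration, and the placement of each parameter follows the description above (kernel oplocks in [global], the other two per share); check the smb.conf manpage for your Samba version before relying on it.

```ini
[global]
    # Kernel oplocks are switched off server-wide.
    kernel oplocks = no

# Hypothetical share holding the large print files.
[printdata]
    path = /srv/printdata
    # Level I (exclusive) oplocks off for this share.
    oplocks = no
    # Level II (read-only) oplocks off for this share.
    level2 oplocks = no
```

To narrow it down later, set one of these back to "yes" at a time and watch whether the corruption returns.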
Thanks to everyone who responded!
More information about the samba-technical