Samba and HSM

Tue Jan 9 20:22:51 GMT 2001

Hello-

We've got a bit of a variation of the "multiple smbd processes on an
NFS-mounted filesystem" problem - I'm wondering if someone can be of
help.  In this case, it's not NFS that's the problem, but HSM (which
can cause symptoms similar to a non-responding NFS filesystem).

We are currently running Samba 2.0.6 on our Sun fileservers, primarily
for home directories and group data.  All of the filesystems that are
accessed via Samba are managed by Veritas Storage Migrator (HSM),
using optical and tape (in that order) as secondary storage.

The main symptom we are experiencing is with multiple smbd processes
per user.  The severity of the problem varies depending on the root
cause.  Here are the scenarios that come up:

1. Access to files migrated to tape (blocks smbd)

Files migrated to tape can take as long as 10 minutes to retrieve.
While attempting to access such a file, the Windows NT redirector
times out after about 45 seconds, and opens up a new connection
(spawning a second smbd process).  This happens until retrieval is
complete.  The main symptom of this problem is that a user gets
sharing violations trying to access their own files -- this is because
the blocked smbd process holds locks on other files, and the
"new" smbd process cannot work with these locks.  One thing that may help
matters here is to increase the redirector timeout to wait longer (if
I can ever get our NT admin folks to push out the REG file to all the
clients!).  Unfortunately, this is the least serious of our problems.

2. Full filesystems cause many smbd processes to appear

This is similar to 1. except that access to an entire filesystem is
blocked until the migration system has reclaimed space (i.e. migrated
files out in response to a full disk).  In this case, the scenario
above happens to every user accessing that filesystem until the space
situation is resolved.  This time varies depending on migration criteria,
responsiveness of the secondary media, and sysadmin response time.

3. General HSM failure

This is obviously the worst situation - the HSM system stops
responding due to some failure, access to all filesystems is blocked,
and the smbd process load doubles after 45 seconds, and then continues
to increase by that amount every 45 seconds, until the underlying
problem is fixed.  With 1500+ smbd processes using up all of the system
memory and process space, fixing that underlying problem gets difficult.
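
To put rough numbers on scenario 3, here is a back-of-the-envelope
sketch in Python.  It assumes every client starts with one smbd and
opens exactly one more connection per 45-second timeout, and that no
blocked smbd ever exits.  The function name and the 150-client figure
are made up for illustration; they are not measurements from our
servers.

```python
def smbd_count(clients, seconds_blocked, timeout=45):
    """Estimate the number of smbd processes after a stall.

    Hypothetical model: one smbd per client to begin with, plus one
    more per client for every full redirector timeout that elapses
    while the HSM system is not responding.
    """
    retries = seconds_blocked // timeout
    return clients * (1 + retries)


# 150 clients (made-up figure), HSM unresponsive for 7 minutes
print(smbd_count(150, 7 * 60))
```

Under those assumptions the growth is linear in time, so the process
count is driven by how long the stall lasts as much as by how many
users are connected.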

Does anyone have ANY suggestions for getting around this problem?  The
main problem here is that the client is allowed to time out and tell
Samba to fork off another smbd process.  One suggestion I've seen is
to set keepalives in the smb.conf file (i.e. keepalive = 30), but
whether this will work will depend on which process is handling the
keepalives.  If the child smbd processes handle the keepalives,
it probably won't help matters since smbd won't be able to send/receive
keepalives when it is blocking on a read() or write() system call
(which is what happens when HSM is unable to immediately satisfy a
request).
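
The concern about blocking can be shown with a toy example.  This is
NOT smbd code, just a Python sketch of a worker that does its
keepalive work in the same flow of control as a blocking read().  The
pipe stands in for a file that HSM has not recalled yet, and all of
the names here are hypothetical.

```python
import os
import threading
import time


def serve_one_request(read_fd, log, started):
    """Worker: one blocking read, then the keepalive step."""
    log.append(("read_start", time.monotonic()))
    started.set()
    os.read(read_fd, 1)  # blocks until data arrives (the "recall")
    # The keepalive step is only reachable once the read has returned.
    log.append(("keepalive", time.monotonic()))


def demo(stall_seconds=0.5):
    """Return how long the keepalive was held up by the blocked read."""
    read_fd, write_fd = os.pipe()
    log = []
    started = threading.Event()
    worker = threading.Thread(
        target=serve_one_request, args=(read_fd, log, started)
    )
    worker.start()
    started.wait()             # make sure the worker is about to read
    time.sleep(stall_seconds)  # stand-in for the HSM stall
    os.write(write_fd, b"x")   # the data finally shows up
    worker.join()
    os.close(read_fd)
    os.close(write_fd)
    events = dict(log)
    return events["keepalive"] - events["read_start"]


print(f"keepalive held up for {demo():.2f}s")
```

The keepalive is delayed by at least the length of the stall, which is
the behavior I'm worried about if each child smbd is responsible for
its own keepalives.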

Oh, to make matters worse, this is a two-node SunCluster HA cluster,
with separate Samba configs per logical host (binding to separate
logical interfaces).  This means it's not unusual for a user to have
two smbd processes running when both logical hosts are failed over to
the same physical host...

Any hints??  We're probably the only site insane enough to combine
Samba + HSM + SunCluster.. :-) :-S

-Andrew Cherry
 UNIX System Admin
 Cummins Engine Company