Samba and HSM

Thu Jan 11 23:21:27 GMT 2001

Bryan Feir writes:
 >    Well, in theory that's possible on the NFS side at least.  NFSv3 has an
 > error code specifically for this case: NFSERR_JUKEBOX.  From RFC1813:
 > 
 >    The server initiated the request, but was not able to complete it in a
 >    timely fashion. The client should wait and then try the request with a
 >    new RPC transaction ID.  For example, this error should be returned from
 >    a server that supports hierarchical storage and receives a request to
 >    process a file that has been migrated.  In this case, the server should
 >    start the immigration process and respond to client with this error.
 > 
 > The proposed NFSv4 has a nearly identical code called NFS4ERR_DELAY.
 > 
 >    So as long as both the NFS client and server fully support NFSv3/4, the
 > client can find out when the HSM server is trying to locate a file.  Now, of
 > course, getting that information to a userland program like Samba is another
 > matter entirely...

Unfortunately, that doesn't help us -- Samba's access to the
HSM-managed filesystem is not via NFS, it runs directly on the
server.  (Although we have run into cases where users have created
symlinks to automounted directories, so we do get similar NFS-related
failures from time to time.)

Since the problem seems to have been significant enough to require
something to happen immediately, I've written some fault monitoring
methods for the HA system that check to see if a user/client
combination has more than one smbd, and kill all except the newest smbd
process.  Obviously this addresses the symptoms, not the cause, but it
will at least make things manageable until a better solution is found.
It will also make it possible to do recovery on a system without 3000
smbd processes using up all of the memory/cpu.

This of course rests on the assumption that each user/client
combination should have only one smbd process.  Can anyone think of
any situations where a single user logged onto an NT workstation would
have more than one SMB connection open to the same server?

Actually, technically speaking (since this is an HA cluster), in
failover mode each user can legitimately have up to N smbd processes,
one for each logical host on the current physical host.  But these run
with separate names, config files, IP interfaces, etc., so it's easy to
sort out which is which.
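
As a rough illustration of the check described above (this is a
sketch, not the actual monitor code), the selection step might look
like the following.  It assumes the list of smbd processes has
already been gathered by some other means; the SmbdProc record, its
field names, and the stale_smbd_pids function are all hypothetical.
Processes are grouped by logical host, user, and client, and every
process except the newest in each group is marked for killing.

```python
from collections import defaultdict
from typing import NamedTuple


class SmbdProc(NamedTuple):
    """Hypothetical record for one smbd process (field names are assumptions)."""
    pid: int
    host: str          # logical host the smbd is serving
    user: str
    client: str        # client machine name or address
    start_time: float  # process start time, seconds since the epoch


def stale_smbd_pids(procs):
    """Return PIDs of every smbd except the newest per (host, user, client)."""
    groups = defaultdict(list)
    for p in procs:
        groups[(p.host, p.user, p.client)].append(p)

    stale = []
    for members in groups.values():
        if len(members) > 1:
            # Newest = latest start time; the PID only breaks exact ties.
            newest = max(members, key=lambda p: (p.start_time, p.pid))
            stale.extend(p.pid for p in members if p.pid != newest.pid)
    return sorted(stale)
```

The monitor would then signal each returned PID (for example with
os.kill).  Keeping the logical host in the grouping key means that
the legitimate extra smbd processes seen in failover mode are not
mistaken for duplicates.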

I've also seen a weird failure mode where a user's first smbd process
locks up for whatever reason (NFS, HSM...), and then all of the user's
subsequent smbd processes get caught in a loop, attempting to get the
first process to break the oplocks it currently owns.  Today I saw one
user with about 20 of these.  Once I killed the oldest smbd process
that initiated the situation, all of the rest exited.  Subject for
further investigation...

-Andrew