CTDB: shared path on NFS3: successful file lock only with soft mount

Thu Jan 22 20:05:51 MST 2015

Hi Thomas,

On Thu, 22 Jan 2015 18:38:02 +0100, Thomas Hartmann
<thomas.hartmann at kit.edu> wrote:

> I stumbled over a problem that I would like to understand better:
> 
> I am running CTDB with the shared fs on NFS3 mounted on all nodes. For
> maintenance, I had to move to a different export and, thus, stopped my
> services, mounted the new export, rsync'ed everything, remounted the new
> export under the old mount path and restarted everything.
> 
> However, after the restart (even with only one node alone) the cluster
> startup failed. The daemon could create the lock file (checked
> permissions, NFS UIDs/GIDs etc.) when it was manually removed, but
> complained that it could not get the recovery lock (although it had
> just created the file).
> 
> e.g.,
>
> 2015/01/22 17:38:28.039885 [recoverd:22602]: Taking out recovery lock
> from recovery daemon
> 2015/01/22 17:38:28.039944 [recoverd:22602]: Take the recovery lock
> 2015/01/22 17:38:34.045139 [recoverd:22602]: ctdb_recovery_lock: Failed
> to get recovery lock on '/ctdb/fts3/CTDB_FTS3/lock'
> 2015/01/22 17:38:34.045189 [recoverd:22602]: Unable to get recovery lock
> - aborting recovery and ban ourself for 300 seconds
> 2015/01/22 17:38:34.045294 [22412]: Banning this node for 300 seconds
> 
> 
> After some testing I mounted the paths manually and was able to start up
> CTDB with NFS 'soft'-mounted. Before that, I had autofs set up to
> 'hard'-mount the NFS export.
> 
> not working:
> -hard,bg,noac,intr,rsize=8192,wsize=8192
> 
> working:
> -soft,bg,noac,intr,rsize=8192,wsize=8192
> 
> Since I just replaced the IP and path in the export and did not touch
> the actual mount options, I have no idea why it worked before.
> 
> Anyway, I am now mounting the export 'soft' on all nodes and would like
> to understand how it works.
> My best guess so far is that maybe the daemon puts an open file handle
> on the lock file and this differs somehow between 'soft' and 'hard'.

From my understanding of the hard and soft options that doesn't make
sense, so I understand your need for an explanation!

I haven't used NFS as a shared filesystem for clustered Samba so I'm not
sure of the limitations.

My suggestion would be to extract the relevant code from the
ctdb_recovery_lock() function in server/ctdb_recover.c into a
standalone program.  That way you can run some tests without involving
all of CTDB.  You could use getchar(3) or similar to cause the program
to block so that you can hold the lock until you actually want to
release it.

I would be interested to know what you find out.

peace & happiness,
martin