Fwd: file operation is interrupted when using ctdb+nfs

Martin Schwenke martin at meltin.net
Fri Jan 5 02:13:46 UTC 2018


On Fri, 5 Jan 2018 08:28:52 +1000, ronnie sahlberg
<ronniesahlberg at gmail.com> wrote:

> On Fri, Jan 5, 2018 at 8:00 AM, Martin Schwenke via samba-technical
> <samba-technical at lists.samba.org> wrote:
> > On Thu, 4 Jan 2018 18:32:26 +0800 (CST), <zhu.shangzhong at zte.com.cn>
> > wrote:

> >> There are 3 CTDB nodes and 3 nfs-ganesha servers.  
> >  
> >> Their IP addresses are:        192.168.1.10,  192.168.1.11,  192.168.1.12.
> >  
> >> The CTDB public IP addresses are: 192.168.1.30,  192.168.1.31,  192.168.1.32.
> >  
> >> The client IP is 192.168.1.20. The NFS export directory is mounted
> >> on the client via public IP 192.168.1.30.
> >  
> >> I checked the CTDB logs: the public IP 192.168.1.30 was moved to
> >> another node (IP: 192.168.1.32)
> >  
> >> when the nfs-server(IP: 192.168.1.10) process was killed.  
> >
> > OK, that seems good.  :-)
> >
> > * When do you see the "stale file handle" message?  Immediately when
> >   the NFS Ganesha server is killed or after the failover?
> >
> >   If it happens immediately when the server is killed then CTDB is not
> >   involved and you need to understand what is happening at the NFS
> >   level.
> >
> > * Are you able to repeat the test against a single NFS Ganesha server
> >   on a single node?
> >
> >   This would involve killing the server, seeing what happens to the cp
> >   command on the client, checking if the file still exists in the
> >   server filesystem, and then restarting the server.
> >
> >   If killing the NFS Ganesha server causes the incomplete copy of the
> >   file to be deleted without communicating a failure to the client
> >   then this could explain the "stale file handle" message.
> >
> >   If this can't be made to work on a single node then it probably
> >   also can't be made to work by adding the extra complexity of CTDB.
> >
> > By the way, if you are able to reply inline instead of "top-posting"
> > then it is easier to respond to each part of your reply.  :-)
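
Concretely, that single-node test might look something like the
following.  The export path, mount point and file name here are
hypothetical, and the commands assume a systemd-managed nfs-ganesha;
adjust for your setup.

On the NFS node, simulate a server crash:

  # pkill -9 ganesha.nfsd

On the client (export mounted on /mnt), run the copy and see whether
and when it fails:

  # cp /tmp/bigfile /mnt/bigfile ; echo "exit status: $?"

Back on the NFS node, check whether the (partial) file is still in
the underlying filesystem:

  # ls -l /clusterfs/data/bigfile

... and then restart the server and see whether the client recovers:

  # systemctl start nfs-ganesha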

> As far as I recall, hitless NFS failover requires that the NFS
> filehandles remain invariant across the nodes in the cluster.
> I.e. regardless of which node you point to, the same file will always
> map to the exact same filehandle.
> ("Stale filehandle" just means: "I don't know which file this refers
> to", and it would either be caused by the NFS server (Ganesha) losing
> the inode<->filehandle mapping state when Ganesha is restarted,
> or it could mean that the underlying filesystem does not have the
> capability to make this possible from the server.)
> 
> GPFS/SpectrumScale does guarantee this for knfs.ko (and Ganesha) as
> long as you are careful and ensure that the fsid for the backend
> filesystem is the same across all the nodes.
> 
> 
> You would have to check if this is even possible to do with cephfs
> since in order to get this guarantee you will need support from the
> backing filesystem.
> There is likely not anything that CTDB can do here since it is an
> interaction between Ganesha and cephfs.
> 
> 
> One way to test for this would be to just do an NFSv3 LOOKUP for the
> same file from several Ganesha nodes in the cluster and verify with
> Wireshark that the filehandles are identical regardless of which node
> you use to access the file.
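
One quick way to eyeball that from the client is to print Wireshark's
filehandle-hash field for the LOOKUP traffic while touching the same
file via each node (untested; the interface name and display-filter
fields may need adjusting for your setup):

  # tshark -i eth0 -f 'port 2049' -Y 'nfs.procedure_v3 == 3' \
        -T fields -e ip.src -e nfs.fh.hash

The hash reported in the LOOKUP replies should be identical no matter
which node answered.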
> 
> With a little bit of effort, you can even automate this fully if you
> want to add it as a check for automatic testing.
> The way to do this would be to use libnfs, since it can expose the
> underlying NFS filehandle.
> You could write a small test program using libnfs that connects to
> multiple different IPs/nodes in the cluster, then uses nfs_open() to
> fetch a filehandle for the same file on each node, and then just
> compares the underlying filehandle inside the libnfs filehandle.
> I don't remember if dereferencing this structure is part of the
> public API or not, and I'm too lazy to check right now, so you might
> need to include libnfs-private.h if not.
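
To make that concrete, here is a rough, untested sketch along those
lines.  It uses the libnfs sync API and, as Ronnie says, dereferences
struct nfsfh via libnfs-private.h (so it may need to be built inside
the libnfs source tree); the fh.len/fh.val layout below matches recent
libnfs releases and may need tweaking for older ones:

/* fh-check.c - compare NFS filehandles for one file across nodes.
 *
 * Build:  gcc fh-check.c -o fh-check -lnfs
 * Usage:  ./fh-check <export> <path-in-export> <node-ip> [<node-ip>...]
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <nfsc/libnfs.h>
#include "libnfs-private.h"	/* struct nfsfh layout: NOT public API */

#define FH_MAX 256	/* NFSv3/v4 filehandles are at most 64/128 bytes */

static int fetch_fh(const char *server, const char *export,
		    const char *path, unsigned char *buf, int *len)
{
	struct nfs_context *nfs = nfs_init_context();
	struct nfsfh *fh = NULL;

	if (nfs == NULL) {
		fprintf(stderr, "failed to init nfs context\n");
		return -1;
	}
	if (nfs_mount(nfs, server, export) != 0 ||
	    nfs_open(nfs, path, O_RDONLY, &fh) != 0) {
		fprintf(stderr, "%s: %s\n", server, nfs_get_error(nfs));
		nfs_destroy_context(nfs);
		return -1;
	}

	/* Copy out the raw filehandle from the private structure
	 * (clamped to FH_MAX, which it should never exceed). */
	*len = fh->fh.len > FH_MAX ? FH_MAX : fh->fh.len;
	memcpy(buf, fh->fh.val, *len);

	nfs_close(nfs, fh);
	nfs_destroy_context(nfs);
	return 0;
}

int main(int argc, char **argv)
{
	unsigned char first[FH_MAX], cur[FH_MAX];
	int first_len = 0, cur_len, i;

	if (argc < 4) {
		fprintf(stderr, "usage: %s <export> <path> <node>...\n",
			argv[0]);
		return 1;
	}
	for (i = 3; i < argc; i++) {
		if (fetch_fh(argv[i], argv[1], argv[2], cur, &cur_len))
			return 1;
		if (i == 3) {		/* remember the first node's FH */
			memcpy(first, cur, cur_len);
			first_len = cur_len;
		} else if (cur_len != first_len ||
			   memcmp(cur, first, cur_len) != 0) {
			printf("MISMATCH on node %s\n", argv[i]);
			return 1;
		}
	}
	printf("OK: identical filehandle on all %d nodes\n", argc - 3);
	return 0;
}

A mismatch from any node means transparent failover can't work there,
no matter what CTDB does.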

Nice summary.  Thanks, Ronnie!

... and you can check device#/inode# consistency in the cluster
filesystem like this:

# onnode all stat -c '%d:%i' /clusterfs/data/foo

>> NODE: 10.0.0.31 <<
21:52494

>> NODE: 10.0.0.32 <<
21:52494

>> NODE: 10.0.0.33 <<
21:52494

While Samba provides a way of dealing with inconsistent device numbers
(https://www.samba.org/samba/docs/man/manpages/vfs_fileid.8.html), I'm
not sure whether NFS Ganesha has something similar.
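
For reference, on the Samba side that is just something along these
lines in smb.conf (see the manpage above for the available
algorithms):

  [global]
      vfs objects = fileid
      fileid:algorithm = fsid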

peace & happiness,
martin


