Re: Reply: Re: Fwd: file operation is interrupted when using ctdb+nfs

ronnie sahlberg ronniesahlberg at gmail.com
Fri Jan 5 17:27:52 UTC 2018


On Fri, Jan 5, 2018 at 7:26 PM,  <zhu.shangzhong at zte.com.cn> wrote:
> Thanks ronnie and martin.
>
>> >When do you see the "stale file handle" message? Immediately when
>> >the NFS Ganesha server is killed or after the failover?
> The "stale file handle" message will be output after the failover.
>
>> > * Are you able to repeat the test against a single NFS Ganesha server
>> > on a single node?
> No. The client operation hangs, and it continues once the NFS Ganesha process is started again.
>
> [root at ceph ~]# onnode all stat -c '%d:%i' /tmp/test/nfs-test-1.iso
>>> NODE: 192.168.1.10 <<
> 39:1099511633885
>
>>> NODE: 192.168.1.11 <<
> 40:1099511633885
>
>>> NODE: 192.168.1.12 <<
> 40:1099511633885


This is likely the problem. The filehandle is usually composed of
#device/#inode (and a few other things), so if the #device
differs between the nodes, then the filehandle will differ as well.
Node 0 has a different #device from the other nodes, so I bet that
failover to/from node 0 will always end up with a stale filehandle
(while failover between node 1 and node 2 might work).



I have created a small tool in libnfs that can be used to print the
filehandle for a file:
https://github.com/sahlberg/libnfs

It is an example utility, so use "./configure --enable-examples" to build it.

Then run it like this:
./examples/nfs-fh nfs://127.0.0.1/data/sahlberg/rbtree

It will print the NFS filehandle for the specified nfs object.
This filehandle must be identical on all nodes in order for failover
to work between the nodes.
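
For example, a quick way to compare the handles across all nodes is
something like this (an untested sketch; the node addresses and the path
under the export are just placeholders for your setup):

for ip in 192.168.1.10 192.168.1.11 192.168.1.12; do
    printf '%s: ' "$ip"
    ./examples/nfs-fh "nfs://$ip/<export-path>/nfs-test-1.iso"
done

If the handles printed are not identical, failover between those nodes
will result in stale filehandles.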


Martin, the nfs-fh tool might be useful to add to ctdb.
onnode all nfs-fh nfs://127.0.0.1/<path-to-object-in-cluster-fs>
or something like it could be used to just verify that all nodes are
configured properly for NFS.


Maybe even an event script check that all is OK?
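
A very rough, untested sketch of what such a check could do (assuming the
nfs-fh tool is installed somewhere in the PATH; the node addresses and the
canary file below are only placeholders, and a real event script would have
to follow the usual ctdb event script conventions):

#!/bin/sh
# Sketch: verify that one canary file in the clustered export maps to the
# same NFS filehandle through every node, and fail if it does not.
NODES="192.168.1.10 192.168.1.11 192.168.1.12"   # placeholder node addresses
CANARY="<export-path>/nfs-fh-canary"             # placeholder file in the export

count=$(for ip in $NODES; do
            nfs-fh "nfs://$ip/$CANARY"
        done | sort -u | wc -l)

if [ "$count" -ne 1 ]; then
    echo "ERROR: NFS filehandles for $CANARY differ across nodes"
    exit 1
fi
exit 0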

>
> ------------------Original Message------------------
> From: <samba-technical at lists.samba.org>;
> To: <ronniesahlberg at gmail.com>;
> Cc: Zhu Shangzhong 10137461; <samba-technical at lists.samba.org>;
> Date: 2018-01-05 10:17
> Subject: Re: Fwd: file operation is interrupted when using ctdb+nfs
> On Fri, 5 Jan 2018 08:28:52 +1000, ronnie sahlberg
> <ronniesahlberg at gmail.com> wrote:
>
>> On Fri, Jan 5, 2018 at 8:00 AM, Martin Schwenke via samba-technical
>> <samba-technical at lists.samba.org> wrote:
>> > On Thu, 4 Jan 2018 18:32:26 +0800 (CST), <zhu.shangzhong at zte.com.cn>
>> > wrote:
>
>> >> There are 3 CTDB nodes and 3 nfs-ganesha servers.
>> >
>> >> Their IP addresses are:       192.168.1.10,  192.168.1.11,  192.168.1.12.
>> >
>> >> The CTDB public IP address is: 192.168.1.30,  192.168.1.31,  192.168.1.32.
>> >
>> >> The client IP is 192.168.1.20. The NFS export directory is mounted
>> >> to the client with public IP 192.168.1.30.
>> >
>> >> I checked the CTDB logs, the public IP 192.168.1.30 was moved to
>> >> another node(IP: 192.168.1.32)
>> >
>> >> when the nfs-server(IP: 192.168.1.10) process was killed.
>> >
>> > OK, that seems good.  :-)
>> >
>> > * When do you see the "stale file handle" message?  Immediately when
>> >   the NFS Ganesha server is killed or after the failover?
>> >
>> >   If it happens immediately when the server is killed then CTDB is not
>> >   involved and you need to understand what is happening at the NFS
>> >   level.
>> >
>> > * Are you able to repeat the test against a single NFS Ganesha server
>> >   on a single node?
>> >
>> >   This would involve killing the server, seeing what happens to the cp
>> >   command on the client, checking if the file still exists in the
>> >   server filesystem, and then restarting the server.
>> >
>> >   If killing the NFS Ganesha server causes the incomplete copy of the
>> >   file to be deleted without communicating a failure to the client
>> >   then this could explain the "stale file handle" message.
>> >
>> >   If this can't be made to work then it probably also isn't possible
>> >   by adding more complexity with CTDB.
>> >
>> > By the way, if you are able to reply inline instead of "top-posting"
>> > then it is easier to respond to each part of your reply.  :-)
>
>> As far as I recall,
>> hitless NFS failover requires that the NFS filehandles remain
>> invariant across the nodes in the cluster.
>> I.e. regardless of which node you point to, the same file will always map
>> to the exact same filehandle.
>> (Stale filehandle just means: "I don't know which file this refers
>> to" and it would either be caused by the NFS server (Ganesha) losing
>> the inode<->filehandle mapping state when Ganesha is restarted
>> or it could mean that the underlying filesystem does not have the
>> capability to make this possible from the server.)
>>
>> GPFS/SpectrumScale does guarantee this for knfs.ko (and Ganesha) as
>> long as you are careful and ensure that the fsid for the backend
>> filesystem is the same across all the nodes.
>>
>>
>> You would have to check if this is even possible to do with cephfs
>> since in order to get this guarantee you will need support from the
>> backing filesystem.
>> There is likely not anything that CTDB can do here since it is an
>> interaction between Ganesha and cephfs.
>>
>>
>> One way to test for this would be to just do a NFSv3/LOOKUP to the
>> same file from several Ganesha nodes in the cluster and verify with
>> wireshark that the filehandles are identical regardless of which node
>> you use to access the file.
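
(A quick, untested sketch of that check, assuming tshark is available on
the client; the exact field names can differ between wireshark versions:

tshark -f "port 2049" -Y "nfs.fh.hash" -T fields -e ip.src -e nfs.fh.hash

If the reported hash for the same file differs depending on which server
address answered, the filehandles are not invariant across the cluster.)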
>>
>> With a little bit of effort, you can even automate this fully if you
>> want to add this as a check for automatic testing.
>> The way to do this would be to use libnfs, since it can expose the
>> underlying nfs filehandle.
>> You could write a small test program using libnfs that would connect
>> to multiple different ip's/nodes in the cluster, then
>> use nfs_open() to fetch a filehandle for the same file on different
>> nodes and then just compare the underlying filehandle in the
>> libnfs filehandle.
>> I don't remember if dereferencing this structure is part of the public
>> API or not, and too lazy to check right now, so you might
>> need to include libnfs-private.h if not.
>
> Nice summary.  Thanks, Ronnie!
>
> .... and you can check device#/inode# consistency in the cluster
> filesystem like this:
>
> # onnode all stat -c '%d:%i' /clusterfs/data/foo
>
>>> NODE: 10.0.0.31 <<
> 21:52494
>
>>> NODE: 10.0.0.32 <<
> 21:52494
>
>>> NODE: 10.0.0.33 <<
> 21:52494
>
> While Samba provides a way of dealing with inconsistent device#s
> (https://www.samba.org/samba/docs/man/manpages/vfs_fileid.8.html) I'm
> not sure if NFS Ganesha also has something like that.
>
> peace & happiness,
> martin


