CTDB fail-over fails, NFS only, glusterfs backend, for XenServer VMs

Thu Feb 2 06:28:53 MST 2012

On 2/2/12 10:15 AM, Orlando Richards wrote:
> Hi Martin,
>
>
> On -10/01/37 20:59, Martin Gombac wrote:
>> Hi All,
>>
>> I have a problem with my 2-node CTDB NFS cluster when nodes take over
>> foreign resources.
>>
>> Only NFS HA, 2 nodes for now, will grow later.
>> CentOS 6.2: 2.6.32-220.4.1.el6.x86_64
>> CTDB: 1.0.114.3
>> Gluster 3.2.5 for shared storage.
>> XenServer 5.6 SP2 for NFS client (= 5.*).
>>
>> Once both nodes are up, the NFS service works using either of the two
>> nodes concurrently (round robin DNS). But once I shut down one NFS
>> storage node, everything fails. :-)
>
> When you say "shut down", what do you mean? Do you mean shutting down
> ctdb (service ctdb stop), or doing a shutdown of the node?
>
> The reason I ask is because the event script timeout below looks to me
> like the filesystem has paused (possibly due to shutting down one of the
> file system servers, thus causing it to go into recovery mode), and then
> the NFS script which writes the tickle info to the filesystem is timing
> out. We see this even with GPFS under certain circumstances. Some of the
> GPFS timeouts are set to 60 seconds, so as a workaround I've done:
>
> ctdb setvar EventScriptTimeout 90
>
> and added:
> CTDB_SET_EventScriptTimeout=90
> to /etc/sysconfig/ctdb
>
> (changing from the default of 30). We've only just started noticing this
> though (previously, we didn't manage NFS with CTDB), so I don't know if
> this tweak will work or not, or what the repercussions of such a large
> timeout will be!
>
> Hope that helps,
>
> --
> Orlando
>
>

Hi,

I mean shutting down the whole node, either with shutdown -h now or by
unplugging the power cord.

Thank you, Orlando, for your suggestion. I had the same feeling: GlusterFS
probably does pause when it loses one node. I will just increase the
timeout values.
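
For the archive, here is a minimal sketch of how that increase could be
scripted, following Orlando's two steps (runtime change plus a
CTDB_SET_EventScriptTimeout line in /etc/sysconfig/ctdb). The helper name
set_ctdb_tunable is hypothetical, not part of CTDB, and the value 90 is
just the one from the quoted mail, not a tested recommendation. Note that
ctdb(1) documents setvar as taking the name and the value as two separate
arguments.

```shell
#!/bin/sh
# Sketch only: persist a CTDB tunable in a sysconfig-style file and, if the
# ctdb tool is installed, also apply it to the running daemon.
#
# set_ctdb_tunable NAME VALUE [CONFIG_FILE]   (hypothetical helper)
# CONFIG_FILE defaults to /etc/sysconfig/ctdb, the path named in the thread.
set_ctdb_tunable() {
    name=$1
    value=$2
    conf=${3:-/etc/sysconfig/ctdb}
    key="CTDB_SET_${name}"

    touch "$conf" || return 1
    tmp=$(mktemp) || return 1

    # Drop any earlier setting of the same key, then append the new one,
    # so the file never ends up with two conflicting lines.
    grep -v "^${key}=" "$conf" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"

    # Write back through the existing file to keep its owner and mode.
    cat "$tmp" > "$conf" || { rm -f "$tmp"; return 1; }
    rm -f "$tmp"

    # Runtime change on this node; the file above only takes effect the
    # next time ctdb is started.
    if command -v ctdb >/dev/null 2>&1; then
        ctdb setvar "$name" "$value" ||
            echo "warning: ctdb setvar $name failed (is ctdbd running?)" >&2
    fi
    return 0
}
```

It would be run as root on each node, for example
set_ctdb_tunable EventScriptTimeout 90, and the result can be checked
afterwards with ctdb getvar EventScriptTimeout.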

The funny thing is that it worked in my test environment, using just one
CentOS 6.* machine as a client, but failed when I put it into
pre-production with a bunch of VMs. :-)

Thank you for your suggestion.

Regards,
M.