CTDB fail-over fails, NFS only, glusterfs backend, for XenServer VMs

Martin Gombac martin at isg.si
Fri Feb 3 02:03:58 MST 2012


Hi,

thank you Ronnie for this extensive reply, it is really helpful. Also 
thank you for helping bring ctdb to us. :-)

I will look into glusterfs first, because as you say, waiting 10+ 
minutes for the lock volume to become available again, after one node 
in the cluster is killed, is way too long.

Martin Gombač
ISG d.o.o.
00386 (0)1 620 75 03

On 2/2/12 9:26 PM, ronnie sahlberg wrote:
> Hi,
>
> 10 minutes must surely mean something is wrong?
> That is very, very long.
>
>
> If you need it, I can change the maximum allowed timeout from 100 to
> 1000 seconds, but even 100 seconds is very, very long for something
> to become unresponsive.
> If I make this change, it will be for the master branch and not one of
> the older branches, so you would have to recompile ctdbd from sources.
>
> A better option could probably be to
> 1, remove the reclock file completely from /etc/sysconfig/ctdb (a
> sketch of this follows after point 2 below). You can run ctdb without
> a reclock file at all if you have problems with locking on your
> filesystem. It means ctdb will no longer try to lock a file there and
> will thus no longer detect the reclock file problem and go unhealthy.
> On the other hand, it also means that you no longer have split-brain
> protection, so be careful ...
> 2, tweak the NFS detection in the /etc/ctdb/events.d/60.nfs script.
> The script that comes in master can be modified to use arbitrarily
> high values for when it will "detect nfs is dead and trigger
> UNHEALTHY/failover".
> Again, a drawback of setting it higher is that you increase the time
> it takes for ctdb to detect a problem and trigger a recovery by moving
> clients off the node and onto another node.
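>
> (A minimal sketch of point 1: with a stock install the reclock is
> configured via the CTDB_RECOVERY_LOCK variable in /etc/sysconfig/ctdb;
> the path shown here is only an illustration, yours will differ.)
>
>          # /etc/sysconfig/ctdb
>          # Comment out (or remove) the recovery lock setting so ctdb
>          # runs without a reclock file. Note: this disables the
>          # split-brain protection mentioned above.
>          #CTDB_RECOVERY_LOCK="/mnt/lockers/ctdb/reclock"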
>
>
>
>
> /etc/ctdb/events.d/60.nfs :
> ...
>          # check that NFS responds to rpc requests
>          if [ "$CTDB_NFS_SKIP_KNFSD_ALIVE_CHECK" != "yes" ] ; then
>              nfs_check_rpc_service "knfsd" \
>                  -ge 6 "verbose unhealthy" \
>                  -eq 4 "verbose restart" \
>                  -eq 2 "restart:bs"
>          fi
> ...
>
> The line with -ge 6 tells the script that if NFS has been seen dead
> for 6 monitor checks in a row, the node is made unhealthy.
> Monitoring happens between 1-10 seconds apart depending on node status,
> so it is hard to set it to "exactly xyz seconds",
> but you can experiment with setting it much higher.
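>
> For example, a sketch with purely illustrative values (keeping the
> same structure as the snippet above) that would tolerate a much longer
> NFS outage before the node goes unhealthy:
>
>          # check that NFS responds to rpc requests
>          if [ "$CTDB_NFS_SKIP_KNFSD_ALIVE_CHECK" != "yes" ] ; then
>              nfs_check_rpc_service "knfsd" \
>                  -ge 60 "verbose unhealthy" \
>                  -eq 40 "verbose restart" \
>                  -eq 20 "restart:bs"
>          fi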
>
>
>
> These two changes, removing the reclock file and experimenting with
> 60.nfs, should allow tweaking the system so that it will survive the
> node failure without marking the node unhealthy.
>
>
> But, 10 minutes does sound like something is wrong in the underlying system.
> I don't have experience with gluster myself, but I think you should
> first try to find out why your failover/recovery takes this long,
> and then try the two tweaks above later if nothing else can be done.
>
>
> regards
> ronnie sahlberg
>
>
> On Fri, Feb 3, 2012 at 2:26 AM, Martin Gombac <martin at isg.si> wrote:
>> Just an update.
>> The volume for locks took more than 10 minutes to become available again
>> using gluster as a back-end. This is why NFS was not responding and why
>> ctdb marked itself unhealthy.
>>
>> [root at s3 bricks]# time df -m
>>
>> Filesystem           1M-blocks      Used Available Use% Mounted on
>> s3.c.XX.XX:/lockers   1906819      7282   1899538   1% /mnt/lockers
>> real    11m58.676s
>>
>> I would also like to note that the maximum value that ctdb
>> EventScriptTimeout actually uses IS 100 seconds, even if you set it to
>> more, like 300 seconds. I wish it could be set higher, so we could use
>> the dodgy gluster fs. :-)
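>>
>> (A minimal sketch of how that tunable is set and read back with the
>> standard ctdb tool; 300 is just the example value from above, and per
>> the observation the effective timeout stays capped at 100 seconds.)
>>
>>     # ask for 300 seconds ...
>>     ctdb setvar EventScriptTimeout 300
>>     # ... and read the tunable back
>>     ctdb getvar EventScriptTimeout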
>>
>> Regards,
>>
>> Martin Gombač
>> ISG d.o.o.
>> 00386 (0)1 620 75 03
>>
>> On 2/2/12 2:36 PM, Martin Gombac wrote:
>>>
>>> Hi Ronnie,
>>>
>>> thank you for your explanation, it was exactly what was needed. GlusterFS
>>> takes some time to recover when I kill one node. I will try to increase
>>> the timeout values for ctdb and hopefully it will not mark itself as
>>> unhealthy.
>>>
>>> What scalable redundant FS would you suggest besides gluster? By
>>> scalable I mean it can extend over many network-attached nodes, and by
>>> redundant I mean it can survive the loss of a node or a spindle.
>>>
>>> Thank you.
>>> Martin
>>>
>>> On 2/2/12 10:26 AM, ronnie sahlberg wrote:
>>>>
>>>> 2012/02/01 15:09:01.290595 [ 1933]: rpcinfo: RPC: Timed out
>>>> 2012/02/01 15:09:01.290757 [ 1933]: ERROR: NFS not responding to rpc
>>>> requests
>>>> 2012/02/01 15:09:01.290968 [ 1933]: Node became UNHEALTHY. Ask
>>>> recovery master 0 to perform ip reallocation
>>>> 2012/02/01 15:09:06.309049 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>> 2012/02/01 15:09:16.326371 [ 1933]: ERROR: more than 3 consecutive
>>>> failures for 01.reclock, marking cluster unhealthy
>>>>
>>>> Looks like your underlying filesystem has died.
>>>> You took out one node and your filesystem has hung, causing knfsd to
>>>> hang inside the kernel.
>>>>
>>>> The hung knfsd caused the first two warning lines above.
>>>>
>>>> The last two lines of log are from ctdb itself and also indicate that
>>>> the filesystem used for the recovery lock file has hung.
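>>>>
>>>> (For illustration, assuming the standard rpcinfo tool is available:
>>>> the timed-out RPC check in the log above can be reproduced by hand
>>>> by probing the NFS service, program number 100003, on the node.)
>>>>
>>>>     # probe knfsd over UDP; if the shared filesystem has hung,
>>>>     # this will also time out
>>>>     rpcinfo -u localhost 100003 3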
>>>>
>>>>
>>>> Nothing ctdb can do here when your filesystem is dodgy.
>>>> Try a different filesystem. Other filesystems might work better.
>>>>
>>>>
>>>> regards
>>>> ronnie sahlberg
>>>>

