[PATCH] ctdb-recovery: Update timeout and number of retries during recovery

Martin Schwenke martin at meltin.net
Mon Jun 6 04:12:55 UTC 2016


On Fri, 3 Jun 2016 16:07:46 +1000, Martin Schwenke <martin at meltin.net>
wrote:

> On Fri, 3 Jun 2016 15:25:40 +1000, Amitay Isaacs <amitay at gmail.com>
> wrote:
> 
> > The RecoverTimeout tunable (default 120 seconds) is the timeout for
> > control messages sent during recovery.  If any node fails to respond to
> > a recovery control message within RecoverTimeout seconds, recovery of
> > that database fails.  The recovery helper attempts the recovery of a
> > database up to 5 times.
> > 
> > In the worst case, if a database could not be recovered within 5 attempts,
> > a total of 600 seconds would have passed.  During this time period other
> > timeouts will be triggered causing unnecessary failures as follows:
> > 
> > 1. During the recovery, even though recoverd is processing events,
> >    it does not send a ping message to the ctdb daemon.  If a ping
> >    message is not received for RecdPingTimeout (default 60) seconds,
> >    then the ctdb daemon counts the recovery daemon as unresponsive.
> >    Once this has happened RecdFailCount (default 10) times, that is,
> >    after 600 seconds, the ctdb daemon will restart the recovery
> >    daemon.
> > 
> > 2. If the ctdb daemon stays in recovery for RecoveryDropAllIPs
> >    (default 120) seconds, then it will drop all the public addresses.
> >    This will cause all SMB clients to be disconnected unnecessarily.
> >    The released public addresses will not be taken over until the
> >    recovery is complete.
> > 
> > To avoid dropping IPs and restarting the recovery daemon during a
> > delayed recovery, reduce RecoverTimeout to 30 seconds and limit the
> > number of retries for recovering a database to 3.  If we don't hear
> > from a node for more than 25 seconds, then the node is considered
> > disconnected, so 30 seconds is a sufficient timeout for controls
> > during recovery.
> > 
> > Please review and push.  
> 
> Reviewed-by: Martin Schwenke <martin at meltin.net>
> 
> Let's see what else there is to push...  :-)

... and pushed...
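
For anyone checking the numbers, here is a rough sketch of the arithmetic
in the patch description.  It is not CTDB code: the constant names just
mirror the tunables mentioned above, the helper function is hypothetical,
and it assumes a single database whose every attempt runs to the full
timeout, with "retries" read as the total number of attempts.

```python
# Illustrative arithmetic only; not part of CTDB.
# Values are the defaults quoted in the patch description.
RECD_PING_TIMEOUT = 60       # seconds without a ping before recoverd counts as unresponsive
RECD_FAIL_COUNT = 10         # unresponsive counts before ctdbd restarts recoverd
RECOVERY_DROP_ALL_IPS = 120  # seconds in recovery before public IPs are dropped


def worst_case_recovery_seconds(recover_timeout, attempts):
    """Hypothetical helper: every attempt for one database times out."""
    return recover_timeout * attempts


# Time after which ctdbd would restart the recovery daemon.
recd_restart_after = RECD_PING_TIMEOUT * RECD_FAIL_COUNT

old_worst_case = worst_case_recovery_seconds(120, 5)  # before the patch
new_worst_case = worst_case_recovery_seconds(30, 3)   # after the patch

for label, total in (("old", old_worst_case), ("new", new_worst_case)):
    # The >= comparisons are a simplification of how the daemons
    # actually track these timeouts.
    print(
        label,
        total,
        "recoverd restarted:", total >= recd_restart_after,
        "IPs dropped:", total >= RECOVERY_DROP_ALL_IPS,
    )
```

Under those assumptions the old settings reach both limits (600 seconds),
while the new ones stay below both (90 seconds).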

peace & happiness,
martin


