[PATCH] ctdb-recovery: Ban a node that causes recovery failure (bug 13670)

Martin Schwenke martin at meltin.net
Thu Nov 1 04:57:59 UTC 2018


... instead of applying banning credits.

There have been a couple of cases where each recovery attempt takes
just over 2 minutes to fail.  A node is banned only after 2
applications of banning credits, which generally means 2 recovery
failures.  When failures are that far apart, the credits from one
failure expire before the next is recorded, so a continuously
problematic node is never banned and recovery repeats endlessly.
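
To illustrate the timing problem, here is a small toy model in Python.
It is only a sketch: the 120 second expiry is an assumption inferred
from the "just over 2 minutes" observation, the function and constant
names are made up for illustration, and the real credit accounting in
ctdb may differ in detail.

```python
# Simplified model; names and constants are illustrative, not ctdb's.
CREDIT_EXPIRY_SECS = 120   # assumed: credits lapse after 2 minutes
BAN_THRESHOLD = 2          # from the text: 2 applications trigger a ban


def is_ever_banned(failure_times):
    """Return True if the node gets banned for the given failure timestamps."""
    credits = 0
    last_failure = None
    for now in failure_times:
        # Credits from an earlier failure are forgotten once they expire.
        if last_failure is not None and now - last_failure > CREDIT_EXPIRY_SECS:
            credits = 0
        credits += 1
        last_failure = now
        if credits >= BAN_THRESHOLD:
            return True
    return False


# Failures 125 seconds apart: every credit expires before the next arrives.
slow_failures = [125 * i for i in range(1, 11)]
# Failures 90 seconds apart: the second failure lands inside the window.
fast_failures = [90, 180]

print(is_ever_banned(slow_failures))  # False
print(is_ever_banned(fast_failures))  # True
```

In this model ten consecutive slow failures never accumulate more
than one credit, which matches the endless recoveries described above.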

The recovery helper makes up to 3 attempts to recover each database
during a single run.  If a node causes 3 failures then this is really
equivalent to 3 recovery failures in the model that existed before the
recovery helper added retries.  In that case the node would have been
banned after 2 failures.

So, instead of applying banning credits to the "most failing" node,
simply ban it directly from the recovery helper.
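
The idea can be sketched roughly as follows.  This is Python rather
than the helper's actual C code, and try_once, recover_database and
node_to_ban are hypothetical names used only for this illustration.
The limit of 3 attempts comes from the description above; how ties
between equally failing nodes are broken is not specified here.

```python
from collections import Counter

MAX_ATTEMPTS = 3  # from the text: up to 3 attempts per database


def recover_database(try_once):
    """Run try_once() up to MAX_ATTEMPTS times.

    try_once is a hypothetical callable that returns None on success or
    the number of the node that caused the failure.
    Returns (succeeded, per-node failure counts).
    """
    failures = Counter()
    for _ in range(MAX_ATTEMPTS):
        culprit = try_once()
        if culprit is None:
            return True, failures
        failures[culprit] += 1
    return False, failures


def node_to_ban(failures):
    """Pick the "most failing" node, or None if there were no failures."""
    if not failures:
        return None
    return failures.most_common(1)[0][0]


# Example: node 2 causes two of the three failed attempts.
attempts = iter([2, 1, 2])
ok, failures = recover_database(lambda: next(attempts))
banned = None if ok else node_to_ban(failures)
print(ok, banned)  # False 2
```

The point of the sketch is that the ban decision is made once, at the
end of a failed run, with no state that has to survive until the next
recovery.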

If multiple nodes are causing recovery failures then this can cause a
node to be banned more quickly than it might otherwise have been, even
pre-recovery-helper.  However, 90 seconds (i.e. 3 failures) is a long
time to be in recovery, so banning earlier seems like the best
approach.

Please review and maybe push...

peace & happiness,
martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ctdb-recovery-Ban-a-node-that-causes-recovery-failur.patch
Type: text/x-patch
Size: 3755 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20181101/cf956f9c/0001-ctdb-recovery-Ban-a-node-that-causes-recovery-failur.bin>

