[PATCH] ctdb-recovery: Ban a node that causes recovery failure (bug 13670)
amitay at gmail.com
Sun Nov 4 23:55:07 UTC 2018
On Thu, Nov 1, 2018 at 3:58 PM Martin Schwenke via samba-technical
<samba-technical at lists.samba.org> wrote:
> ... instead of applying banning credits.
> There have been a couple of cases where recovery repeatedly takes just
> over 2 minutes to fail. Therefore, banning credits expire between
> failures and a continuously problematic node is never banned,
> resulting in endless recoveries. This is because it takes 2
> applications of banning credits before a node is banned, which
> generally involves 2 recovery failures.
> The recovery helper makes up to 3 attempts to recover each database
> during a single run. If a node causes 3 failures then this is really
> equivalent to 3 recovery failures in the model that existed before the
> recovery helper added retries. In that case the node would have been
> banned after 2 failures.
> So, instead of applying banning credits to the "most failing" node,
> simply ban it directly from the recovery helper.
> If multiple nodes are causing recovery failures then this can cause a
> node to be banned more quickly than it might otherwise have been, even
> pre-recovery-helper. However, 90 seconds (i.e. 3 failures) is a long
> time to be in recovery, so banning earlier seems like the best
> Please review and maybe push...
Pushed to autobuild.
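
For anyone reading this in the archive later, the timing argument in the
commit message can be sketched as a toy model. This is not the CTDB code:
the struct, the function names and the 120 second credit lifetime are all
hypothetical, chosen only so that a failure interval of "just over 2
minutes" lets the credits expire. The "2 applications before a ban" rule
is taken from the description above. The last function illustrates the
direct-ban idea by picking the "most failing" node from per-node failure
counts.

```c
#include <stdbool.h>

/* Hypothetical values for illustration; not taken from the CTDB sources. */
#define CREDIT_LIFETIME_SECS 120 /* assumed: credits expire after this long */
#define APPLICATIONS_TO_BAN  2   /* per the description: 2 applications */

/* Toy per-node state (hypothetical, not a CTDB structure). */
struct node_state {
    int applications;  /* unexpired applications of banning credits */
    long last_applied; /* time of the last application, in seconds */
    bool banned;
};

/* Apply banning credits at time `now`, dropping older credits if expired. */
static void apply_credits(struct node_state *n, long now)
{
    if (n->applications > 0 &&
        now - n->last_applied > CREDIT_LIFETIME_SECS) {
        n->applications = 0;
    }
    n->applications++;
    n->last_applied = now;
    if (n->applications >= APPLICATIONS_TO_BAN) {
        n->banned = true;
    }
}

/*
 * Simulate recovery failures spaced `failure_interval` seconds apart.
 * Returns the number of failures needed to ban the node, or -1 if the
 * node is still not banned after `max_failures` failures.
 */
static int failures_until_ban(long failure_interval, int max_failures)
{
    struct node_state n = { 0, 0, false };

    for (int i = 1; i <= max_failures; i++) {
        apply_credits(&n, (long)i * failure_interval);
        if (n.banned) {
            return i;
        }
    }
    return -1;
}

/*
 * Direct-ban alternative: return the index of the node with the most
 * failures (lowest index wins ties), or -1 if no node has failed.
 */
static int most_failing_node(const int *failures, int num_nodes)
{
    int worst = -1;
    int worst_count = 0;

    for (int i = 0; i < num_nodes; i++) {
        if (failures[i] > worst_count) {
            worst = i;
            worst_count = failures[i];
        }
    }
    return worst;
}
```

In this model, failures 60 seconds apart ban the node on the second
failure, while failures 125 seconds apart never ban it, which is the
endless-recovery case described above. Banning the node returned by
most_failing_node() directly does not depend on the interval at all.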