[PATCH] ctdb-recovery: Ban a node that causes recovery failure (bug 13670)

Amitay Isaacs amitay at gmail.com
Sun Nov 4 23:55:07 UTC 2018


On Thu, Nov 1, 2018 at 3:58 PM Martin Schwenke via samba-technical
<samba-technical at lists.samba.org> wrote:
>
> ... instead of applying banning credits.
>
> There have been a couple of cases where recovery repeatedly takes just
> over 2 minutes to fail.  Therefore, banning credits expire between
> failures and a continuously problematic node is never banned,
> resulting in endless recoveries.  This is because it takes 2
> applications of banning credits before a node is banned, which
> generally involves 2 recovery failures.
>
> The recovery helper makes up to 3 attempts to recover each database
> during a single run.  If a node causes 3 failures then this is really
> equivalent to 3 recovery failures in the model that existed before the
> recovery helper added retries.  In that case the node would have been
> banned after 2 failures.
>
> So, instead of applying banning credits to the "most failing" node,
> simply ban it directly from the recovery helper.
>
> If multiple nodes are causing recovery failures then this can cause a
> node to be banned more quickly than it might otherwise have been, even
> pre-recovery-helper.  However, 90 seconds (i.e. 3 failures) is a long
> time to be in recovery, so banning earlier seems like the best
> approach.
>
> Please review and maybe push...
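
For anyone following the reasoning above, here is a rough sketch of why
the credits never add up when each recovery run takes just over 2
minutes to fail. It is only a toy model, not CTDB code: the function
names are made up, and the 120-second expiry window is an assumption
inferred from the "just over 2 minutes" remark in the description. The
2-application threshold is the one stated in the description.

```python
# Toy model only. Names and the 120-second window are assumptions,
# not taken from the CTDB source.
CREDIT_EXPIRY_SECS = 120   # assumed: credits lapse after this long without a new failure
APPLICATIONS_TO_BAN = 2    # from the description: 2 applications before a ban


def banned_with_credits(failure_times):
    """Return the time at which the node is banned, or None if it never is."""
    applications = 0
    last_time = None
    for t in failure_times:
        # Earlier credits are forgotten once the expiry window has passed.
        if last_time is not None and t - last_time > CREDIT_EXPIRY_SECS:
            applications = 0
        applications += 1
        last_time = t
        if applications >= APPLICATIONS_TO_BAN:
            return t
    return None


def banned_directly(failure_times):
    """Ban on the first failed recovery run, as the patch proposes."""
    return failure_times[0] if failure_times else None


# Recovery runs failing every 125 seconds ("just over 2 minutes").
slow_failures = [125 * i for i in range(1, 6)]
# Recovery runs failing every 90 seconds, inside the assumed window.
fast_failures = [90 * i for i in range(1, 6)]

print(banned_with_credits(slow_failures))   # credits expire each time
print(banned_with_credits(fast_failures))   # second failure triggers the ban
print(banned_directly(slow_failures))       # banned after the first failed run
```

With the slow series the credit count is reset before every failure, so
the node is never banned; banning directly from the helper does not
depend on the timing at all.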

Pushed to autobuild.

Amitay.


