[PATCH] ctdb: try to fix ctdb endless banning loop

Amitay Isaacs amitay at gmail.com
Wed Jun 1 01:05:20 UTC 2016


Hi Michael,

On Wed, Jun 1, 2016 at 9:39 AM, Michael Adam <obnox at samba.org> wrote:

> Hi,
>
> We are experiencing indefinite banning of nodes in ctdb.
> This is the pattern:
>
> When an inter-node NIC is brought down on a non-recmaster node,
> the node goes to banned state. But since 4.4, this node never
> comes back in our tests. The reason is that the db's don't
> get frozen.
>

Can you provide the logs when this is happening?  If the databases are not
getting frozen, then there is something else going wrong.  Once the
controls are sent to freeze the databases, you don't need to re-send the
freeze controls.

Since you are breaking the inter-node connectivity, the recmaster cannot
tell the node to go into recovery and freeze the databases.  That's the
real problem.  Hmm, looks like we need to add freezing of databases back
into the banning code.

The main reason for removing the freeze from banning was a very subtle
interaction between recovery and banning.  I am going to clean up the
freeze code to remove database priorities.  That should simplify re-adding
the freeze to the banning code.


> Attached find my first attempt to fix this. See the commit
> message for further explanations and analysis.
>
> I still need to test this more, but wanted to share the patch
> early to get feedback.
>
> Comments/review/push appreciated...
>

This is definitely wrong.  The function ctdb_db_all_frozen() should only be
called from the ctdb daemon and not from the recovery daemon.  The database
frozen state is only stored in the ctdb daemon.
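
To illustrate the point about where the frozen state lives, here is a
minimal sketch in C.  It is a toy model and not ctdb code: every name in it
(toy_daemon, toy_recoverd_dbs_frozen, the TOY_CONTROL_* values) is
hypothetical and made up for this example.  The only thing it is meant to
show is that the process owning the state can check it directly, while any
other process has to ask the owner through a request/reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in ctdb. */

enum toy_freeze_mode { TOY_DB_NOT_FROZEN, TOY_DB_FROZEN };

struct toy_db {
	const char *name;
	enum toy_freeze_mode mode;
};

/* Freeze state is owned by the main daemon and stored nowhere else. */
struct toy_daemon {
	struct toy_db *dbs;
	size_t num_dbs;
};

/* Daemon-side check: only meaningful in the process that owns the state. */
static bool toy_daemon_all_frozen(const struct toy_daemon *d)
{
	assert(d != NULL);
	for (size_t i = 0; i < d->num_dbs; i++) {
		if (d->dbs[i].mode != TOY_DB_FROZEN) {
			return false;
		}
	}
	return true;
}

/* Stand-in for a control message handled inside the daemon. */
enum toy_control { TOY_CONTROL_FREEZE_ALL, TOY_CONTROL_GET_ALL_FROZEN };

static int toy_daemon_handle_control(struct toy_daemon *d, enum toy_control op)
{
	assert(d != NULL);
	switch (op) {
	case TOY_CONTROL_FREEZE_ALL:
		for (size_t i = 0; i < d->num_dbs; i++) {
			d->dbs[i].mode = TOY_DB_FROZEN;
		}
		return 0;
	case TOY_CONTROL_GET_ALL_FROZEN:
		return toy_daemon_all_frozen(d) ? 1 : 0;
	}
	return -1;
}

/*
 * Recovery-daemon side: it keeps no copy of the freeze state, so it asks
 * the daemon instead of calling the daemon-side check on its own data.
 */
static bool toy_recoverd_dbs_frozen(struct toy_daemon *peer)
{
	return toy_daemon_handle_control(peer, TOY_CONTROL_GET_ALL_FROZEN) == 1;
}
```

In this sketch the "control" is a plain function call; in a real
two-process setup it would be a message sent to the daemon, and the reply
could fail or time out, which the toy model does not cover.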

Amitay.

