[PATCH] ctdb: try to fix ctdb endless banning loop

Michael Adam obnox at samba.org
Wed Jun 1 02:03:41 UTC 2016


On 2016-06-01 at 11:05 +1000, Amitay Isaacs wrote:
> Hi Michael,
> 
> On Wed, Jun 1, 2016 at 9:39 AM, Michael Adam <obnox at samba.org> wrote:
> 
> > Hi,
> >
> > We are experiencing indefinite banning of nodes in ctdb.
> > This is the pattern:
> >
> > When an inter-node NIC is brought down on a non-recmaster node,
> > the node goes to banned state. But since 4.4, this node never
> > comes back in our tests. The reason is that the DBs don't
> > get frozen.
> >
> 
> Can you provide the logs when this is happening?  If the databases are not
> getting frozen, then there is something else going wrong.  Once the
> controls are sent to freeze the databases, you don't need to re-send the
> freeze controls.

In our case, the DBs were not frozen since the node was already
in RECOVERY_ACTIVE state (by election code) when getting banned.
Will try to come up with logs.

> Since you are breaking the inter-node connectivity, recmaster cannot tell
> the node to go into recovery and freeze the databases.  That's the real
> problem.  Hmm, looks like we need to add freezing of databases back in the
> banning code.
> 
> The main reason for removing the freeze from banning was due to very
> subtle interaction between recovery and banning.  I am going to clean the
> freeze code to remove database priorities.  That should simplify re-adding
> freeze in the banning code.
> 
> 
> > Attached find my first attempt to fix this. See the commit
> > message for further explanations and analysis.
> >
> > I still need to test this more, but wanted to share the patch
> > early to get feed-back.
> >
> > Comments/review/push appreciated...
> >
> 
> This is definitely wrong.  The function ctdb_db_all_frozen() should only be
> called from ctdb daemon and not from recovery daemon.  The database frozen
> state is only stored in ctdb daemon.

Right ... thanks!
While we're discussing a better patch,
I'll run tests with the attached version,
which just sends the freeze unconditionally
and omits the check that would not work.

Cheers - Michael

-------------- next part --------------
From da4e8d4f49e6a5177dcc26dc9107861ef5c3c564 Mon Sep 17 00:00:00 2001
From: Michael Adam <obnox at samba.org>
Date: Wed, 1 Jun 2016 01:19:43 +0200
Subject: [PATCH] ctdb:recoverd: fix endless banning due to non-frozen DBs.

When the banned node got marked RECOVERY_ACTIVE, but
freezing the DBs failed (e.g. if banning happened
while recovery was set to active but the DBs were not
frozen), then the freezing will never be tried again,
and the node will keep banning itself indefinitely,
until ctdbd is restarted.

This is a regression from 4.3, introduced with

b4357a79d916b1f8ade8fa78563fbef0ce670aa9

and

d8f3b490bbb691c9916eed0df5b980c1aef23c85

This change lets the main loop in the banned case keep
trying to freeze the DBs, hence avoiding the endless loop.
Note that we currently have no means to tell in the
recovery daemon whether the DBs are frozen, so we
send the freeze control each time.

Signed-off-by: Michael Adam <obnox at samba.org>
---
 ctdb/server/ctdb_recoverd.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/ctdb/server/ctdb_recoverd.c b/ctdb/server/ctdb_recoverd.c
index 09940dc..001d32e 100644
--- a/ctdb/server/ctdb_recoverd.c
+++ b/ctdb/server/ctdb_recoverd.c
@@ -3542,7 +3542,9 @@ static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec,
 			DEBUG(DEBUG_ERR,(__location__ " Failed to read recmode from local node\n"));
 		}
 		if (ctdb->recovery_mode == CTDB_RECOVERY_NORMAL) {
-			DEBUG(DEBUG_ERR,("Node is stopped or banned but recovery mode is not active. Activate recovery mode and lock databases\n"));
+			DEBUG(DEBUG_ERR, ("Node is stopped or banned but "
+			      "recovery mode is not active. "
+			      "Activate recovery.\n"));
 
 			ret = ctdb_ctrl_setrecmode(ctdb, CONTROL_TIMEOUT(), CTDB_CURRENT_NODE, CTDB_RECOVERY_ACTIVE);
 			if (ret != 0) {
@@ -3550,11 +3552,20 @@ static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec,
 
 				return;
 			}
-			ret = ctdb_ctrl_freeze(ctdb, CONTROL_TIMEOUT(), CTDB_CURRENT_NODE);
-			if (ret != 0) {
-				DEBUG(DEBUG_ERR,(__location__ " Failed to freeze node in STOPPED or BANNED state\n"));
-				return;
-			}
+		}
+
+		/*
+		 * Make sure that the databases get frozen or we will
+		 * never come out of banning!
+		 * We currently have no way of telling whether freezing
+		 * has completed here in the recovery daemon, so we just
+		 * send the freeze out unconditionally. A banned node
+		 * does not have anything useful to do anyways...
+		 */
+		ret = ctdb_ctrl_freeze(ctdb, CONTROL_TIMEOUT(), CTDB_CURRENT_NODE);
+		if (ret != 0) {
+			DEBUG(DEBUG_ERR,(__location__ " Failed to freeze node in STOPPED or BANNED state\n"));
+			return;
 		}
 
 		/* If this node is stopped or banned then it is not the recovery
-- 
2.5.5
