[PATCH] ctdb: try to fix ctdb endless banning loop

Michael Adam obnox at samba.org
Tue May 31 23:39:36 UTC 2016


Hi,

We are experiencing indefinite banning of nodes in ctdb.
This is the pattern:

When an inter-node NIC is brought down on a non-recmaster node,
the node goes into the banned state. But since 4.4, this node
never comes back in our tests. The reason is that the DBs don't
get frozen.

Please find attached my first attempt to fix this. See the
commit message for further explanation and analysis.
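
To illustrate the failure mode and the idea behind the fix, here is a
small stand-alone model in C. It is only a sketch and not the ctdb code:
the struct, the enum and all function names are hypothetical, and the
"freeze" is simulated by a counter of attempts that still fail. The only
thing taken from the patch is the control flow: the old loop attempts the
freeze only in the pass that switches recovery mode to active, while the
patched loop retries as long as not all databases are frozen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for daemon state; not the real ctdb types. */
enum recmode { RECOVERY_NORMAL, RECOVERY_ACTIVE };

struct node_state {
	enum recmode recmode;
	bool all_frozen;
	int freeze_failures_left; /* freeze attempts that will still fail */
};

/* Simulated freeze: fails while freeze_failures_left > 0. */
int try_freeze(struct node_state *n)
{
	if (n->freeze_failures_left > 0) {
		n->freeze_failures_left--;
		return -1;
	}
	n->all_frozen = true;
	return 0;
}

/* Old behaviour: freeze is tied to the NORMAL -> ACTIVE transition. */
void banned_pass_old(struct node_state *n)
{
	if (n->recmode == RECOVERY_NORMAL) {
		n->recmode = RECOVERY_ACTIVE;
		(void)try_freeze(n); /* a failure here is never retried */
	}
}

/* Patched behaviour: retry the freeze whenever it is incomplete. */
void banned_pass_new(struct node_state *n)
{
	if (n->recmode == RECOVERY_NORMAL) {
		n->recmode = RECOVERY_ACTIVE;
	}
	if (!n->all_frozen) {
		(void)try_freeze(n);
	}
}

/* Run `passes` loop iterations and report whether the DBs ended up frozen. */
bool frozen_after(void (*pass)(struct node_state *), int passes, int failures)
{
	struct node_state n = { RECOVERY_NORMAL, false, failures };

	assert(pass != NULL);
	for (int i = 0; i < passes; i++) {
		pass(&n);
	}
	return n.all_frozen;
}
```

With a single failed freeze attempt, frozen_after(banned_pass_old, 10, 1)
stays false no matter how many passes are run, whereas
frozen_after(banned_pass_new, 10, 1) becomes true on the second pass.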

I still need to test this more, but wanted to share the patch
early to get feedback.

Comments/review/push appreciated...

Thanks - Michael
-------------- next part --------------
From 53c6965165ad155b1d365c7686b81be8096c0252 Mon Sep 17 00:00:00 2001
From: Michael Adam <obnox at samba.org>
Date: Wed, 1 Jun 2016 01:19:43 +0200
Subject: [PATCH] ctdb:recoverd: fix endless banning due to non-frozen DBs.

When the banned node got marked RECOVERY_ACTIVE, but
freezing the DBs failed (e.g. if banning happened
while recovery was set to active but the DBs were not
yet frozen), then the freeze is never tried again, and
the node keeps banning itself indefinitely, until
ctdbd is restarted.

This is a regression from 4.3, introduced with

b4357a79d916b1f8ade8fa78563fbef0ce670aa9

and

d8f3b490bbb691c9916eed0df5b980c1aef23c85

This change lets the main loop in the banned case keep
trying to freeze the dbs if they are not frozen, hence
avoiding the endless loop.

Signed-off-by: Michael Adam <obnox at samba.org>
---
 ctdb/server/ctdb_recoverd.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/ctdb/server/ctdb_recoverd.c b/ctdb/server/ctdb_recoverd.c
index 09940dc..6bdffab 100644
--- a/ctdb/server/ctdb_recoverd.c
+++ b/ctdb/server/ctdb_recoverd.c
@@ -3542,7 +3542,9 @@ static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec,
 			DEBUG(DEBUG_ERR,(__location__ " Failed to read recmode from local node\n"));
 		}
 		if (ctdb->recovery_mode == CTDB_RECOVERY_NORMAL) {
-			DEBUG(DEBUG_ERR,("Node is stopped or banned but recovery mode is not active. Activate recovery mode and lock databases\n"));
+			DEBUG(DEBUG_ERR, ("Node is stopped or banned but "
+			      "recovery mode is not active. "
+			      "Activate recovery.\n"));
 
 			ret = ctdb_ctrl_setrecmode(ctdb, CONTROL_TIMEOUT(), CTDB_CURRENT_NODE, CTDB_RECOVERY_ACTIVE);
 			if (ret != 0) {
@@ -3550,6 +3552,15 @@ static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec,
 
 				return;
 			}
+		}
+
+		/*
+		 * Make sure to re-try freezing if we could not complete
+		 * it last time, or we will never come out of banning!
+		 */
+		if (!ctdb_db_all_frozen(ctdb)) {
+			DEBUG(DEBUG_ERR, ("Node is stopped or banned but not "
+			      "all databases are frozen. Freeze databases.\n"));
 			ret = ctdb_ctrl_freeze(ctdb, CONTROL_TIMEOUT(), CTDB_CURRENT_NODE);
 			if (ret != 0) {
 				DEBUG(DEBUG_ERR,(__location__ " Failed to freeze node in STOPPED or BANNED state\n"));
-- 
2.5.5
