[SCM] Samba Shared Repository - branch master updated

Mon Nov 16 10:48:03 UTC 2015

The branch, master has been updated
       via  44bf7c2 ctdb-recoverd: Factor out recovery master validation
       via  e44957f ctdb-recmaster: Update capabilities before calling first election
       via  c5e50a4 ctdb-recoverd: Move VNN map retrieval to where it is needed
       via  d1f996a ctdb-recoverd: Drop explicit check for recovery lock
       via  1499f3e ctdb-recoverd: Simplify using TALLOC_FREE()
       via  050e64b ctdb-recoverd: Clarify that recmaster is being set on the current node
       via  0833e47 ctdb-recoverd: Do not sanity check recovery master with local daemon
       via  d8decd0 ctdb-recoverd: Don't retrieve recovery master from local daemon
       via  e90cab7 ctdb-recoverd: Explicitly set initial recovery master to unknown
       via  018077f ctdb-recoverd: Do not set recovery master during recovery
       via  4b37cc7 ctdb-recoverd: Have recovery daemon remember election result
       via  6f88375 ctdb-recoverd: Clarify recovery master validation logic
       via  74fa62c WHATSNEW: Document CTDB tunable change
       via  9166c30 ctdb-daemon: Rename EventScriptTimeoutCount to MonitorTimeoutCount
       via  55ad4d8 ctdb-daemon: Move script timeout count into monitor state
       via  0d5db1c ctdb-daemon: Reset script timeout count in monitor code
       via  a33b50d ctdb-daemon: Do not bother printing script timeout count
       via  134ede8 ctdb-doc: Correct documentation for tunables for script timeout
       via  bda2d9d ctdb-ipalloc: Don't consider runstates in the IP takeover code
       via  9ea3188 ctdb-ipalloc: Check for available IPs, not runstate, in takeover run
       via  66574c9 ctdb-ipalloc: A VNN can only host IPs if node is in RUNNING runstate
      from  6b4a961 ctdb-build: Remove ctdb-common-util subsystem

https://git.samba.org/?p=samba.git;a=shortlog;h=master


- Log -----------------------------------------------------------------
commit 44bf7c2a12e0bd806552f8cbe0b013a5d07e0218
Author: Martin Schwenke <martin at meltin.net>
Date:   Tue Oct 27 16:43:07 2015 +1100

    ctdb-recoverd: Factor out recovery master validation
    
    Starting to untangle cluster management, database recovery and public
    IP allocation.  This is a non-trivial subset of the cluster management
    code that runs in the recovery daemon on all nodes.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    
    Autobuild-User(master): Amitay Isaacs <amitay at samba.org>
    Autobuild-Date(master): Mon Nov 16 11:47:45 CET 2015 on sn-devel-104

commit e44957fc8ba67e90f5ded77c6e9ae59ad59b42a0
Author: Martin Schwenke <martin at meltin.net>
Date:   Tue Oct 27 15:09:33 2015 +1100

    ctdb-recmaster: Update capabilities before calling first election
    
    Capabilities are used when computing an election result so having them
    up-to-date seems like a good idea.
    
    Also update several instances of an ambiguous comment.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit c5e50a474b6ddac05cab0116da42c62a615c6c79
Author: Martin Schwenke <martin at meltin.net>
Date:   Tue Oct 27 14:35:09 2015 +1100

    ctdb-recoverd: Move VNN map retrieval to where it is needed
    
    The VNN map is only needed on the recovery master, so no need for all
    recovery daemons to retrieve it.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit d1f996a50f1da36e7046fe28cdb97285e7cf00a5
Author: Martin Schwenke <martin at meltin.net>
Date:   Tue Oct 27 14:32:48 2015 +1100

    ctdb-recoverd: Drop explicit check for recovery lock
    
    This is already handled in update_recovery_lock(), which is called
    immediately before.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 1499f3e301db856d10e481adf5be9b01999a08e4
Author: Martin Schwenke <martin at meltin.net>
Date:   Fri Oct 23 16:00:55 2015 +1100

    ctdb-recoverd: Simplify using TALLOC_FREE()
    
    The only non-obvious part here is dropping the setting of the nodemap
    local variable to NULL.  If the following control succeeds then it is
    set, otherwise return and it doesn't matter.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 050e64b6478fb17e3c69ba24cf68f630da9f691a
Author: Martin Schwenke <martin at meltin.net>
Date:   Fri Oct 23 15:27:12 2015 +1100

    ctdb-recoverd: Clarify that recmaster is being set on the current node
    
    That is, using CTDB_CURRENT_NODE makes this more obvious.
    
    Also fix incorrect error messages.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 0833e478c36761b546be5f9ceadac81573825b22
Author: Martin Schwenke <martin at meltin.net>
Date:   Fri Oct 23 15:05:08 2015 +1100

    ctdb-recoverd: Do not sanity check recovery master with local daemon
    
    Each recovery daemon knows who the recmaster is and is in sync with
    its local daemon.  The recovery master is running this check so do not
    bother checking with its local daemon - both agree that it is the
    recovery master.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit d8decd0b1d6446ab079b41e4b8d99e88b34c6e6d
Author: Martin Schwenke <martin at meltin.net>
Date:   Fri Oct 23 15:33:01 2015 +1100

    ctdb-recoverd: Don't retrieve recovery master from local daemon
    
    The recovery daemon already knows which node is the master.  This
    relies on rec->recmaster being correctly initialised and correctly set
    during elections.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit e90cab70739cc7c39600c065389f7bdad6038031
Author: Martin Schwenke <martin at meltin.net>
Date:   Tue Nov 10 13:54:47 2015 +1100

    ctdb-recoverd: Explicitly set initial recovery master to unknown
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 018077f3b024b4d2862a8c1ae6758b88b7889b3e
Author: Martin Schwenke <martin at meltin.net>
Date:   Thu Oct 22 21:54:58 2015 +1100

    ctdb-recoverd: Do not set recovery master during recovery
    
    Recovery should not do cluster management functions.  Setting the
    recovery master should only be done via an election.
    
    Main loop will determine if recovery master is inconsistent across the
    cluster and force an election if necessary.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 4b37cc7cf659bec33c8a4c1bb9d5b3eaccc7c9e5
Author: Martin Schwenke <martin at meltin.net>
Date:   Fri Oct 23 14:32:41 2015 +1100

    ctdb-recoverd: Have recovery daemon remember election result
    
    The recovery daemon pushes knowledge of recovery master election
    progress/result to local daemon.  It then retrieves that information
    again.
    
    Instead, have the recovery daemon reliably track election
    progress/result in rec->recmaster so it doesn't need to be retrieved.
    Be careful to maintain consistency by only doing this when the local
    daemon has been updated.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 6f8837528f43519a98d82aa85424c1e145dc98ee
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 21 16:19:00 2015 +1100

    ctdb-recoverd: Clarify recovery master validation logic
    
    There can be no holes in the nodemap.  Even if a node has been deleted
    it will take a slot in the nodemap.  The only exception is that the
    nodemap shrinks if nodes are deleted from the end.  That should never
    include the master because a node should be shutdown before being
    deleted, and an election should already have take place.
    
    To avoid walking off the end of the nodemap nodes array just confirm
    that the master node's PNN is a valid index into the array.  No need
    to walk through the nodemap.
    
    After this, in this section of the code j is now invalid.  So use the
    master's PNN to index into the nodemap.  This is safe.
    
    In the process, clean up some log messages to avoid saying "Force
    reelection".  It's just an "election".
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 74fa62c685069425e6e15783a0eb290e0778ce65
Author: Martin Schwenke <martin at meltin.net>
Date:   Fri Nov 13 15:23:14 2015 +1100

    WHATSNEW: Document CTDB tunable change
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 9166c30a41997b8d94222abe551053dacd2ca0fa
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 17:03:01 2015 +1100

    ctdb-daemon: Rename EventScriptTimeoutCount to MonitorTimeoutCount
    
    This only applies to monitor events so renaming clarifies this.
    
    Note that this change is not backward compatible.  Users with
    
      CTDB_SET_EventScriptTimeoutCount=<n>
    
    in their configuration will get failures when starting CTDB but the
    cause will be clearly logged.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 55ad4d80d491af5964429c5a6ac65ac515309060
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 16:51:59 2015 +1100

    ctdb-daemon: Move script timeout count into monitor state
    
    It is only used by the monitoring code.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 0d5db1c0072cf56904ab34aded73a13dc78a4153
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 16:42:41 2015 +1100

    ctdb-daemon: Reset script timeout count in monitor code
    
    This is the only place it is used.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit a33b50d07f10b3541dc6f6c8148416bb95871f8c
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 16:39:18 2015 +1100

    ctdb-daemon: Do not bother printing script timeout count
    
    It is only updated for monitor events, so it is meaningless here.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 134ede80e8165c4a1fe8a660c56bbd0e751d7565
Author: Martin Schwenke <martin at meltin.net>
Date:   Tue Oct 27 15:18:25 2015 +1100

    ctdb-doc: Correct documentation for tunables for script timeout
    
    * The defaults for EventScriptTimeout and EventScriptTimeoutCount are
      wrong.
    
    * EventScriptTimeout is the total time for all enabled scripts that
      are run for an event, not a single event script.
    
    * EventScriptTimeoutCount only applies to monitor events.
    
    * EventScriptUnhealthyOnTimeout is obsolete, so remove it.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit bda2d9de2b188a3f3804ed198e9b59ffec5ef066
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 22:15:00 2015 +1100

    ctdb-ipalloc: Don't consider runstates in the IP takeover code
    
    Checking runstates is unnecessary now that nodes that are not RUNNING
    will return no available IP addresses.  I have no idea why I didn't do
    it this way originally.
    
    Tweak the test code to cope with this.
    
    Note that this is a backward-incompatible change.  If new and old
    versions of CTDB are running together in a cluster and a new node
    takes over as recovery master then old nodes will be able to host
    public IP addresses before they are in RUNNING runstate.  This is
    mitigated by the bias towards recovery master stability in elections.
    If it is important that nodes do not host IPs until they are RUNNING
    then do not restart nodes running the old version.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 9ea318847a3ec2a18ce897833b8a5fcf6d8c4205
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 22:11:33 2015 +1100

    ctdb-ipalloc: Check for available IPs, not runstate, in takeover run
    
    The available IPs list is now only non-empty for nodes that are in
    RUNNING runstate.  So, to avoid running the IP allocation algorithm
    when there are no available available IPs, explicitly check for
    available IPs rather than checking runstates.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 66574c9fda26ee72f04426b2f7c96fe617332620
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 20:41:08 2015 +1100

    ctdb-ipalloc: A VNN can only host IPs if node is in RUNNING runstate
    
    This will allow wonderful simplification (i.e. removal) of some of the
    runstate checking in the takeover run code.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

-----------------------------------------------------------------------

Summary of changes:
 WHATSNEW.txt                         |  11 ++
 ctdb/doc/ctdb-tunables.7.xml         |  28 ++--
 ctdb/doc/ctdb.1.xml                  |   2 +-
 ctdb/include/ctdb_private.h          |   1 -
 ctdb/protocol/protocol.h             |   2 +-
 ctdb/server/ctdb_monitor.c           |  13 +-
 ctdb/server/ctdb_recoverd.c          | 297 +++++++++++++++--------------------
 ctdb/server/ctdb_takeover.c          | 131 ++-------------
 ctdb/server/ctdb_tunables.c          |   2 +-
 ctdb/server/eventscript.c            |   5 +-
 ctdb/tests/src/ctdb_takeover_tests.c |  54 ++++---
 11 files changed, 212 insertions(+), 334 deletions(-)


Changeset truncated at 500 lines:

diff --git a/WHATSNEW.txt b/WHATSNEW.txt
index f1a1c07..53f7860 100644
--- a/WHATSNEW.txt
+++ b/WHATSNEW.txt
@@ -41,6 +41,17 @@ smb.conf changes
   --------------		-----------		-------
   aio max threads               New                     100
 
+CTDB changes
+------------
+
+* The CTDB tunable parameter EventScriptTimeoutCount has been renamed
+  to MonitorTimeoutCount
+
+  It has only ever been used to limit timed-out monitor events.
+
+  Configurations containing CTDB_SET_EventScriptTimeoutCount=<n> will
+  cause CTDB to fail at startup.  Useful messages will be logged.
+
 KNOWN ISSUES
 ============
 
diff --git a/ctdb/doc/ctdb-tunables.7.xml b/ctdb/doc/ctdb-tunables.7.xml
index b029fdb..6c164f3 100644
--- a/ctdb/doc/ctdb-tunables.7.xml
+++ b/ctdb/doc/ctdb-tunables.7.xml
@@ -166,29 +166,29 @@
 
     <refsect2>
       <title>EventScriptTimeout</title>
-      <para>Default: 20</para>
+      <para>Default: 30</para>
       <para>
-	How long should ctdb let an event script run before aborting it and
-	marking the node unhealthy.
+	Maximum time in seconds to allow an event to run before timing
+	out.  This is the total time for all enabled scripts that are
+	run for an event, not just a single event script.
       </para>
-    </refsect2>
 
-    <refsect2>
-      <title>EventScriptTimeoutCount</title>
-      <para>Default: 1</para>
       <para>
-	How many events in a row needs to timeout before we flag the node UNHEALTHY.
-	This setting is useful if your scripts can not be written so that they
-	do not hang for benign reasons.
+	Note that timeouts are ignored for some events ("takeip",
+	"releaseip", "startrecovery", "recovered") and converted to
+	success.  The logic here is that the callers of these events
+	implement their own additional timeout.
       </para>
     </refsect2>
 
     <refsect2>
-      <title>EventScriptUnhealthyOnTimeout</title>
-      <para>Default: 0</para>
+      <title>MonitorTimeoutCount</title>
+      <para>Default: 20</para>
       <para>
-	This setting can be be used to make ctdb never become UNHEALTHY if your
-	eventscripts keep hanging/timing out.
+	How many monitor events in a row need to timeout before a node
+	is flagged as UNHEALTHY.  This setting is useful if scripts
+	can not be written so that they do not hang for benign
+	reasons.
       </para>
     </refsect2>
 
diff --git a/ctdb/doc/ctdb.1.xml b/ctdb/doc/ctdb.1.xml
index 0ff848c..3658c89 100644
--- a/ctdb/doc/ctdb.1.xml
+++ b/ctdb/doc/ctdb.1.xml
@@ -650,7 +650,7 @@ TakeoverTimeout         = 9
 MonitorInterval         = 15
 TickleUpdateInterval    = 20
 EventScriptTimeout      = 30
-EventScriptTimeoutCount = 1
+MonitorTimeoutCount     = 1
 RecoveryGracePeriod     = 120
 RecoveryBanPeriod       = 300
 DatabaseHashSize        = 100001
diff --git a/ctdb/include/ctdb_private.h b/ctdb/include/ctdb_private.h
index 2542c4a..57a13d9 100644
--- a/ctdb/include/ctdb_private.h
+++ b/ctdb/include/ctdb_private.h
@@ -356,7 +356,6 @@ struct ctdb_context {
 	int start_as_disabled;
 	int start_as_stopped;
 	bool valgrinding;
-	uint32_t event_script_timeouts; /* counting how many consecutive times an eventscript has timedout */
 	uint32_t *recd_ping_count;
 	TALLOC_CTX *recd_ctx; /* a context used to track recoverd monitoring events */
 	TALLOC_CTX *release_ips_ctx; /* a context used to automatically drop all IPs if we fail to recover the node */
diff --git a/ctdb/protocol/protocol.h b/ctdb/protocol/protocol.h
index 27bc828..672b5ed 100644
--- a/ctdb/protocol/protocol.h
+++ b/ctdb/protocol/protocol.h
@@ -572,7 +572,7 @@ struct ctdb_tunable_list {
 	uint32_t monitor_interval;
 	uint32_t tickle_update_interval;
 	uint32_t script_timeout;
-	uint32_t script_timeout_count; /* allow dodgy scripts to hang this many times in a row before we mark the node unhealthy */
+	uint32_t monitor_timeout_count; /* allow dodgy scripts to hang this many times in a row before we mark the node unhealthy */
 	uint32_t script_unhealthy_on_timeout; /* obsolete */
 	uint32_t recovery_grace_period;
 	uint32_t recovery_ban_period;
diff --git a/ctdb/server/ctdb_monitor.c b/ctdb/server/ctdb_monitor.c
index 44cc547..4f998d3 100644
--- a/ctdb/server/ctdb_monitor.c
+++ b/ctdb/server/ctdb_monitor.c
@@ -39,6 +39,7 @@ struct ctdb_monitor_state {
 	uint32_t monitoring_mode;
 	TALLOC_CTX *monitor_context;
 	uint32_t next_interval;
+	uint32_t event_script_timeouts;
 };
 
 static void ctdb_check_health(struct tevent_context *ev,
@@ -144,14 +145,20 @@ static void ctdb_health_callback(struct ctdb_context *ctdb, int status, void *p)
 	}
 
 	if (status == -ETIME) {
-		ctdb->event_script_timeouts++;
+		ctdb->monitor->event_script_timeouts++;
 
-		if (ctdb->event_script_timeouts >= ctdb->tunable.script_timeout_count) {
-			DEBUG(DEBUG_ERR, ("Maximum timeout count %u reached for eventscript. Making node unhealthy\n", ctdb->tunable.script_timeout_count));
+		if (ctdb->monitor->event_script_timeouts >=
+		    ctdb->tunable.monitor_timeout_count) {
+			DEBUG(DEBUG_ERR,
+			      ("Maximum monitor timeout count %u reached."
+			       " Making node unhealthy\n",
+			       ctdb->tunable.monitor_timeout_count));
 		} else {
 			/* We pretend this is OK. */
 			goto after_change_status;
 		}
+	} else {
+		ctdb->monitor->event_script_timeouts = 0;
 	}
 
 	if (status != 0 && !(node->flags & NODE_FLAGS_UNHEALTHY)) {
diff --git a/ctdb/server/ctdb_recoverd.c b/ctdb/server/ctdb_recoverd.c
index 13f833f..3f0cffc 100644
--- a/ctdb/server/ctdb_recoverd.c
+++ b/ctdb/server/ctdb_recoverd.c
@@ -418,7 +418,7 @@ static int run_startrecovery_eventscript(struct ctdb_recoverd *rec, struct ctdb_
 }
 
 /*
-  update the node capabilities for all connected nodes
+  Retrieve capabilities from all connected nodes
  */
 static int update_capabilities(struct ctdb_recoverd *rec,
 			       struct ctdb_node_map_old *nodemap)
@@ -528,36 +528,6 @@ static int set_recovery_mode(struct ctdb_context *ctdb,
 	return 0;
 }
 
-/*
-  change recovery master on all node
- */
-static int set_recovery_master(struct ctdb_context *ctdb, struct ctdb_node_map_old *nodemap, uint32_t pnn)
-{
-	TDB_DATA data;
-	TALLOC_CTX *tmp_ctx;
-	uint32_t *nodes;
-
-	tmp_ctx = talloc_new(ctdb);
-	CTDB_NO_MEMORY(ctdb, tmp_ctx);
-
-	data.dsize = sizeof(uint32_t);
-	data.dptr = (unsigned char *)&pnn;
-
-	nodes = list_of_active_nodes(ctdb, nodemap, tmp_ctx, true);
-	if (ctdb_client_async_control(ctdb, CTDB_CONTROL_SET_RECMASTER,
-					nodes, 0,
-					CONTROL_TIMEOUT(), false, data,
-					NULL, NULL,
-					NULL) != 0) {
-		DEBUG(DEBUG_ERR, (__location__ " Unable to set recmaster. Recovery failed.\n"));
-		talloc_free(tmp_ctx);
-		return -1;
-	}
-
-	talloc_free(tmp_ctx);
-	return 0;
-}
-
 /* update all remote nodes to use the same db priority that we have
    this can fail if the remove node has not yet been upgraded to 
    support this function, so we always return success and never fail
@@ -2000,15 +1970,6 @@ static int db_recovery_serial(struct ctdb_recoverd *rec, TALLOC_CTX *mem_ctx,
 
 	DEBUG(DEBUG_NOTICE, (__location__ " Recovery - updated vnnmap\n"));
 
-	/* update recmaster to point to us for all nodes */
-	ret = set_recovery_master(ctdb, nodemap, pnn);
-	if (ret!=0) {
-		DEBUG(DEBUG_ERR, (__location__ " Unable to set recovery master\n"));
-		return -1;
-	}
-
-	DEBUG(DEBUG_NOTICE, (__location__ " Recovery - updated recmaster\n"));
-
 	/* disable recovery mode */
 	ret = set_recovery_mode(ctdb, rec, nodemap, CTDB_RECOVERY_NORMAL, false);
 	if (ret != 0) {
@@ -2038,20 +1999,12 @@ static int do_recovery(struct ctdb_recoverd *rec,
 	DEBUG(DEBUG_NOTICE, (__location__ " Starting do_recovery\n"));
 
 	/* Check if the current node is still the recmaster.  It's possible that
-	 * re-election has changed the recmaster, but we have not yet updated
-	 * that information.
+	 * re-election has changed the recmaster.
 	 */
-	ret = ctdb_ctrl_getrecmaster(ctdb, mem_ctx, CONTROL_TIMEOUT(),
-				     pnn, &ctdb->recovery_master);
-	if (ret != 0) {
-		DEBUG(DEBUG_ERR, (__location__ " Unable to get recmaster\n"));
-		return -1;
-	}
-
-	if (pnn != ctdb->recovery_master) {
+	if (pnn != rec->recmaster) {
 		DEBUG(DEBUG_NOTICE,
 		      ("Recovery master changed to %u, aborting recovery\n",
-		       ctdb->recovery_master));
+		       rec->recmaster));
 		return -1;
 	}
 
@@ -2146,7 +2099,7 @@ static int do_recovery(struct ctdb_recoverd *rec,
 	*/
 	sync_recovery_lock_file_across_cluster(rec);
 
-	/* update the capabilities for all nodes */
+	/* Retrieve capabilities from all connected nodes */
 	ret = update_capabilities(rec, nodemap);
 	if (ret!=0) {
 		DEBUG(DEBUG_ERR, (__location__ " Unable to update node capabilities.\n"));
@@ -2377,12 +2330,13 @@ static int send_election_request(struct ctdb_recoverd *rec, uint32_t pnn)
 	/* first we assume we will win the election and set 
 	   recoverymaster to be ourself on the current node
 	 */
-	ret = ctdb_ctrl_setrecmaster(ctdb, CONTROL_TIMEOUT(), pnn, pnn);
+	ret = ctdb_ctrl_setrecmaster(ctdb, CONTROL_TIMEOUT(),
+				     CTDB_CURRENT_NODE, pnn);
 	if (ret != 0) {
-		DEBUG(DEBUG_ERR, (__location__ " failed to send recmaster election request\n"));
+		DEBUG(DEBUG_ERR, (__location__ " failed to set recmaster\n"));
 		return -1;
 	}
-
+	rec->recmaster = pnn;
 
 	/* send an election message to all active nodes */
 	DEBUG(DEBUG_INFO,(__location__ " Send election request to all active nodes\n"));
@@ -2786,11 +2740,13 @@ static void election_handler(uint64_t srvid, TDB_DATA data, void *private_data)
 	clear_ip_assignment_tree(ctdb);
 
 	/* ok, let that guy become recmaster then */
-	ret = ctdb_ctrl_setrecmaster(ctdb, CONTROL_TIMEOUT(), ctdb_get_pnn(ctdb), em->pnn);
+	ret = ctdb_ctrl_setrecmaster(ctdb, CONTROL_TIMEOUT(),
+				     CTDB_CURRENT_NODE, em->pnn);
 	if (ret != 0) {
-		DEBUG(DEBUG_ERR, (__location__ " failed to send recmaster election request"));
+		DEBUG(DEBUG_ERR, (__location__ " failed to set recmaster"));
 		return;
 	}
+	rec->recmaster = em->pnn;
 
 	return;
 }
@@ -2883,16 +2839,11 @@ static void monitor_handler(uint64_t srvid, TDB_DATA data, void *private_data)
 
 	nodemap->nodes[i].flags = c->new_flags;
 
-	ret = ctdb_ctrl_getrecmaster(ctdb, tmp_ctx, CONTROL_TIMEOUT(), 
-				     CTDB_CURRENT_NODE, &ctdb->recovery_master);
+	ret = ctdb_ctrl_getrecmode(ctdb, tmp_ctx, CONTROL_TIMEOUT(),
+				   CTDB_CURRENT_NODE, &ctdb->recovery_mode);
 
-	if (ret == 0) {
-		ret = ctdb_ctrl_getrecmode(ctdb, tmp_ctx, CONTROL_TIMEOUT(), 
-					   CTDB_CURRENT_NODE, &ctdb->recovery_mode);
-	}
-	
 	if (ret == 0 &&
-	    ctdb->recovery_master == ctdb->pnn &&
+	    rec->recmaster == ctdb->pnn &&
 	    ctdb->recovery_mode == CTDB_RECOVERY_NORMAL) {
 		/* Only do the takeover run if the perm disabled or unhealthy
 		   flags changed since these will cause an ip failover but not
@@ -2922,19 +2873,11 @@ static void push_flags_handler(uint64_t srvid, TDB_DATA data,
 	struct ctdb_node_flag_change *c = (struct ctdb_node_flag_change *)data.dptr;
 	struct ctdb_node_map_old *nodemap=NULL;
 	TALLOC_CTX *tmp_ctx = talloc_new(ctdb);
-	uint32_t recmaster;
 	uint32_t *nodes;
 
-	/* find the recovery master */
-	ret = ctdb_ctrl_getrecmaster(ctdb, tmp_ctx, CONTROL_TIMEOUT(), CTDB_CURRENT_NODE, &recmaster);
-	if (ret != 0) {
-		DEBUG(DEBUG_ERR, (__location__ " Unable to get recmaster from local node\n"));
-		talloc_free(tmp_ctx);
-		return;
-	}
-
 	/* read the node flags from the recmaster */
-	ret = ctdb_ctrl_getnodemap(ctdb, CONTROL_TIMEOUT(), recmaster, tmp_ctx, &nodemap);
+	ret = ctdb_ctrl_getnodemap(ctdb, CONTROL_TIMEOUT(), rec->recmaster,
+				   tmp_ctx, &nodemap);
 	if (ret != 0) {
 		DEBUG(DEBUG_ERR, (__location__ " Unable to get nodemap from node %u\n", c->pnn));
 		talloc_free(tmp_ctx);
@@ -3108,9 +3051,12 @@ static enum monitor_result verify_recmaster(struct ctdb_recoverd *rec, struct ct
 	rmdata->pnn    = pnn;
 	rmdata->status = MONITOR_OK;
 
-	/* loop over all active nodes and send an async getrecmaster call to 
+	/* loop over all active nodes and send an async getrecmaster call to
 	   them*/
 	for (j=0; j<nodemap->num; j++) {
+		if (nodemap->nodes[j].pnn == rec->recmaster) {
+			continue;
+		}
 		if (nodemap->nodes[j].flags & NODE_FLAGS_INACTIVE) {
 			continue;
 		}
@@ -3384,12 +3330,98 @@ static int update_recovery_lock_file(struct ctdb_context *ctdb)
 	return 0;
 }
 
+static enum monitor_result validate_recovery_master(struct ctdb_recoverd *rec,
+						    TALLOC_CTX *mem_ctx)
+{
+	struct ctdb_context *ctdb = rec->ctdb;
+	uint32_t pnn = ctdb_get_pnn(ctdb);
+	struct ctdb_node_map_old *nodemap = rec->nodemap;
+	struct ctdb_node_map_old *recmaster_nodemap = NULL;
+	int ret;
+
+	/* When recovery daemon is started, recmaster is set to
+	 * "unknown" so it knows to start an election.
+	 */
+	if (rec->recmaster == CTDB_UNKNOWN_PNN) {
+		DEBUG(DEBUG_NOTICE,
+		      ("Initial recovery master set - forcing election\n"));
+		return MONITOR_ELECTION_NEEDED;
+	}
+
+	/*
+	 * If the current recmaster does not have CTDB_CAP_RECMASTER,
+	 * but we have, then force an election and try to become the new
+	 * recmaster.
+	 */
+	if (!ctdb_node_has_capabilities(rec->caps,
+					rec->recmaster,
+					CTDB_CAP_RECMASTER) &&
+	    (rec->ctdb->capabilities & CTDB_CAP_RECMASTER) &&
+	    !(nodemap->nodes[pnn].flags & NODE_FLAGS_INACTIVE)) {
+		DEBUG(DEBUG_ERR,
+		      (" Current recmaster node %u does not have CAP_RECMASTER,"
+		       " but we (node %u) have - force an election\n",
+		       rec->recmaster, pnn));
+		return MONITOR_ELECTION_NEEDED;
+	}
+
+	/* Verify that the master node has not been deleted.  This
+	 * should not happen because a node should always be shutdown
+	 * before being deleted, causing a new master to be elected
+	 * before now.  However, if something strange has happened
+	 * then checking here will ensure we don't index beyond the
+	 * end of the nodemap array. */
+	if (rec->recmaster >= nodemap->num) {
+		DEBUG(DEBUG_ERR,
+		      ("Recmaster node %u has been deleted. Force election\n",
+		       rec->recmaster));
+		return MONITOR_ELECTION_NEEDED;
+	}
+
+	/* if recovery master is disconnected/deleted we must elect a new recmaster */
+	if (nodemap->nodes[rec->recmaster].flags &
+	    (NODE_FLAGS_DISCONNECTED|NODE_FLAGS_DELETED)) {
+		DEBUG(DEBUG_NOTICE,
+		      ("Recmaster node %u is disconnected/deleted. Force election\n",
+		       rec->recmaster));
+		return MONITOR_ELECTION_NEEDED;
+	}
+
+	/* get nodemap from the recovery master to check if it is inactive */
+	ret = ctdb_ctrl_getnodemap(ctdb, CONTROL_TIMEOUT(), rec->recmaster,
+				   mem_ctx, &recmaster_nodemap);
+	if (ret != 0) {
+		DEBUG(DEBUG_ERR,
+		      (__location__
+		       " Unable to get nodemap from recovery master %u\n",
+			  rec->recmaster));
+		return MONITOR_FAILED;
+	}
+
+
+	if ((recmaster_nodemap->nodes[rec->recmaster].flags & NODE_FLAGS_INACTIVE) &&
+	    (rec->node_flags & NODE_FLAGS_INACTIVE) == 0) {
+		DEBUG(DEBUG_NOTICE,
+		      ("Recmaster node %u is inactive. Force election\n",
+		       rec->recmaster));
+		/*
+		 * update our nodemap to carry the recmaster's notion of
+		 * its own flags, so that we don't keep freezing the
+		 * inactive recmaster node...
+		 */
+		nodemap->nodes[rec->recmaster].flags =
+			recmaster_nodemap->nodes[rec->recmaster].flags;
+		return MONITOR_ELECTION_NEEDED;
+	}
+
+	return MONITOR_OK;
+}
+
 static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec,
 		      TALLOC_CTX *mem_ctx)
 {
 	uint32_t pnn;
 	struct ctdb_node_map_old *nodemap=NULL;
-	struct ctdb_node_map_old *recmaster_nodemap=NULL;
 	struct ctdb_node_map_old **remote_nodemaps=NULL;
 	struct ctdb_vnn_map *vnnmap=NULL;
 	struct ctdb_vnn_map *remote_vnnmap=NULL;
@@ -3442,29 +3474,10 @@ static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec,
 		return;
 	}
 
-	/* Make sure that if recovery lock verification becomes disabled when
-	   we close the file
-	*/
-        if (ctdb->recovery_lock_file == NULL) {
-		ctdb_recovery_unlock(ctdb);
-	}
-
 	pnn = ctdb_get_pnn(ctdb);
 
-	/* get the vnnmap */
-	ret = ctdb_ctrl_getvnnmap(ctdb, CONTROL_TIMEOUT(), pnn, mem_ctx, &vnnmap);
-	if (ret != 0) {
-		DEBUG(DEBUG_ERR, (__location__ " Unable to get vnnmap from node %u\n", pnn));
-		return;
-	}
-
-
-	/* get number of nodes */
-	if (rec->nodemap) {
-		talloc_free(rec->nodemap);
-		rec->nodemap = NULL;
-		nodemap=NULL;
-	}
+	/* get nodemap */
+	TALLOC_FREE(rec->nodemap);
 	ret = ctdb_ctrl_getnodemap(ctdb, CONTROL_TIMEOUT(), pnn, rec, &rec->nodemap);
 	if (ret != 0) {
 		DEBUG(DEBUG_ERR, (__location__ " Unable to get nodemap from node %u\n", pnn));
@@ -3520,13 +3533,6 @@ static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec,
 		return;
 	}
 
-	/* check which node is the recovery master */
-	ret = ctdb_ctrl_getrecmaster(ctdb, mem_ctx, CONTROL_TIMEOUT(), pnn, &rec->recmaster);
-	if (ret != 0) {
-		DEBUG(DEBUG_ERR, (__location__ " Unable to get recmaster from node %u\n", pnn));
-		return;
-	}
-
 	/* If we are not the recmaster then do some housekeeping */
 	if (rec->recmaster != pnn) {
 		/* Ignore any IP reallocate requests - only recmaster
@@ -3541,80 +3547,23 @@ static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec,
 		TALLOC_FREE(rec->force_rebalance_nodes);
 	}
 
-	/* This is a special case.  When recovery daemon is started, recmaster
-	 * is set to -1.  If a node is not started in stopped state, then
-	 * start election to decide recovery master
-	 */
-	if (rec->recmaster == (uint32_t)-1) {
-		DEBUG(DEBUG_NOTICE,(__location__ " Initial recovery master set - forcing election\n"));
-		force_election(rec, pnn, nodemap);
-		return;
-	}
-
-	/* update the capabilities for all nodes */
+	/* Retrieve capabilities from all connected nodes */
 	ret = update_capabilities(rec, nodemap);
 	if (ret != 0) {
 		DEBUG(DEBUG_ERR, (__location__ " Unable to update node capabilities.\n"));
 		return;
 	}
 


-- 
Samba Shared Repository