[SCM] Samba Shared Repository - branch master updated

Thu Nov 12 08:29:03 UTC 2015

The branch, master has been updated
       via  0886637 ctdb-recoverd: Reload remote IPs as part of takeover run
       via  8cdae3a ctdb-recoverd: Move ctdb_reload_remote_public_ips() to ctdb_takeover.c
       via  c37e3c0 ctdb-recoverd: Remote IP validation can't cause a takeover run
       via  8b7b153 ctdb-recoverd: Drop culprit argument from ctdb_reload_remote_public_ips()
       via  1d7d5ab ctdb-recoverd: Trigger takeover run after rebalance timeout
       via  a7e8687 ctdb-recoverd: Remove unnecessary assignments of need_takeover_run
       via  d608862 ctdb-recoverd: Do not run recovery-related events around IP takeover
       via  bd0befa ctdb-recoverd: Drop some sanity checking in local IP verification
       via  ceb0988 ctdb-recoverd: Simplify using TALLOC_FREE()
      from  8936281 gss: samba member server returns incorrect error code with some versions of krb5

https://git.samba.org/?p=samba.git;a=shortlog;h=master


- Log -----------------------------------------------------------------
commit 0886637a5eb4a5244631070abfa96e31e2610685
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 20:04:41 2015 +1100

    ctdb-recoverd: Reload remote IPs as part of takeover run
    
    This is currently done before each IP takeover run, so just factor it
    in.
    
    ctdb_reload_remote_public_ips() becomes static.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    
    Autobuild-User(master): Amitay Isaacs <amitay at samba.org>
    Autobuild-Date(master): Thu Nov 12 09:28:45 CET 2015 on sn-devel-104

commit 8cdae3ade65780ebdd975ff9660bddf3d475cb82
Author: Martin Schwenke <martin at meltin.net>
Date:   Thu Oct 29 11:50:24 2015 +1100

    ctdb-recoverd: Move ctdb_reload_remote_public_ips() to ctdb_takeover.c
    
    This will help to untangle known and available public IP lists from
    the CTDB context.
    
    verify_remote_ip_allocation() needs a forward declaration.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit c37e3c05b0e522da27c3e1d16533ebdd382d7ea3
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 20:33:29 2015 +1100

    ctdb-recoverd: Remote IP validation can't cause a takeover run
    
    Remote IP validation is only called when a takeover run is about to
    happen anyway, so don't bother flagging one.  Given that a takeover
    run isn't being triggered, also drop the test that checks if takeover
    runs are disabled.  These are the only uses of the rec argument, so
    drop it.
    
    One possible further simplification would be to remove this function
    because it doesn't accomplish anything.  However, it is worth leaving
    it as a reminder that remote IP validation should be done properly at
    some time in the future.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 8b7b153cf640c311d46f5e0c3297aaf8fa4149d6
Author: Martin Schwenke <martin at meltin.net>
Date:   Thu Oct 29 10:42:29 2015 +1100

    ctdb-recoverd: Drop culprit argument from ctdb_reload_remote_public_ips()
    
    It is only used by the caller to print a message that includes the
    culprit.  However, ctdb_reload_remote_public_ips() already prints
    perfectly good messages and they include the culprit.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit 1d7d5abd31fbe73cea048de4b673a12cdddd108c
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 19:56:02 2015 +1100

    ctdb-recoverd: Trigger takeover run after rebalance timeout
    
    No need to do it immediately.  It will happen in less than a second.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit a7e8687b6d6035faffff5d26dc737589ccdd7dd9
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 19:52:37 2015 +1100

    ctdb-recoverd: Remove unnecessary assignments of need_takeover_run
    
    do_takeover_run() unsets this if it succeeds.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit d608862c5a72fb491330623ae75b011fe85610fc
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Oct 28 19:47:03 2015 +1100

    ctdb-recoverd: Do not run recovery-related events around IP takeover
    
    This is not a recovery, so do not run "startrecovery and "recovered"
    events.  There are other IP takeover runs where these are not run.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit bd0befa529c88173926bb8739d19ce9639d32bdf
Author: Martin Schwenke <martin at meltin.net>
Date:   Mon Nov 9 15:32:58 2015 +1100

    ctdb-recoverd: Drop some sanity checking in local IP verification
    
    The recovery start/end times used in the checks at the top of
    verify_local_ip_allocation() are set by the START_RECOVERY and
    END_RECOVERY controls.  A couple of takeover runs escape the checks
    because they were added later and are not surrounded by these
    controls.
    
    Recovery and IP allocation need to be untangled from each other, so
    recovery-related events should not be relied on for IP allocation.
    This means the solution is not to add these where they are "missing".
    
    The concern that the checks are addressing is to avoid local IP
    verification when IP addresses are in a state of flux.  Takeover runs
    on non-master nodes are already disabled while a takeover run is in
    progress, so local IP verification is already skipped in that case.
    The other case is the master node, which will be busy with the
    takeover run, rather than running main_loop().
    
    The other issue is races.  verify_local_ip_allocation() takes a
    non-zero amount of time to fetch IP addresses from the local CTDB
    daemon and during this time a recovery or takeover run can start, but
    a takeover run can still be triggered.  The current tests do not stop
    this.
    
    Apart from all of this, with most reasonable public IP address
    configurations, an extra takeover run will be a no-op so is not a
    cause for concern.
    
    It is safe to drop these checks.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

commit ceb0988e14c5b3f0050ba94d9f97fcbd4fcfae8b
Author: Martin Schwenke <martin at meltin.net>
Date:   Fri Oct 23 16:03:38 2015 +1100

    ctdb-recoverd: Simplify using TALLOC_FREE()
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>

-----------------------------------------------------------------------

Summary of changes:
 ctdb/include/ctdb_private.h |   3 -
 ctdb/server/ctdb_recoverd.c | 208 ++------------------------------------------
 ctdb/server/ctdb_takeover.c |  79 ++++++++++++++++-
 3 files changed, 84 insertions(+), 206 deletions(-)


Changeset truncated at 500 lines:

diff --git a/ctdb/include/ctdb_private.h b/ctdb/include/ctdb_private.h
index e6f24e5..100b036 100644
--- a/ctdb/include/ctdb_private.h
+++ b/ctdb/include/ctdb_private.h
@@ -1007,9 +1007,6 @@ int32_t ctdb_control_del_public_address(struct ctdb_context *ctdb,
 					struct ctdb_req_control_old *c,
 					TDB_DATA recdata, bool *async_reply);
 
-int verify_remote_ip_allocation(struct ctdb_context *ctdb,
-				struct ctdb_public_ip_list_old *ips,
-				uint32_t pnn);
 int update_ip_assignment_tree(struct ctdb_context *ctdb,
 				struct ctdb_public_ip *ip);
 void clear_ip_assignment_tree(struct ctdb_context *ctdb);
diff --git a/ctdb/server/ctdb_recoverd.c b/ctdb/server/ctdb_recoverd.c
index 66786f3..be419fb 100644
--- a/ctdb/server/ctdb_recoverd.c
+++ b/ctdb/server/ctdb_recoverd.c
@@ -1564,88 +1564,6 @@ static int recover_database(struct ctdb_recoverd *rec,
 	return 0;
 }
 
-static int ctdb_reload_remote_public_ips(struct ctdb_context *ctdb,
-					 struct ctdb_recoverd *rec,
-					 struct ctdb_node_map_old *nodemap,
-					 uint32_t *culprit)
-{
-	int j;
-	int ret;
-
-	if (ctdb->num_nodes != nodemap->num) {
-		DEBUG(DEBUG_ERR, (__location__ " ctdb->num_nodes (%d) != nodemap->num (%d) invalid param\n",
-				  ctdb->num_nodes, nodemap->num));
-		if (culprit) {
-			*culprit = ctdb->pnn;
-		}
-		return -1;
-	}
-
-	for (j=0; j<nodemap->num; j++) {
-		/* For readability */
-		struct ctdb_node *node = ctdb->nodes[j];
-
-		/* release any existing data */
-		if (node->known_public_ips) {
-			talloc_free(node->known_public_ips);
-			node->known_public_ips = NULL;
-		}
-		if (node->available_public_ips) {
-			talloc_free(node->available_public_ips);
-			node->available_public_ips = NULL;
-		}
-
-		if (nodemap->nodes[j].flags & NODE_FLAGS_INACTIVE) {
-			continue;
-		}
-
-		/* Retrieve the list of known public IPs from the node */
-		ret = ctdb_ctrl_get_public_ips_flags(ctdb,
-					CONTROL_TIMEOUT(),
-					node->pnn,
-					ctdb->nodes,
-					0,
-					&node->known_public_ips);
-		if (ret != 0) {
-			DEBUG(DEBUG_ERR,
-			      ("Failed to read known public IPs from node: %u\n",
-			       node->pnn));
-			if (culprit) {
-				*culprit = node->pnn;
-			}
-			return -1;
-		}
-
-		if (ctdb->do_checkpublicip &&
-		    !ctdb_op_is_disabled(rec->takeover_run) &&
-		    verify_remote_ip_allocation(ctdb,
-						 node->known_public_ips,
-						 node->pnn)) {
-			DEBUG(DEBUG_ERR,("Trigger IP reallocation\n"));
-			rec->need_takeover_run = true;
-		}
-
-		/* Retrieve the list of available public IPs from the node */
-		ret = ctdb_ctrl_get_public_ips_flags(ctdb,
-					CONTROL_TIMEOUT(),
-					node->pnn,
-					ctdb->nodes,
-					CTDB_PUBLIC_IP_FLAGS_ONLY_AVAILABLE,
-					&node->available_public_ips);
-		if (ret != 0) {
-			DEBUG(DEBUG_ERR,
-			      ("Failed to read available public IPs from node: %u\n",
-			       node->pnn));
-			if (culprit) {
-				*culprit = node->pnn;
-			}
-			return -1;
-		}
-	}
-
-	return 0;
-}
-
 /* when we start a recovery, make sure all nodes use the same reclock file
    setting
 */
@@ -2114,7 +2032,6 @@ static int do_recovery(struct ctdb_recoverd *rec,
 	int i, ret;
 	struct ctdb_dbid_map_old *dbmap;
 	struct timeval start_time;
-	uint32_t culprit = (uint32_t)-1;
 	bool self_ban;
 	bool par_recovery;
 
@@ -2282,15 +2199,6 @@ static int do_recovery(struct ctdb_recoverd *rec,
 		goto fail;
 	}
 
-	/* Fetch known/available public IPs from each active node */
-	ret = ctdb_reload_remote_public_ips(ctdb, rec, nodemap, &culprit);
-	if (ret != 0) {
-		DEBUG(DEBUG_ERR,("Failed to read public ips from remote node %d\n",
-				 culprit));
-		rec->need_takeover_run = true;
-		goto fail;
-	}
-
 	do_takeover_run(rec, nodemap, false);
 
 	/* execute the "recovered" event script on all nodes */
@@ -2527,8 +2435,7 @@ static void election_send_request(struct tevent_context *ev,
 		DEBUG(DEBUG_ERR,("Failed to send election request!\n"));
 	}
 
-	talloc_free(rec->send_election_te);
-	rec->send_election_te = NULL;
+	TALLOC_FREE(rec->send_election_te);
 }
 
 /*
@@ -2604,8 +2511,8 @@ static void ctdb_rebalance_timeout(struct tevent_context *ev,
 	}
 
 	DEBUG(DEBUG_NOTICE,
-	      ("Rebalance timeout occurred - do takeover run\n"));
-	do_takeover_run(rec, rec->nodemap, false);
+	      ("Rebalance timeout occurred - trigger takeover run\n"));
+	rec->need_takeover_run = true;
 }
 
 
@@ -2802,7 +2709,6 @@ static void process_ipreallocate_requests(struct ctdb_context *ctdb,
 {
 	TDB_DATA result;
 	int32_t ret;
-	uint32_t culprit;
 	struct srvid_requests *current;
 
 	DEBUG(DEBUG_INFO, ("recovery master forced ip reallocation\n"));
@@ -2815,21 +2721,10 @@ static void process_ipreallocate_requests(struct ctdb_context *ctdb,
 	current = rec->reallocate_requests;
 	rec->reallocate_requests = NULL;
 
-	/* update the list of public ips that a node can handle for
-	   all connected nodes
-	*/
-	ret = ctdb_reload_remote_public_ips(ctdb, rec, rec->nodemap, &culprit);
-	if (ret != 0) {
-		DEBUG(DEBUG_ERR,("Failed to read public ips from remote node %d\n",
-				 culprit));
-		rec->need_takeover_run = true;
-	}
-	if (ret == 0) {
-		if (do_takeover_run(rec, rec->nodemap, false)) {
-			ret = ctdb_get_pnn(ctdb);
-		} else {
-			ret = -1;
-		}
+	if (do_takeover_run(rec, rec->nodemap, false)) {
+		ret = ctdb_get_pnn(ctdb);
+	} else {
+		ret = -1;
 	}
 
 	result.dsize = sizeof(int32_t);
@@ -3317,19 +3212,9 @@ static bool interfaces_have_changed(struct ctdb_context *ctdb,
 static int verify_local_ip_allocation(struct ctdb_context *ctdb, struct ctdb_recoverd *rec, uint32_t pnn, struct ctdb_node_map_old *nodemap)
 {
 	TALLOC_CTX *mem_ctx = talloc_new(NULL);
-	struct ctdb_uptime *uptime1 = NULL;
-	struct ctdb_uptime *uptime2 = NULL;
 	int ret, j;
 	bool need_takeover_run = false;
 
-	ret = ctdb_ctrl_uptime(ctdb, mem_ctx, CONTROL_TIMEOUT(),
-				CTDB_CURRENT_NODE, &uptime1);
-	if (ret != 0) {
-		DEBUG(DEBUG_ERR, ("Unable to get uptime from local node %u\n", pnn));
-		talloc_free(mem_ctx);
-		return -1;
-	}
-
 	if (interfaces_have_changed(ctdb, rec)) {
 		DEBUG(DEBUG_NOTICE, ("The interfaces status has changed on "
 				     "local node %u - force takeover run\n",
@@ -3337,39 +3222,6 @@ static int verify_local_ip_allocation(struct ctdb_context *ctdb, struct ctdb_rec
 		need_takeover_run = true;
 	}
 
-	ret = ctdb_ctrl_uptime(ctdb, mem_ctx, CONTROL_TIMEOUT(),
-				CTDB_CURRENT_NODE, &uptime2);
-	if (ret != 0) {
-		DEBUG(DEBUG_ERR, ("Unable to get uptime from local node %u\n", pnn));
-		talloc_free(mem_ctx);
-		return -1;
-	}
-
-	/* skip the check if the startrecovery time has changed */
-	if (timeval_compare(&uptime1->last_recovery_started,
-			    &uptime2->last_recovery_started) != 0) {
-		DEBUG(DEBUG_NOTICE, (__location__ " last recovery time changed while we read the public ip list. skipping public ip address check\n"));
-		talloc_free(mem_ctx);
-		return 0;
-	}
-
-	/* skip the check if the endrecovery time has changed */
-	if (timeval_compare(&uptime1->last_recovery_finished,
-			    &uptime2->last_recovery_finished) != 0) {
-		DEBUG(DEBUG_NOTICE, (__location__ " last recovery time changed while we read the public ip list. skipping public ip address check\n"));
-		talloc_free(mem_ctx);
-		return 0;
-	}
-
-	/* skip the check if we have started but not finished recovery */
-	if (timeval_compare(&uptime1->last_recovery_finished,
-			    &uptime1->last_recovery_started) != 1) {
-		DEBUG(DEBUG_INFO, (__location__ " in the middle of recovery or ip reallocation. skipping public ip address check\n"));
-		talloc_free(mem_ctx);
-
-		return 0;
-	}
-
 	/* verify that we have the ip addresses we should have
 	   and we don't have ones we shouldnt have.
 	   if we find an inconsistency we set recmode to
@@ -4073,56 +3925,12 @@ static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec,
 
 	/* we might need to change who has what IP assigned */
 	if (rec->need_takeover_run) {
-		uint32_t culprit = (uint32_t)-1;
-
-		rec->need_takeover_run = false;
-
-		/* update the list of public ips that a node can handle for
-		   all connected nodes
-		*/
-		ret = ctdb_reload_remote_public_ips(ctdb, rec, nodemap, &culprit);
-		if (ret != 0) {
-			DEBUG(DEBUG_ERR,("Failed to read public ips from remote node %d\n",
-					 culprit));
-			rec->need_takeover_run = true;
-			return;
-		}
-
-		/* execute the "startrecovery" event script on all nodes */
-		ret = run_startrecovery_eventscript(rec, nodemap);
-		if (ret!=0) {
-			DEBUG(DEBUG_ERR, (__location__ " Unable to run the 'startrecovery' event on cluster\n"));
-			ctdb_set_culprit(rec, ctdb->pnn);
-			do_recovery(rec, mem_ctx, pnn, nodemap, vnnmap);
-			return;
-		}
-
 		/* If takeover run fails, then the offending nodes are
 		 * assigned ban culprit counts. And we re-try takeover.
 		 * If takeover run fails repeatedly, the node would get
 		 * banned.
-		 *
-		 * If rec->need_takeover_run is not set to true at this
-		 * failure, monitoring is disabled cluster-wide (via
-		 * startrecovery eventscript) and will not get enabled.
 		 */
-		if (!do_takeover_run(rec, nodemap, true)) {
-			return;
-		}
-
-		/* execute the "recovered" event script on all nodes */
-		ret = run_recovered_eventscript(rec, nodemap, "monitor_cluster");
-#if 0
-// we cant check whether the event completed successfully
-// since this script WILL fail if the node is in recovery mode
-// and if that race happens, the code here would just cause a second
-// cascading recovery.
-		if (ret!=0) {
-			DEBUG(DEBUG_ERR, (__location__ " Unable to run the 'recovered' event on cluster. Update of public ips failed.\n"));
-			ctdb_set_culprit(rec, ctdb->pnn);
-			do_recovery(rec, mem_ctx, pnn, nodemap, vnnmap);
-		}
-#endif
+		do_takeover_run(rec, nodemap, true);
 	}
 }
 
diff --git a/ctdb/server/ctdb_takeover.c b/ctdb/server/ctdb_takeover.c
index 6462de8..d21125a 100644
--- a/ctdb/server/ctdb_takeover.c
+++ b/ctdb/server/ctdb_takeover.c
@@ -1388,6 +1388,72 @@ static int getips_count_callback(void *param, void *data)
 	return 0;
 }
 
+static int verify_remote_ip_allocation(struct ctdb_context *ctdb,
+				       struct ctdb_public_ip_list_old *ips,
+				       uint32_t pnn);
+
+static int ctdb_reload_remote_public_ips(struct ctdb_context *ctdb,
+					 struct ctdb_node_map_old *nodemap)
+{
+	int j;
+	int ret;
+
+	if (ctdb->num_nodes != nodemap->num) {
+		DEBUG(DEBUG_ERR, (__location__ " ctdb->num_nodes (%d) != nodemap->num (%d) invalid param\n",
+				  ctdb->num_nodes, nodemap->num));
+		return -1;
+	}
+
+	for (j=0; j<nodemap->num; j++) {
+		/* For readability */
+		struct ctdb_node *node = ctdb->nodes[j];
+
+		/* release any existing data */
+		TALLOC_FREE(node->known_public_ips);
+		TALLOC_FREE(node->available_public_ips);
+
+		if (nodemap->nodes[j].flags & NODE_FLAGS_INACTIVE) {
+			continue;
+		}
+
+		/* Retrieve the list of known public IPs from the node */
+		ret = ctdb_ctrl_get_public_ips_flags(ctdb,
+					TAKEOVER_TIMEOUT(),
+					node->pnn,
+					ctdb->nodes,
+					0,
+					&node->known_public_ips);
+		if (ret != 0) {
+			DEBUG(DEBUG_ERR,
+			      ("Failed to read known public IPs from node: %u\n",
+			       node->pnn));
+			return -1;
+		}
+
+		if (ctdb->do_checkpublicip) {
+			verify_remote_ip_allocation(ctdb,
+						    node->known_public_ips,
+						    node->pnn);
+		}
+
+		/* Retrieve the list of available public IPs from the node */
+		ret = ctdb_ctrl_get_public_ips_flags(ctdb,
+					TAKEOVER_TIMEOUT(),
+					node->pnn,
+					ctdb->nodes,
+					CTDB_PUBLIC_IP_FLAGS_ONLY_AVAILABLE,
+					&node->available_public_ips);
+		if (ret != 0) {
+			DEBUG(DEBUG_ERR,
+			      ("Failed to read available public IPs from node: %u\n",
+			       node->pnn));
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
 static struct public_ip_list *
 create_merged_ip_list(struct ctdb_context *ctdb)
 {
@@ -2605,6 +2671,13 @@ int ctdb_takeover_run(struct ctdb_context *ctdb, struct ctdb_node_map_old *nodem
 		return -1;
 	}
 
+	/* Fetch known/available public IPs from each active node */
+	ret = ctdb_reload_remote_public_ips(ctdb, nodemap);
+	if (ret != 0) {
+		talloc_free(tmp_ctx);
+		return -1;
+	}
+
 	/* Short-circuit IP allocation if no nodes are in the RUNNING
 	 * runstate yet, since no nodes will be able to host IPs */
 	can_host_ips = false;
@@ -4219,9 +4292,9 @@ int32_t ctdb_control_ipreallocated(struct ctdb_context *ctdb,
    node has the expected ip allocation.
    This is verified against ctdb->ip_tree
 */
-int verify_remote_ip_allocation(struct ctdb_context *ctdb,
-				struct ctdb_public_ip_list_old *ips,
-				uint32_t pnn)
+static int verify_remote_ip_allocation(struct ctdb_context *ctdb,
+				       struct ctdb_public_ip_list_old *ips,
+				       uint32_t pnn)
 {
 	struct public_ip_list *tmp_ip;
 	int i;


-- 
Samba Shared Repository