[SCM] CTDB repository - branch master updated - ctdb-2.2-87-g4118262

Fri Jul 5 00:03:25 MDT 2013

The branch, master has been updated
       via  41182623891d74a7e9e9c453183411a161201e67 (commit)
       via  e1cf1f728236d808bb41265e74bc65f54bf1c133 (commit)
       via  f606df4f2db754592e6d1a16c26e155cacb2beef (commit)
       via  ceb5b2d37f7ab4894908ec26f3812b3bed991525 (commit)
       via  520914e7ee1b879c1080e5857fda18ed5b973fd6 (commit)
       via  4d0f26b306fc465d551d340b0e7dce4412eae3fd (commit)
       via  0a292fa8939a1343e44cadaa8ed9f3c0f18ca82f (commit)
       via  f0942fa01cd422133fc9398f56b4855397d7bc86 (commit)
       via  298c4d2c3b4ea3d900c91f5a0a5aca2952a13d61 (commit)
       via  9f6cd8b0bea619991c9f3bf35188c5950dabf8f4 (commit)
       via  035bf3eecf99337c84d4ad16cdbf297b1fa037db (commit)
       via  3af2d833b63af9931792106db71797f3692669a8 (commit)
       via  c0a9456692c88a7a5542cd893d8f326524d3f94e (commit)
       via  ce04f1c107b4392ca955d9f29b93aaaae62439ce (commit)
       via  c5797f2942e83da24df548ea07196fbbac0eab20 (commit)
       via  f1f1b0c24b9b6cd24b83a4e4da16e179287ec6ac (commit)
      from  16afe36de52561a62372c14b567683dc898369d5 (commit)

http://gitweb.samba.org/?p=ctdb.git;a=shortlog;h=master


- Log -----------------------------------------------------------------
commit 41182623891d74a7e9e9c453183411a161201e67
Author: Amitay Isaacs <amitay at gmail.com>
Date:   Fri Jul 5 14:04:20 2013 +1000

    recoverd: Fix buffer overflow error in reloadips
    
    Signed-off-by: Amitay Isaacs <amitay at gmail.com>
    Pair-Programmed-With: Martin Schwenke <martin at meltin.net>

commit e1cf1f728236d808bb41265e74bc65f54bf1c133
Author: Martin Schwenke <martin at meltin.net>
Date:   Thu Jul 4 20:02:29 2013 +1000

    tests/eventscripts: Add some rudimentary tests for 60.ganesha
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit f606df4f2db754592e6d1a16c26e155cacb2beef
Author: Martin Schwenke <martin at meltin.net>
Date:   Thu Jul 4 16:05:01 2013 +1000

    eventscripts: New configuration variable $CTDB_SKIP_GANESHA_NFSD_CHECK
    
    This allows 60.ganesha to be unit tested, except for the core Ganesha
    monitoring code.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit ceb5b2d37f7ab4894908ec26f3812b3bed991525
Author: Martin Schwenke <martin at meltin.net>
Date:   Thu Jul 4 16:00:33 2013 +1000

    eventscript: Move Ganesha nfsd monitoring to a function
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit 520914e7ee1b879c1080e5857fda18ed5b973fd6
Author: Martin Schwenke <martin at meltin.net>
Date:   Thu Jul 4 15:11:54 2013 +1000

    eventscripts: Drop RPC service version from nfs_check_rpc_service() calls
    
    Support for this was removed in commit
    77302dbfd85754e02559eccb2dd6c090db0b6b9f and I overlooked its use in
    60.ganesha.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Pair-programmed-with: Amitay Isaacs <amitay at gmail.com>

commit 4d0f26b306fc465d551d340b0e7dce4412eae3fd
Author: Martin Schwenke <martin at meltin.net>
Date:   Tue Jul 2 14:43:17 2013 +1000

    ctdbd: Log something when releasing all IPs
    
    At the moment this is silent and it can be confusing to see IPs just
    disappear.
    
    Also, this message:
    
      Been in recovery mode for too long. Dropping all IPS
    
    can cause anxiety when all IPs should already have been dropped.
    Adding a comforting message saying that 0 IPs were dropped relieves
    such anxiety.  :-)
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit 0a292fa8939a1343e44cadaa8ed9f3c0f18ca82f
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jun 30 19:00:36 2013 +1000

    recoverd: Minor style improvements for ctdb_reload_remote_public_ips()
    
    * Add a variable to the loop to make the code more readable and have
      it generally fit into 80 columns.
    
    * Improve comments.
    
    * Improve log messages.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit f0942fa01cd422133fc9398f56b4855397d7bc86
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jun 30 18:45:46 2013 +1000

    recoverd: Clean up log messages in remote IP verification
    
    The log messages in verify_remote_ip_allocation() are confusing
    because they don't include the PNN of the problem node, because it is
    not known in this function.
    
    Add the PNN of the node being verified as a function argument and then
    shuffle the log messages around to make them clearer.
    
    Also fold 3 nested if statements into just one.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
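
Note on folding the nested if statements: the short sketch below illustrates why the fold is behaviour-preserving. It is not the CTDB code. `struct sketch_state`, `sketch_verify()`, `check_nested()` and `check_folded()` are hypothetical names invented for this illustration; only the three conditions mirror the ones in the commit. Because `&&` short-circuits left to right in C, the verification call is reached under exactly the same circumstances in both forms.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pieces of state the check consults. */
struct sketch_state {
	bool do_checkpublicip;      /* is IP checking enabled at all? */
	void *ip_check_disable_ctx; /* non-NULL while checks are disabled */
	bool need_takeover_run;     /* set when reallocation is needed */
	int verify_calls;           /* counts calls, to show short-circuiting */
};

/* Hypothetical verifier: records that it ran and returns a canned result
 * (non-zero means "allocation is inconsistent"). */
static int sketch_verify(struct sketch_state *s, int result)
{
	s->verify_calls++;
	return result;
}

/* Shape of the check before the commit: three nested ifs. */
static void check_nested(struct sketch_state *s, int verify_result)
{
	if (s->do_checkpublicip) {
		if (s->ip_check_disable_ctx == NULL) {
			if (sketch_verify(s, verify_result)) {
				s->need_takeover_run = true;
			}
		}
	}
}

/* Shape of the check after the commit: one condition.  The verifier is
 * only called when both earlier operands are true. */
static void check_folded(struct sketch_state *s, int verify_result)
{
	if (s->do_checkpublicip &&
	    (s->ip_check_disable_ctx == NULL) &&
	    sketch_verify(s, verify_result)) {
		s->need_takeover_run = true;
	}
}
```

Calling both functions on identical copies of the state leaves `need_takeover_run` and `verify_calls` equal for every combination of inputs.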

commit 298c4d2c3b4ea3d900c91f5a0a5aca2952a13d61
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jun 30 17:57:33 2013 +1000

    recoverd: Fix an unclear log message - "Restart recovery process"
    
    When the recovery master notices a node in recovery mode, it starts
    the recovery process; it doesn't restart it.
    
    Update documentation to match.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit 9f6cd8b0bea619991c9f3bf35188c5950dabf8f4
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jun 30 17:53:37 2013 +1000

    recoverd: Fix an incorrect comment
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit 035bf3eecf99337c84d4ad16cdbf297b1fa037db
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jun 30 17:48:01 2013 +1000

    ctdbd: Use ctdb_die() on "setup" event failure
    
    This is slightly easier to read because it all fits on 1 line.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit 3af2d833b63af9931792106db71797f3692669a8
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jun 30 17:43:52 2013 +1000

    ctdbd: Avoid a core dump when "init" event fails
    
    The "init" event only really fails in the scripts, which should log
    something useful on failure.  Therefore, a core dump isn't terribly
    useful and sometimes attracts unwanted attention.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit c0a9456692c88a7a5542cd893d8f326524d3f94e
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jun 30 17:42:11 2013 +1000

    util: New function ctdb_die()
    
    This is like ctdb_fatal() but exits cleanly without dumping core or
    generating a backtrace.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
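
Note on the difference between the two exit paths: the sketch below is a standalone illustration, not the CTDB implementation. `sketch_fatal()`, `sketch_die()` and `run_in_child()` are hypothetical names, and `fprintf()` stands in for the `DEBUG()` logging macro. It shows the distinction the commit message describes: `abort()` ends the process with SIGABRT (which may produce a core dump, depending on system settings), while `exit(1)` ends it normally with status 1.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a ctdb_fatal()-style exit: log, then abort(). */
static void sketch_fatal(const char *msg)
{
	fprintf(stderr, "fatal: %s\n", msg);
	abort();
}

/* Hypothetical stand-in for a ctdb_die()-style exit: log, then exit(1). */
static void sketch_die(const char *msg)
{
	fprintf(stderr, "exiting with error: %s\n", msg);
	exit(1);
}

enum how_ended { ENDED_EXIT_1, ENDED_SIGABRT, ENDED_OTHER };

/* Run fn in a forked child and report how the child terminated. */
static enum how_ended run_in_child(void (*fn)(const char *))
{
	int status;
	pid_t pid = fork();

	if (pid < 0) {
		return ENDED_OTHER;
	}
	if (pid == 0) {
		fn("example failure");
		_exit(0); /* only reached if fn returns */
	}
	if (waitpid(pid, &status, 0) != pid) {
		return ENDED_OTHER;
	}
	if (WIFEXITED(status) && WEXITSTATUS(status) == 1) {
		return ENDED_EXIT_1;
	}
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT) {
		return ENDED_SIGABRT;
	}
	return ENDED_OTHER;
}
```

A supervisor or init script that inspects the wait status would see a plain failure exit for the second form rather than a signal death, which matches the intent of not attracting attention with a core dump.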

commit ce04f1c107b4392ca955d9f29b93aaaae62439ce
Author: Martin Schwenke <martin at meltin.net>
Date:   Mon Jun 24 19:03:26 2013 +1000

    eventscripts: When replaying monitor status, don't log empty output
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>

commit c5797f2942e83da24df548ea07196fbbac0eab20
Author: Martin Schwenke <martin at meltin.net>
Date:   Mon Jun 24 16:05:03 2013 +1000

    ctdbd: Release IP callback should fail if the IP is still hosted
    
    At the moment there are (at least) 2 bugs that cause rogue IPs:
    
    * A race where release_ip_callback() runs after a "subsequent" take IP
      has completed.  The IP is back on an interface but we unset
      vnn->iface in the callback.
    
    * A "releaseip" eventscript times out.  We ignore the timeout and call
      it success, deleting the VNN even if the IP is still hosted.
    
      We could decide not to ignore the timeout and ban the node, but
      killing TCP connections can take a long time and that might result
      in a lot of banning.  We probably won't reinstate banning on
      "releaseip" until killing TCP connections has been optimised.
    
    In both cases, a rogue IP can be avoided by leaving vnn->iface set and
    simply failing the control.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Pair-programmed-with: Amitay Isaacs <amitay at gmail.com>
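
Note on the approach described above: the following is a minimal sketch under stated assumptions, not the code from this commit. `struct sketch_vnn` and `sketch_release_ip_done()` are hypothetical, and the `ip_on_system` flag stands in for a real check of whether the address is still configured on an interface. The idea is that the release callback refuses to forget the interface while the IP is still hosted, and reports failure instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a public address record. */
struct sketch_vnn {
	const char *iface;  /* interface believed to host the IP, or NULL */
	bool ip_on_system;  /* stand-in for "is the IP still configured?" */
};

/*
 * Called when a release attempt has finished.
 * Returns 0 if the IP is gone and the record was cleared,
 * or -1 if the IP is still hosted, in which case iface is left set
 * so that a later release can still find and remove it.
 */
static int sketch_release_ip_done(struct sketch_vnn *vnn)
{
	if (vnn->ip_on_system) {
		/* Still hosted: fail rather than create a rogue IP. */
		return -1;
	}

	vnn->iface = NULL;
	return 0;
}
```

Both failure modes listed in the commit message (the race with a later take IP, and a timed-out "releaseip" script) end up in the first branch, because in each case the address is still present when the callback runs.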

commit f1f1b0c24b9b6cd24b83a4e4da16e179287ec6ac
Author: Martin Schwenke <martin at meltin.net>
Date:   Mon Jun 24 15:49:48 2013 +1000

    ctdbd: Log warnings in release IP when unexpected interface is encountered
    
    Previous code changes work around a potential problem but do not
    provide useful information when the problem occurs.
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Pair-programmed-with: Amitay Isaacs <amitay at gmail.com>

-----------------------------------------------------------------------

Summary of changes:
 common/ctdb_util.c                                 |    9 ++
 config/events.d/60.ganesha                         |  114 +++++++++++---------
 config/functions                                   |    4 +-
 doc/recovery-process.txt                           |    2 +-
 include/ctdb_private.h                             |    4 +-
 server/ctdb_daemon.c                               |    5 +-
 server/ctdb_recoverd.c                             |   61 ++++++-----
 server/ctdb_takeover.c                             |   74 +++++++++++--
 ...fs.monitor.101.sh => 60.ganesha.monitor.101.sh} |    2 +-
 tests/eventscripts/60.ganesha.monitor.131.sh       |   17 +++
 tests/eventscripts/60.ganesha.monitor.141.sh       |   39 +++++++
 tests/eventscripts/scripts/local.sh                |   14 +++
 12 files changed, 246 insertions(+), 99 deletions(-)
 copy tests/eventscripts/{60.nfs.monitor.101.sh => 60.ganesha.monitor.101.sh} (85%)
 create mode 100755 tests/eventscripts/60.ganesha.monitor.131.sh
 create mode 100755 tests/eventscripts/60.ganesha.monitor.141.sh


Changeset truncated at 500 lines:

diff --git a/common/ctdb_util.c b/common/ctdb_util.c
index d2bce36..a2da3bc 100644
--- a/common/ctdb_util.c
+++ b/common/ctdb_util.c
@@ -59,6 +59,15 @@ void ctdb_fatal(struct ctdb_context *ctdb, const char *msg)
 	abort();
 }
 
+/*
+  like ctdb_fatal() but a core/backtrace would not be useful
+*/
+void ctdb_die(struct ctdb_context *ctdb, const char *msg)
+{
+	DEBUG(DEBUG_ALERT,("ctdb exiting with error: %s\n", msg));
+	exit(1);
+}
+
 /* Invoke an external program to do some sort of tracing on the CTDB
  * process.  This might block for a little while.  The external
  * program is specified by the environment variable
diff --git a/config/events.d/60.ganesha b/config/events.d/60.ganesha
index 0066c54..09860d0 100755
--- a/config/events.d/60.ganesha
+++ b/config/events.d/60.ganesha
@@ -88,6 +88,63 @@ create_ganesha_recdirs ()
     mkdir -p $GANRECDIR3
 }
 
+monitor_ganesha_nfsd ()
+{
+	create_ganesha_recdirs
+	service_name=${service_name}_process
+
+	PIDFILE="/var/run/ganesha.pid"
+	CUR_STATE=`get_cluster_fs_state`
+	GANESHA="/usr/bin/$CTDB_CLUSTER_FILESYSTEM_TYPE.ganesha.nfsd"
+	if { read PID < $PIDFILE && \
+	    grep "$GANESHA" "/proc/$PID/cmdline" ; } >/dev/null 2>&1 ; then
+		ctdb_counter_init "$service_name"
+	else
+	    if [ $CUR_STATE = "active" ]; then
+		echo "Trying fast restart of NFS service"
+		startstop_ganesha restart
+		ctdb_counter_incr "$service_name"
+		ctdb_check_counter "error" "-ge" "6" "$service_name"
+	    fi
+	fi
+
+	service_name="nfs-ganesha-$CTDB_CLUSTER_FILESYSTEM_TYPE"_service
+	# check that NFS is posting forward progress
+	if [ $CUR_STATE = "active" -a "$CTDB_NFS_SKIP_KNFSD_ALIVE_CHECK" != "yes" ] ; then
+	    MAXREDS=2
+	    MAXSTALL=120
+	    RESTART=0
+
+	    NUMREDS=`ls $GANRECDIR3 | grep "red" | wc -l`
+	    LASTONE=`ls -t $GANRECDIR3 | sed 's/_/ /' | awk 'NR > 1 {next} {printf $1} '`
+	    # Beware of startup
+	    if [ -z $LASTONE ] ; then
+		LASTONE=`date +"%s"`
+	    fi
+	    TNOW=$(date +"%s")
+	    TSTALL=$(($TNOW - $LASTONE))
+	    if [ $NUMREDS -ge $MAXREDS ] ; then
+		echo restarting because of $NUMREDS red conditions
+		RESTART=1
+		ctdb_counter_incr "$service_name"
+		ctdb_check_counter "error" "-ge" "6" "$service_name"
+	    fi
+	    if [ $TSTALL -ge $MAXSTALL ] ; then
+		echo restarting because of $TSTALL second stall
+		RESTART=1
+		ctdb_counter_incr "$service_name"
+		ctdb_check_counter "error" "-ge" "6" "$service_name"
+	    fi
+	    if [ $RESTART -gt 0 ] ; then
+		startstop_ganesha restart
+	    else
+		ctdb_counter_init "$service_name"
+	    fi
+	fi
+}
+
+############################################################
+
 case "$1" in
      init)
 	# read statd from persistent database
@@ -131,8 +188,7 @@ case "$1" in
 
      monitor)
 	update_tickles 2049
-	create_ganesha_recdirs
-	service_name=${service_name}_process
+
 	# check that statd responds to rpc requests
 	# if statd is not running we try to restart it
 	# we only do this IF we have a rpc.statd command.
@@ -140,64 +196,18 @@ case "$1" in
         # the check completely
 	p="rpc.statd"
 	which $p >/dev/null 2>/dev/null && \
-	    nfs_check_rpc_service "statd" 1 \
+	    nfs_check_rpc_service "statd" \
 		-ge 6 "verbose unhealthy" \
 		-eq 4 "verbose restart" \
 		-eq 2 "restart:bs"
 
-	PIDFILE="/var/run/ganesha.pid"
-	CUR_STATE=`get_cluster_fs_state`
-	GANESHA="/usr/bin/$CTDB_CLUSTER_FILESYSTEM_TYPE.ganesha.nfsd"
-	if { read PID < $PIDFILE && \
-	    grep "$GANESHA" "/proc/$PID/cmdline" ; } >/dev/null 2>&1 ; then
-		ctdb_counter_init "$service_name"
-	else
-	    if [ $CUR_STATE = "active" ]; then
-		echo "Trying fast restart of NFS service"
-		startstop_ganesha restart
-		ctdb_counter_incr "$service_name"
-		ctdb_check_counter "error" "-ge" "6" "$service_name"
-	    fi
+	if [ "$CTDB_SKIP_GANESHA_NFSD_CHECK" != "yes" ] ; then
+	    monitor_ganesha_nfsd
 	fi
 
-	service_name="nfs-ganesha-$CTDB_CLUSTER_FILESYSTEM_TYPE"_service
-	# check that NFS is posting forward progress
-	if [ $CUR_STATE = "active" -a "$CTDB_NFS_SKIP_KNFSD_ALIVE_CHECK" != "yes" ] ; then
-	    MAXREDS=2
-	    MAXSTALL=120
-	    RESTART=0
-
-	    NUMREDS=`ls $GANRECDIR3 | grep "red" | wc -l`
-	    LASTONE=`ls -t $GANRECDIR3 | sed 's/_/ /' | awk 'NR > 1 {next} {printf $1} '`
-	    # Beware of startup
-	    if [ -z $LASTONE ] ; then
-		LASTONE=`date +"%s"`
-	    fi
-	    TNOW=$(date +"%s")
-	    TSTALL=$(($TNOW - $LASTONE))
-	    if [ $NUMREDS -ge $MAXREDS ] ; then
-		echo restarting because of $NUMREDS red conditions
-		RESTART=1
-		ctdb_counter_incr "$service_name"
-		ctdb_check_counter "error" "-ge" "6" "$service_name"
-	    fi
-	    if [ $TSTALL -ge $MAXSTALL ] ; then
-		echo restarting because of $TSTALL second stall
-		RESTART=1
-		ctdb_counter_incr "$service_name"
-		ctdb_check_counter "error" "-ge" "6" "$service_name"
-	    fi
-	    if [ $RESTART -gt 0 ] ; then
-		startstop_ganesha restart
-	    else
-		ctdb_counter_init "$service_name"
-	    fi
-	fi
-
-
 	# rquotad is sometimes not started correctly on RHEL5
 	# not a critical service so we dont flag the node as unhealthy
-	nfs_check_rpc_service "rquotad" 1 \
+	nfs_check_rpc_service "rquotad" \
 	    -gt 0 "verbose restart:b"
 
 	# Check that directories for shares actually exist.
diff --git a/config/functions b/config/functions
index d0d87ee..0679938 100755
--- a/config/functions
+++ b/config/functions
@@ -1241,7 +1241,9 @@ ctdb_replay_monitor_status ()
 	    ;;
 	*) : ;;  # Must be ERROR, do nothing special.
     esac
-    echo "$_err_out"
+    if [ -n "$_err_out" ] ; then
+	echo "$_err_out"
+    fi
     exit $_code
 }
 
diff --git a/doc/recovery-process.txt b/doc/recovery-process.txt
index 7cfc678..333eeb2 100644
--- a/doc/recovery-process.txt
+++ b/doc/recovery-process.txt
@@ -151,7 +151,7 @@ the recovery master also performs the following tests:
 16, Verify that all CONNECTED nodes in the cluster are in recovery mode NORMAL.
     If one of the nodes were in recovery mode ACTIVE, force a new recovery and restart
     monitoring from 1.
-    "Node:%u was in recovery mode. Restart recovery process"
+    "Node:%u was in recovery mode. Start recovery process"
 
 17, Verify that the filehandle to the recovery lock file is valid.
     If it is not, this may mean a split brain and is a critical error.
diff --git a/include/ctdb_private.h b/include/ctdb_private.h
index 05109ac..17b8933 100644
--- a/include/ctdb_private.h
+++ b/include/ctdb_private.h
@@ -725,6 +725,7 @@ struct ctdb_fetch_handle {
 /* internal prototypes */
 void ctdb_set_error(struct ctdb_context *ctdb, const char *fmt, ...) PRINTF_ATTRIBUTE(2,3);
 void ctdb_fatal(struct ctdb_context *ctdb, const char *msg);
+void ctdb_die(struct ctdb_context *ctdb, const char *msg);
 void ctdb_external_trace(void);
 bool ctdb_same_address(struct ctdb_address *a1, struct ctdb_address *a2);
 int ctdb_parse_address(struct ctdb_context *ctdb,
@@ -1493,7 +1494,8 @@ void ctdb_run_notification_script(struct ctdb_context *ctdb, const char *event);
 void ctdb_fault_setup(void);
 
 int verify_remote_ip_allocation(struct ctdb_context *ctdb, 
-				struct ctdb_all_public_ips *ips);
+				struct ctdb_all_public_ips *ips,
+				uint32_t pnn);
 int update_ip_assignment_tree(struct ctdb_context *ctdb,
 				struct ctdb_public_ip *ip);
 
diff --git a/server/ctdb_daemon.c b/server/ctdb_daemon.c
index 478962d..cc09346 100644
--- a/server/ctdb_daemon.c
+++ b/server/ctdb_daemon.c
@@ -1032,8 +1032,7 @@ static void ctdb_setup_event_callback(struct ctdb_context *ctdb, int status,
 				      void *private_data)
 {
 	if (status != 0) {
-		DEBUG(DEBUG_ALERT,("Failed to run setup event - exiting\n"));
-		exit(1);
+		ctdb_die(ctdb, "Failed to run setup event");
 	}
 	ctdb_run_notification_script(ctdb, "setup");
 
@@ -1216,7 +1215,7 @@ int ctdb_start_daemon(struct ctdb_context *ctdb, bool do_fork, bool use_syslog,
 	ctdb_set_runstate(ctdb, CTDB_RUNSTATE_INIT);
 	ret = ctdb_event_script(ctdb, CTDB_EVENT_INIT);
 	if (ret != 0) {
-		ctdb_fatal(ctdb, "Failed to run init event\n");
+		ctdb_die(ctdb, "Failed to run init event\n");
 	}
 	ctdb_run_notification_script(ctdb, "init");
 
diff --git a/server/ctdb_recoverd.c b/server/ctdb_recoverd.c
index 310c334..ece1491 100644
--- a/server/ctdb_recoverd.c
+++ b/server/ctdb_recoverd.c
@@ -1433,57 +1433,62 @@ static int ctdb_reload_remote_public_ips(struct ctdb_context *ctdb,
 	}
 
 	for (j=0; j<nodemap->num; j++) {
+		/* For readability */
+		struct ctdb_node *node = ctdb->nodes[j];
+
 		/* release any existing data */
-		if (ctdb->nodes[j]->known_public_ips) {
-			talloc_free(ctdb->nodes[j]->known_public_ips);
-			ctdb->nodes[j]->known_public_ips = NULL;
+		if (node->known_public_ips) {
+			talloc_free(node->known_public_ips);
+			node->known_public_ips = NULL;
 		}
-		if (ctdb->nodes[j]->available_public_ips) {
-			talloc_free(ctdb->nodes[j]->available_public_ips);
-			ctdb->nodes[j]->available_public_ips = NULL;
+		if (node->available_public_ips) {
+			talloc_free(node->available_public_ips);
+			node->available_public_ips = NULL;
 		}
 
 		if (nodemap->nodes[j].flags & NODE_FLAGS_INACTIVE) {
 			continue;
 		}
 
-		/* grab a new shiny list of public ips from the node */
+		/* Retrieve the list of known public IPs from the node */
 		ret = ctdb_ctrl_get_public_ips_flags(ctdb,
 					CONTROL_TIMEOUT(),
-					ctdb->nodes[j]->pnn,
+					node->pnn,
 					ctdb->nodes,
 					0,
-					&ctdb->nodes[j]->known_public_ips);
+					&node->known_public_ips);
 		if (ret != 0) {
-			DEBUG(DEBUG_ERR,("Failed to read known public ips from node : %u\n",
-				ctdb->nodes[j]->pnn));
+			DEBUG(DEBUG_ERR,
+			      ("Failed to read known public IPs from node: %u\n",
+			       node->pnn));
 			if (culprit) {
-				*culprit = ctdb->nodes[j]->pnn;
+				*culprit = node->pnn;
 			}
 			return -1;
 		}
 
-		if (ctdb->do_checkpublicip) {
-			if (rec->ip_check_disable_ctx == NULL) {
-				if (verify_remote_ip_allocation(ctdb, ctdb->nodes[j]->known_public_ips)) {
-					DEBUG(DEBUG_ERR,("Node %d has inconsistent public ip allocation and needs update.\n", ctdb->nodes[j]->pnn));
-					rec->need_takeover_run = true;
-				}
-			}
+		if (ctdb->do_checkpublicip &&
+		    (rec->ip_check_disable_ctx == NULL) &&
+		    verify_remote_ip_allocation(ctdb,
+						 node->known_public_ips,
+						 node->pnn)) {
+			DEBUG(DEBUG_ERR,("Trigger IP reallocation\n"));
+			rec->need_takeover_run = true;
 		}
 
-		/* grab a new shiny list of public ips from the node */
+		/* Retrieve the list of available public IPs from the node */
 		ret = ctdb_ctrl_get_public_ips_flags(ctdb,
 					CONTROL_TIMEOUT(),
-					ctdb->nodes[j]->pnn,
+					node->pnn,
 					ctdb->nodes,
 					CTDB_PUBLIC_IP_FLAGS_ONLY_AVAILABLE,
-					&ctdb->nodes[j]->available_public_ips);
+					&node->available_public_ips);
 		if (ret != 0) {
-			DEBUG(DEBUG_ERR,("Failed to read available public ips from node : %u\n",
-				ctdb->nodes[j]->pnn));
+			DEBUG(DEBUG_ERR,
+			      ("Failed to read available public IPs from node: %u\n",
+			       node->pnn));
 			if (culprit) {
-				*culprit = ctdb->nodes[j]->pnn;
+				*culprit = node->pnn;
 			}
 			return -1;
 		}
@@ -1843,9 +1848,7 @@ static int do_recovery(struct ctdb_recoverd *rec,
 
 	DEBUG(DEBUG_NOTICE, (__location__ " Recovery - disabled recovery mode\n"));
 
-	/*
-	  tell nodes to takeover their public IPs
-	 */
+	/* Fetch known/available public IPs from each active node */
 	ret = ctdb_reload_remote_public_ips(ctdb, rec, nodemap, &culprit);
 	if (ret != 0) {
 		DEBUG(DEBUG_ERR,("Failed to read public ips from remote node %d\n",
@@ -2728,7 +2731,7 @@ static void verify_recmode_normal_callback(struct ctdb_client_control_state *sta
 	   status field
 	*/
 	if (state->status != CTDB_RECOVERY_NORMAL) {
-		DEBUG(DEBUG_NOTICE, (__location__ " Node:%u was in recovery mode. Restart recovery process\n", state->c->hdr.destnode));
+		DEBUG(DEBUG_NOTICE, ("Node:%u was in recovery mode. Start recovery process\n", state->c->hdr.destnode));
 		rmdata->status = MONITOR_RECOVERY_NEEDED;
 	}
 
diff --git a/server/ctdb_takeover.c b/server/ctdb_takeover.c
index 401a8f3..1a15596 100644
--- a/server/ctdb_takeover.c
+++ b/server/ctdb_takeover.c
@@ -881,6 +881,14 @@ static void release_ip_callback(struct ctdb_context *ctdb, int status,
 		ctdb_ban_self(ctdb);
 	}
 
+	if (ctdb->do_checkpublicip && ctdb_sys_have_ip(state->addr)) {
+		DEBUG(DEBUG_ERR, ("IP %s still hosted during release IP callback, failing\n",
+				  ctdb_addr_to_str(state->addr)));
+		ctdb_request_control_reply(ctdb, state->c, NULL, -1, NULL);
+		talloc_free(state);
+		return;
+	}
+
 	/* send a message to all clients of this node telling them
 	   that the cluster has been reconfigured and they should
 	   release any sockets on this IP */
@@ -977,6 +985,21 @@ int32_t ctdb_control_release_ip(struct ctdb_context *ctdb,
 			DEBUG(DEBUG_ERR, ("Could not find which interface the ip address is hosted on. can not release it\n"));
 			return 0;
 		}
+		if (vnn->iface == NULL) {
+			DEBUG(DEBUG_WARNING,
+			      ("Public IP %s is hosted on interface %s but we have no VNN\n",
+			       ctdb_addr_to_str(&pip->addr),
+			       iface));
+		} else if (strcmp(iface, ctdb_vnn_iface_string(vnn)) != 0) {
+			DEBUG(DEBUG_WARNING,
+			      ("Public IP %s is hosted on inteterface %s but VNN says %s\n",
+			       ctdb_addr_to_str(&pip->addr),
+			       iface,
+			       ctdb_vnn_iface_string(vnn)));
+			/* Should we fix vnn->iface?  If we do, what
+			 * happens to reference counts?
+			 */
+		}
 	} else {
 		iface = strdup(ctdb_vnn_iface_string(vnn));
 	}
@@ -3188,6 +3211,7 @@ void ctdb_takeover_client_destructor_hook(struct ctdb_client *client)
 void ctdb_release_all_ips(struct ctdb_context *ctdb)
 {
 	struct ctdb_vnn *vnn;
+	int count = 0;
 
 	for (vnn=ctdb->vnn;vnn;vnn=vnn->next) {
 		if (!ctdb_sys_have_ip(&vnn->public_address)) {
@@ -3197,13 +3221,22 @@ void ctdb_release_all_ips(struct ctdb_context *ctdb)
 		if (!vnn->iface) {
 			continue;
 		}
+
+		DEBUG(DEBUG_INFO,("Release of IP %s/%u on interface %s node:-1\n",
+				    ctdb_addr_to_str(&vnn->public_address),
+				    vnn->public_netmask_bits,
+				    ctdb_vnn_iface_string(vnn)));
+
 		ctdb_event_script_args(ctdb, CTDB_EVENT_RELEASE_IP, "%s %s %u",
 				  ctdb_vnn_iface_string(vnn),
 				  ctdb_addr_to_str(&vnn->public_address),
 				  vnn->public_netmask_bits);
 		release_kill_clients(ctdb, &vnn->public_address);
 		ctdb_vnn_unassign_iface(ctdb, vnn);
+		count++;
 	}
+
+	DEBUG(DEBUG_NOTICE,(__location__ " Released %d public IPs\n", count));
 }
 
 
@@ -4234,7 +4267,9 @@ int32_t ctdb_control_ipreallocated(struct ctdb_context *ctdb,
    node has the expected ip allocation.
    This is verified against ctdb->ip_tree
 */
-int verify_remote_ip_allocation(struct ctdb_context *ctdb, struct ctdb_all_public_ips *ips)
+int verify_remote_ip_allocation(struct ctdb_context *ctdb,
+				struct ctdb_all_public_ips *ips,
+				uint32_t pnn)
 {
 	struct ctdb_public_ip_list *tmp_ip; 
 	int i;
@@ -4252,7 +4287,7 @@ int verify_remote_ip_allocation(struct ctdb_context *ctdb, struct ctdb_all_publi
 	for (i=0; i<ips->num; i++) {
 		tmp_ip = trbt_lookuparray32(ctdb->ip_tree, IP_KEYLEN, ip_key(&ips->ips[i].addr));
 		if (tmp_ip == NULL) {
-			DEBUG(DEBUG_ERR,(__location__ " Could not find host for address %s, reassign ips\n", ctdb_addr_to_str(&ips->ips[i].addr)));
+			DEBUG(DEBUG_ERR,("Node %u has new or unknown public IP %s\n", pnn, ctdb_addr_to_str(&ips->ips[i].addr)));
 			return -1;
 		}
 
@@ -4261,7 +4296,11 @@ int verify_remote_ip_allocation(struct ctdb_context *ctdb, struct ctdb_all_publi
 		}
 
 		if (tmp_ip->pnn != ips->ips[i].pnn) {
-			DEBUG(DEBUG_ERR,("Inconsistent ip allocation. Trigger reallocation. Thinks %s is held by node %u while it is held by node %u\n", ctdb_addr_to_str(&ips->ips[i].addr), ips->ips[i].pnn, tmp_ip->pnn));
+			DEBUG(DEBUG_ERR,
+			      ("Inconsistent IP allocation - node %u thinks %s is held by node %u while it is assigned to node %u\n",
+			       pnn,
+			       ctdb_addr_to_str(&ips->ips[i].addr),
+			       ips->ips[i].pnn, tmp_ip->pnn));
 			return -1;
 		}
 	}
@@ -4347,6 +4386,8 @@ static int ctdb_reloadips_child(struct ctdb_context *ctdb)
 	struct ctdb_vnn *vnn;
 	int i, ret;
 
+	CTDB_NO_MEMORY(ctdb, mem_ctx);
+
 	/* read the ip allocation from the local node */
 	ret = ctdb_ctrl_get_public_ips(ctdb, TAKEOVER_TIMEOUT(), CTDB_CURRENT_NODE, mem_ctx, &ips);
 	if (ret != 0) {
@@ -4361,7 +4402,7 @@ static int ctdb_reloadips_child(struct ctdb_context *ctdb)
 		DEBUG(DEBUG_ERR,("Failed to re-read public addresses file\n"));
 		talloc_free(mem_ctx);
 		return -1;
-	}		
+	}
 
 
 	/* check the previous list of ips and scan for ips that have been
@@ -4385,6 +4426,7 @@ static int ctdb_reloadips_child(struct ctdb_context *ctdb)
 
 			ret = ctdb_ctrl_del_public_ip(ctdb, TAKEOVER_TIMEOUT(), CTDB_CURRENT_NODE, &pub);
 			if (ret != 0) {
+				talloc_free(mem_ctx);
 				DEBUG(DEBUG_ERR, ("RELOADIPS: Unable to del public ip:%s from local node\n", ctdb_addr_to_str(&ips->ips[i].addr)));
 				return -1;
 			}
@@ -4400,15 +4442,15 @@ static int ctdb_reloadips_child(struct ctdb_context *ctdb)
 			}
 		}
 		if (i == ips->num) {
-			struct ctdb_control_ip_iface pub;
+			struct ctdb_control_ip_iface *pub;
 			const char *ifaces = NULL;
 			int iface = 0;
 
 			DEBUG(DEBUG_NOTICE,("RELOADIPS: New ip:%s found, adding it.\n", ctdb_addr_to_str(&vnn->public_address)));
 
-			pub.addr  = vnn->public_address;
-			pub.mask  = vnn->public_netmask_bits;
-
+			pub = talloc_zero(mem_ctx, struct ctdb_control_ip_iface);
+			pub->addr  = vnn->public_address;
+			pub->mask  = vnn->public_netmask_bits;


-- 
CTDB repository