[SCM] Samba Shared Repository - branch v4-8-test updated

Thu Sep 20 12:06:02 UTC 2018

The branch, v4-8-test has been updated
       via  189697a ctdb-recoverd: Set recovery lock handle at start of attempt
       via  21e4884 ctdb-recoverd: Handle cancellation when releasing recovery lock
       via  da9bb48 ctdb-recoverd: Return early when the recovery lock is not held
       via  72a8c69 ctdb-recoverd: Store recovery lock handle
       via  9745524 ctdb-recoverd: Use talloc() to allocate recovery lock handle
       via  a4c4386 ctdb-recoverd: Rename hold_reclock_state to ctdb_recovery_lock_handle
       via  9b1cc7a ctdb-recoverd: Re-check master on failure to take recovery lock
       via  43c1ad1 ctdb-recoverd: Clean up taking of recovery lock
       via  eb498ec ctdb-cluster-mutex: Block signals around fork
       via  1954a94 ctdb-cluster-mutex: Reset SIGTERM handler in cluster mutex child
      from  b29d90f wafsamba: Fix 'make -j<jobs>'

https://git.samba.org/?p=samba.git;a=shortlog;h=v4-8-test


- Log -----------------------------------------------------------------
commit 189697a98bac80d2d815193f8c87e632e4923448
Author: Martin Schwenke <martin at meltin.net>
Date:   Mon Sep 3 13:30:57 2018 +1000

    ctdb-recoverd: Set recovery lock handle at start of attempt
    
    This allows the attempt to be cancelled if an election is lost and an
    unlock is done before the attempt is completed.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    
    Autobuild-User(master): Martin Schwenke <martins at samba.org>
    Autobuild-Date(master): Tue Sep 18 02:18:30 CEST 2018 on sn-devel-144
    
    (cherry picked from commit 486022ef8f43251258f255ffa15f1a01bc6aa2b7)
    
    Autobuild-User(v4-8-test): Karolin Seeger <kseeger at samba.org>
    Autobuild-Date(v4-8-test): Thu Sep 20 14:05:59 CEST 2018 on sn-devel-144

commit 21e4884a1a3b25ee5fdf399b3d33bba3af004c99
Author: Martin Schwenke <martin at meltin.net>
Date:   Mon Sep 3 13:01:19 2018 +1000

    ctdb-recoverd: Handle cancellation when releasing recovery lock
    
    If the recovery lock is in the process of being taken then free the
    cluster mutex handle but leave the recovery lock handle in place.
    This allows ctdb_recovery_lock() to fail.
    
    Note that this isn't yet live because rec->recovery_lock_handle is
    still only set at the completion of the attempt to take the lock.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit b1dc5687844e90b0e3c39cb46a1116c86118fbf4)

commit da9bb48a459644801deac2514fc00c9c639c2ef5
Author: Martin Schwenke <martin at meltin.net>
Date:   Tue Sep 11 15:05:19 2018 +1000

    ctdb-recoverd: Return early when the recovery lock is not held
    
    This makes upcoming changes simpler.
    
    Update to modern debug macro while touching relevant line.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit a755d060c13b65dfb6d73979aaf111c489882bfb)

commit 72a8c69935a7a5380689abd9a98df81291f6c812
Author: Martin Schwenke <martin at meltin.net>
Date:   Mon Sep 3 12:39:32 2018 +1000

    ctdb-recoverd: Store recovery lock handle
    
    ... not just cluster mutex handle.
    
    This makes the recovery lock handle long-lived and with allow the
    releasing code to cancel an in-progress attempt to take the recovery
    lock.
    
    The cluster mutex handle is now allocated off the recovery lock
    handle.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit c52216740bd81b68876de06e104822bbbca86df9)

commit 9745524234ab57aa8a895fb90a5dad35d9064ce4
Author: Martin Schwenke <martin at meltin.net>
Date:   Mon Sep 3 11:43:44 2018 +1000

    ctdb-recoverd: Use talloc() to allocate recovery lock handle
    
    At the moment this is still local and is freed after the mutex is
    successfully taken.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit a53b264aee7d620ee8ecf9114b0014c5bb678484)

commit a4c438601a623d2b47023621a9f0e341135192d4
Author: Martin Schwenke <martin at meltin.net>
Date:   Mon Sep 3 11:30:06 2018 +1000

    ctdb-recoverd: Rename hold_reclock_state to ctdb_recovery_lock_handle
    
    This will be a longer lived structure.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit af22f03dbe9040f5f743eb85bb50d411269bbab4)

commit 9b1cc7afee8181844da94faeabcedc2140d10b89
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Sep 9 08:30:50 2018 +1000

    ctdb-recoverd: Re-check master on failure to take recovery lock
    
    If the master changed while trying to take the lock then fail gracefully.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit c516e58ce92c420dc993bd9b7f1433641bd764bd)

commit 43c1ad1537debb538bfa7d304584ddb84344a379
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Sep 9 08:27:46 2018 +1000

    ctdb-recoverd: Clean up taking of recovery lock
    
    No functional changes, just coding style cleanups and debug message
    tweaks.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 59fc01646c7d65ba90b0a1a34c3795ff842351c5)

commit eb498ec2baea154bdba313fc495a9dc7e0322f77
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Sep 12 17:51:47 2018 +1000

    ctdb-cluster-mutex: Block signals around fork
    
    If SIGTERM is received and the tevent signal handler setup in the
    recovery daemon is still enabled then the signal is handled and a
    corresponding event is queued.  The child never runs an event loop so
    the signal is effectively ignored.
    
    Resetting the SIGTERM handler isn't enough.  A signal can arrive
    before that.
    
    Block SIGTERM before forking and then immediately unblock it in the
    parent.
    
    In the child, unblock SIGTERM after the signal handler is reset.  An
    explicit unblock is needed because according to sigprocmask(2) "the
    signal mask is preserved across execve(2)".
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit e789d0da57fc3fc6d22bfa00577a2e65034ca27a)

commit 1954a94203789422b163b4f257aa3c9ab83fa9a2
Author: Martin Schwenke <martin at meltin.net>
Date:   Wed Sep 12 14:18:00 2018 +1000

    ctdb-cluster-mutex: Reset SIGTERM handler in cluster mutex child
    
    If SIGTERM is received and the tevent signal handler setup in the
    recovery daemon is still enabled then the signal is handled and a
    corresponding event is queued.  The child never runs an event loop so
    the signal is effectively ignored.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=13617
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 5a6b139884f08ee2ee10f9d16fe56ad8fb5352a6)

-----------------------------------------------------------------------

Summary of changes:
 ctdb/server/ctdb_cluster_mutex.c |  32 +++++++++++
 ctdb/server/ctdb_recoverd.c      | 120 +++++++++++++++++++++++++++------------
 2 files changed, 115 insertions(+), 37 deletions(-)


Changeset truncated at 500 lines:

diff --git a/ctdb/server/ctdb_cluster_mutex.c b/ctdb/server/ctdb_cluster_mutex.c
index 804c6d5..330d5fd 100644
--- a/ctdb/server/ctdb_cluster_mutex.c
+++ b/ctdb/server/ctdb_cluster_mutex.c
@@ -196,6 +196,7 @@ ctdb_cluster_mutex(TALLOC_CTX *mem_ctx,
 {
 	struct ctdb_cluster_mutex_handle *h;
 	char **args;
+	sigset_t sigset_term;
 	int ret;
 
 	h = talloc(mem_ctx, struct ctdb_cluster_mutex_handle);
@@ -225,15 +226,41 @@ ctdb_cluster_mutex(TALLOC_CTX *mem_ctx,
 		return NULL;
 	}
 
+	sigemptyset(&sigset_term);
+	sigaddset(&sigset_term, SIGTERM);
+	ret = sigprocmask(SIG_BLOCK, &sigset_term, NULL);
+	if (ret != 0) {
+		DBG_WARNING("Failed to block SIGTERM (%d)\n", errno);
+	}
+
 	h->child = ctdb_fork(ctdb);
 	if (h->child == (pid_t)-1) {
 		close(h->fd[0]);
 		close(h->fd[1]);
 		talloc_free(h);
+		ret = sigprocmask(SIG_UNBLOCK, &sigset_term, NULL);
+		if (ret != 0) {
+			DBG_WARNING("Failed to unblock SIGTERM (%d)\n", errno);
+		}
 		return NULL;
 	}
 
 	if (h->child == 0) {
+		struct sigaction sa = {
+			.sa_handler = SIG_DFL,
+		};
+
+		ret = sigaction(SIGTERM, &sa, NULL);
+		if (ret != 0) {
+			DBG_WARNING("Failed to reset signal handler (%d)\n",
+				    errno);
+		}
+
+		ret = sigprocmask(SIG_UNBLOCK, &sigset_term, NULL);
+		if (ret != 0) {
+			DBG_WARNING("Failed to unblock SIGTERM (%d)\n", errno);
+		}
+
 		/* Make stdout point to the pipe */
 		close(STDOUT_FILENO);
 		dup2(h->fd[1], STDOUT_FILENO);
@@ -248,6 +275,11 @@ ctdb_cluster_mutex(TALLOC_CTX *mem_ctx,
 
 	/* Parent */
 
+	ret = sigprocmask(SIG_UNBLOCK, &sigset_term, NULL);
+	if (ret != 0) {
+		DBG_WARNING("Failed to unblock SIGTERM (%d)\n", errno);
+	}
+
 	DEBUG(DEBUG_DEBUG, (__location__ " Created PIPE FD:%d\n", h->fd[0]));
 	set_close_on_exec(h->fd[0]);
 
diff --git a/ctdb/server/ctdb_recoverd.c b/ctdb/server/ctdb_recoverd.c
index 2b94fed..62e4c46 100644
--- a/ctdb/server/ctdb_recoverd.c
+++ b/ctdb/server/ctdb_recoverd.c
@@ -237,6 +237,8 @@ struct ctdb_banning_state {
 	struct timeval last_reported_time;
 };
 
+struct ctdb_recovery_lock_handle;
+
 /*
   private state of recovery daemon
  */
@@ -258,7 +260,7 @@ struct ctdb_recoverd {
 	uint32_t *force_rebalance_nodes;
 	struct ctdb_node_capabilities *caps;
 	bool frozen_on_inactive;
-	struct ctdb_cluster_mutex_handle *recovery_lock_handle;
+	struct ctdb_recovery_lock_handle *recovery_lock_handle;
 };
 
 #define CONTROL_TIMEOUT() timeval_current_ofs(ctdb->tunable.recover_timeout, 0)
@@ -879,18 +881,19 @@ static bool ctdb_recovery_have_lock(struct ctdb_recoverd *rec)
 	return (rec->recovery_lock_handle != NULL);
 }
 
-struct hold_reclock_state {
+struct ctdb_recovery_lock_handle {
 	bool done;
 	bool locked;
 	double latency;
+	struct ctdb_cluster_mutex_handle *h;
 };
 
 static void take_reclock_handler(char status,
 				 double latency,
 				 void *private_data)
 {
-	struct hold_reclock_state *s =
-		(struct hold_reclock_state *) private_data;
+	struct ctdb_recovery_lock_handle *s =
+		(struct ctdb_recovery_lock_handle *) private_data;
 
 	switch (status) {
 	case '0':
@@ -930,41 +933,68 @@ static bool ctdb_recovery_lock(struct ctdb_recoverd *rec)
 {
 	struct ctdb_context *ctdb = rec->ctdb;
 	struct ctdb_cluster_mutex_handle *h;
-	struct hold_reclock_state s = {
-		.done = false,
-		.locked = false,
-		.latency = 0,
+	struct ctdb_recovery_lock_handle *s;
+
+	s = talloc_zero(rec, struct ctdb_recovery_lock_handle);
+	if (s == NULL) {
+		DBG_ERR("Memory allocation error\n");
+		return false;
 	};
 
-	h = ctdb_cluster_mutex(rec, ctdb, ctdb->recovery_lock, 0,
-			       take_reclock_handler, &s,
-			       lost_reclock_handler, rec);
+	h = ctdb_cluster_mutex(s,
+			       ctdb,
+			       ctdb->recovery_lock,
+			       0,
+			       take_reclock_handler,
+			       s,
+			       lost_reclock_handler,
+			       rec);
 	if (h == NULL) {
+		talloc_free(s);
 		return false;
 	}
 
-	while (!s.done) {
+	rec->recovery_lock_handle = s;
+	s->h = h;
+
+	while (! s->done) {
 		tevent_loop_once(ctdb->ev);
 	}
 
-	if (! s.locked) {
-		talloc_free(h);
+	if (! s->locked) {
+		TALLOC_FREE(rec->recovery_lock_handle);
 		return false;
 	}
 
-	rec->recovery_lock_handle = h;
-	ctdb_ctrl_report_recd_lock_latency(ctdb, CONTROL_TIMEOUT(),
-					   s.latency);
+	ctdb_ctrl_report_recd_lock_latency(ctdb,
+					   CONTROL_TIMEOUT(),
+					   s->latency);
 
 	return true;
 }
 
 static void ctdb_recovery_unlock(struct ctdb_recoverd *rec)
 {
-	if (rec->recovery_lock_handle != NULL) {
-		DEBUG(DEBUG_NOTICE, ("Releasing recovery lock\n"));
-		TALLOC_FREE(rec->recovery_lock_handle);
+	if (rec->recovery_lock_handle == NULL) {
+		return;
+	}
+
+	if (! rec->recovery_lock_handle->done) {
+		/*
+		 * Taking of recovery lock still in progress.  Free
+		 * the cluster mutex handle to release it but leave
+		 * the recovery lock handle in place to allow taking
+		 * of the lock to fail.
+		 */
+		D_NOTICE("Cancelling recovery lock\n");
+		TALLOC_FREE(rec->recovery_lock_handle->h);
+		rec->recovery_lock_handle->done = true;
+		rec->recovery_lock_handle->locked = false;
+		return;
 	}
+
+	D_NOTICE("Releasing recovery lock\n");
+	TALLOC_FREE(rec->recovery_lock_handle);
 }
 
 static void ban_misbehaving_nodes(struct ctdb_recoverd *rec, bool *self_ban)
@@ -1305,31 +1335,47 @@ static int do_recovery(struct ctdb_recoverd *rec,
 		goto fail;
 	}
 
-        if (ctdb->recovery_lock != NULL) {
+	if (ctdb->recovery_lock != NULL) {
 		if (ctdb_recovery_have_lock(rec)) {
-			DEBUG(DEBUG_NOTICE, ("Already holding recovery lock\n"));
+			D_NOTICE("Already holding recovery lock\n");
 		} else {
-			DEBUG(DEBUG_NOTICE, ("Attempting to take recovery lock (%s)\n",
-					     ctdb->recovery_lock));
-			if (!ctdb_recovery_lock(rec)) {
-				if (ctdb->runstate == CTDB_RUNSTATE_FIRST_RECOVERY) {
-					/* If ctdb is trying first recovery, it's
-					 * possible that current node does not know
-					 * yet who the recmaster is.
+			bool ok;
+
+			D_NOTICE("Attempting to take recovery lock (%s)\n",
+				 ctdb->recovery_lock);
+
+			ok = ctdb_recovery_lock(rec);
+			if (! ok) {
+				D_ERR("Unable to take recovery lock\n");
+
+				if (pnn != rec->recmaster) {
+					D_NOTICE("Recovery master changed to %u,"
+						 " aborting recovery\n",
+						 rec->recmaster);
+					rec->need_recovery = false;
+					goto fail;
+				}
+
+				if (ctdb->runstate ==
+				    CTDB_RUNSTATE_FIRST_RECOVERY) {
+					/*
+					 * First recovery?  Perhaps
+					 * current node does not yet
+					 * know who the recmaster is.
 					 */
-					DEBUG(DEBUG_ERR, ("Unable to get recovery lock"
-							  " - retrying recovery\n"));
+					D_ERR("Retrying recovery\n");
 					goto fail;
 				}
 
-				DEBUG(DEBUG_ERR,("Unable to get recovery lock - aborting recovery "
-						 "and ban ourself for %u seconds\n",
-						 ctdb->tunable.recovery_ban_period));
-				ctdb_ban_node(rec, pnn, ctdb->tunable.recovery_ban_period);
+				D_ERR("Abort recovery, "
+				      "ban this node for %u seconds\n",
+				      ctdb->tunable.recovery_ban_period);
+				ctdb_ban_node(rec,
+					      pnn,
+					      ctdb->tunable.recovery_ban_period);
 				goto fail;
 			}
-			DEBUG(DEBUG_NOTICE,
-			      ("Recovery lock taken successfully by recovery daemon\n"));
+			D_NOTICE("Recovery lock taken successfully\n");
 		}
 	}
 


-- 
Samba Shared Repository