[SCM] Samba Shared Repository - branch v4-16-test updated

Jule Anger janger at samba.org
Tue Feb 15 09:56:01 UTC 2022


The branch, v4-16-test has been updated
       via  79b42f0f2bf ctdb-tests: Add a test for stalled node triggering election
       via  f3047e90a86 ctdb-tests: Factor out functions to detect when generation changes
       via  d0133dd3a54 ctdb-recoverd: Consistently log start of election
       via  ddda97dc146 ctdb-recoverd: Always send unknown leader broadcast when starting election
       via  758e953ee07 ctdb-recoverd: Consistently have caller set election-in-progress
       via  07540a8cf45 ctdb-recoverd: Always cancel election in progress
      from  caa6785eff0 VERSION: Bump version up to Samba 4.16.0rc4...

https://git.samba.org/?p=samba.git;a=shortlog;h=v4-16-test


- Log -----------------------------------------------------------------
commit 79b42f0f2bfa539c66ca46adba8383e2465af783
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jan 23 07:08:02 2022 +1100

    ctdb-tests: Add a test for stalled node triggering election
    
    A stalled node probably continues to hold the cluster lock, so confirm
    elections work in this case.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    
    Autobuild-User(master): Amitay Isaacs <amitay at samba.org>
    Autobuild-Date(master): Mon Feb 14 02:46:01 UTC 2022 on sn-devel-184
    
    (cherry picked from commit 331c435ce520bef1274e076e6ed491400db3b5ad)
    
    Autobuild-User(v4-16-test): Jule Anger <janger at samba.org>
    Autobuild-Date(v4-16-test): Tue Feb 15 09:55:38 UTC 2022 on sn-devel-184
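
    The stall is simulated by stopping and resuming a process with
    signals.  As a rough, self-contained illustration of the technique
    (an ordinary sleep process stands in for ctdbd; no ctdb cluster or
    test framework is assumed):

        #!/usr/bin/env bash
        # Toy sketch: stall a process with SIGSTOP, wait, then resume
        # it with SIGCONT -- the same mechanism the new test applies
        # to ctdbd.

        set -e

        sleep 300 &             # stand-in for the daemon being stalled
        pid=$!

        echo "Stalling PID ${pid}"
        kill -STOP "$pid"

        sleep 5                 # real test: roughly 2 x "leader timeout"

        echo "Resuming PID ${pid}"
        kill -CONT "$pid"

        kill "$pid"             # clean up the stand-in process

    The actual test, cluster.030.node_stall_leader_timeout.sh in the
    diff below, applies this to ctdbd on a test node and then waits for
    the cluster generation to change.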

commit f3047e90a8653284f19ef7138ddbe9ada3b7a303
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jan 23 06:42:52 2022 +1100

    ctdb-tests: Factor out functions to detect when generation changes
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 265e44abc42e1f5b7fef6550cd748459dbef80cb)
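
    For context, the new helpers (generation_get and
    wait_until_generation_has_changed, added to integration.bash in the
    diff below) are used roughly as follows; this sketch assumes the
    ctdb integration test framework and mirrors how the updated
    cluster.015 and the new cluster.030 scripts call them:

        #!/usr/bin/env bash
        # Usage sketch only; requires the ctdb integration test
        # framework sourced below.

        . "${TEST_SCRIPTS_DIR}/integration.bash"

        set -e

        ctdb_test_init

        select_test_node

        # Record the current recovery generation on the test node.
        generation_get "$test_node"

        # ... perturb the cluster here (e.g. remove the recovery lock) ...

        # Block until an election/recovery has bumped the generation.
        wait_until_generation_has_changed "$test_node"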

commit d0133dd3a54acc29949e8351702b0996ba8d66c6
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jan 23 06:21:51 2022 +1100

    ctdb-recoverd: Consistently log start of election
    
    Elections should now be quite rare, so always log when one begins.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 0e74e03c9cf83d5dc2d97fa9f38ff8fbaa3d2685)

commit ddda97dc146179a035485219bca6af2338b360e9
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jan 23 06:18:51 2022 +1100

    ctdb-recoverd: Always send unknown leader broadcast when starting election
    
    This is currently missed when the cluster lock is lost.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit bf55a0117d045e8ca888f7e01591cc2a2bce9223)

commit 758e953ee07343e1e3fd0389eb2d82c0654be61c
Author: Martin Schwenke <martin at meltin.net>
Date:   Sun Jan 23 05:49:18 2022 +1100

    ctdb-recoverd: Consistently have caller set election-in-progress
    
    The problem here is that election-in-progress must be set to
    potentially avoid restarting the election broadcast timeout in
    main_loop(); leader_handler() already does this.
    
    Have force_election() set election-in-progress for all election types
    and do not bother setting it in cluster_lock_election().
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 9b3fab052bd2dccf2fc3fe9bd2b4354dff0b9ebb)

commit 07540a8cf4597f683e6661cc4418b858f59d7312
Author: Martin Schwenke <martin at meltin.net>
Date:   Fri Jan 21 18:09:47 2022 +1100

    ctdb-recoverd: Always cancel election in progress
    
    Election-in-progress is set by the unknown leader broadcast, so it
    needs to be cleared in all cases when an election completes.
    
    This was seen in a case where the leader node stalled, so it did not
    send leader broadcasts for some time.  The node continued to hold the
    cluster lock, so another node could not become leader.  However, after
    the node returned to normal it still did not send leader broadcasts
    because election-in-progress was never cleared.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=14958
    
    Signed-off-by: Martin Schwenke <martin at meltin.net>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 188a9021565bc2c1bec1d7a4830d6f47cdbc44a9)

-----------------------------------------------------------------------

Summary of changes:
 ctdb/server/ctdb_recoverd.c                        | 17 ++++----
 .../simple/cluster.015.reclock_remove_lock.sh      | 14 +------
 .../cluster.030.node_stall_leader_timeout.sh       | 48 ++++++++++++++++++++++
 ctdb/tests/scripts/integration.bash                | 44 ++++++++++++++++++++
 4 files changed, 103 insertions(+), 20 deletions(-)
 create mode 100755 ctdb/tests/INTEGRATION/simple/cluster.030.node_stall_leader_timeout.sh


Changeset truncated at 500 lines:

diff --git a/ctdb/server/ctdb_recoverd.c b/ctdb/server/ctdb_recoverd.c
index cc239959c56..03698ef2928 100644
--- a/ctdb/server/ctdb_recoverd.c
+++ b/ctdb/server/ctdb_recoverd.c
@@ -1836,7 +1836,7 @@ static void cluster_lock_election(struct ctdb_recoverd *rec)
 		if (cluster_lock_held(rec)) {
 			cluster_lock_release(rec);
 		}
-		return;
+		goto done;
 	}
 
 	/*
@@ -1844,11 +1844,10 @@ static void cluster_lock_election(struct ctdb_recoverd *rec)
 	 * attempt to retake it.  This provides stability.
 	 */
 	if (cluster_lock_held(rec)) {
-		return;
+		goto done;
 	}
 
 	rec->leader = CTDB_UNKNOWN_PNN;
-	rec->election_in_progress = true;
 
 	ok = cluster_lock_take(rec);
 	if (ok) {
@@ -1856,6 +1855,7 @@ static void cluster_lock_election(struct ctdb_recoverd *rec)
 		D_WARNING("Took cluster lock, leader=%"PRIu32"\n", rec->leader);
 	}
 
+done:
 	rec->election_in_progress = false;
 }
 
@@ -1867,7 +1867,7 @@ static void force_election(struct ctdb_recoverd *rec)
 	int ret;
 	struct ctdb_context *ctdb = rec->ctdb;
 
-	DEBUG(DEBUG_INFO,(__location__ " Force an election\n"));
+	D_ERR("Start election\n");
 
 	/* set all nodes to recovery mode to stop all internode traffic */
 	ret = set_recovery_mode(ctdb, rec, rec->nodemap, CTDB_RECOVERY_ACTIVE);
@@ -1876,13 +1876,16 @@ static void force_election(struct ctdb_recoverd *rec)
 		return;
 	}
 
+	rec->election_in_progress = true;
+	/* Let other nodes know that an election is underway */
+	leader_broadcast_send(rec, CTDB_UNKNOWN_PNN);
+
 	if (cluster_lock_enabled(rec)) {
 		cluster_lock_election(rec);
 		return;
 	}
 
 	talloc_free(rec->election_timeout);
-	rec->election_in_progress = true;
 	rec->election_timeout = tevent_add_timer(
 			ctdb->ev, ctdb,
 			fast_start ?
@@ -1975,10 +1978,8 @@ static void leader_broadcast_timeout_handler(struct tevent_context *ev,
 
 	rec->leader_broadcast_timeout_te = NULL;
 
-	/* Let other nodes know that an election is underway */
-	leader_broadcast_send(rec, CTDB_UNKNOWN_PNN);
+	D_NOTICE("Leader broadcast timeout\n");
 
-	D_NOTICE("Leader broadcast timeout. Force election\n");
 	force_election(rec);
 }
 
diff --git a/ctdb/tests/INTEGRATION/simple/cluster.015.reclock_remove_lock.sh b/ctdb/tests/INTEGRATION/simple/cluster.015.reclock_remove_lock.sh
index 35363d11f1d..2283c30edbf 100755
--- a/ctdb/tests/INTEGRATION/simple/cluster.015.reclock_remove_lock.sh
+++ b/ctdb/tests/INTEGRATION/simple/cluster.015.reclock_remove_lock.sh
@@ -56,24 +56,14 @@ echo
 
 leader_get "$test_node"
 
-echo "Get initial generation"
-ctdb_onnode "$test_node" status
-# shellcheck disable=SC2154
-# $outfile set by ctdb_onnode() above
-generation_init=$(sed -n -e 's/^Generation:\([0-9]*\)/\1/p' "$outfile")
-echo "Initial generation is ${generation_init}"
-echo
+generation_get
 
 echo "Remove recovery lock"
 rm "$reclock"
 echo
 
 # This will mean an election has taken place and a recovery has occured
-echo "Wait until generation changes"
-wait_until 30 generation_has_changed "$test_node" "$generation_init"
-echo
-echo "Generation changed to ${generation_new}"
-echo
+wait_until_generation_has_changed "$test_node"
 
 # shellcheck disable=SC2154
 # $leader set by leader_get() above
diff --git a/ctdb/tests/INTEGRATION/simple/cluster.030.node_stall_leader_timeout.sh b/ctdb/tests/INTEGRATION/simple/cluster.030.node_stall_leader_timeout.sh
new file mode 100755
index 00000000000..7bca58c222b
--- /dev/null
+++ b/ctdb/tests/INTEGRATION/simple/cluster.030.node_stall_leader_timeout.sh
@@ -0,0 +1,48 @@
+#!/usr/bin/env bash
+
+# Verify that nothing bad occurs if a node stalls and the leader
+# broadcast timeout triggers
+
+. "${TEST_SCRIPTS_DIR}/integration.bash"
+
+set -e
+
+ctdb_test_init
+
+select_test_node
+echo
+
+echo 'Get "leader timeout":'
+conf_tool="${CTDB_SCRIPTS_HELPER_BINDIR}/ctdb-config"
+# shellcheck disable=SC2154
+# $test_node set by select_test_node() above
+try_command_on_node "$test_node" "${conf_tool} get cluster 'leader timeout'"
+# shellcheck disable=SC2154
+# $out set by ctdb_onnode() above
+leader_timeout="$out"
+echo "Leader timeout is ${leader_timeout} seconds"
+echo
+
+# Assume leader timeout is reasonable and doesn't cause node to be
+# disconnected
+stall_time=$((leader_timeout * 2))
+
+generation_get "$test_node"
+
+echo "Get ctdbd PID on node ${test_node}..."
+ctdb_onnode -v "$test_node" "getpid"
+ctdbd_pid="$out"
+echo
+
+echo "Sending SIGSTOP to ctdbd on ${test_node}"
+try_command_on_node "$test_node" "kill -STOP ${ctdbd_pid}"
+
+sleep_for "$stall_time"
+
+echo "Sending SIGCONT to ctdbd on ${test_node}"
+try_command_on_node "$test_node" "kill -CONT ${ctdbd_pid}"
+echo
+
+wait_until_generation_has_changed "$test_node"
+
+cluster_is_healthy
diff --git a/ctdb/tests/scripts/integration.bash b/ctdb/tests/scripts/integration.bash
index 25ee4d945cc..eb3db1e1849 100644
--- a/ctdb/tests/scripts/integration.bash
+++ b/ctdb/tests/scripts/integration.bash
@@ -688,6 +688,50 @@ wait_until_leader_has_changed ()
 
 #######################################
 
+# sets: generation
+_generation_get ()
+{
+	local node="$1"
+
+	ctdb_onnode "$node" status
+	# shellcheck disable=SC2154
+	# $outfile set by ctdb_onnode() above
+	generation=$(sed -n -e 's/^Generation:\([0-9]*\)/\1/p' "$outfile")
+}
+
+generation_get ()
+{
+	local node="$1"
+
+	echo "Get generation"
+	_generation_get "$node"
+	echo "Generation is ${generation}"
+	echo
+}
+
+_generation_has_changed ()
+{
+	local node="$1"
+	local generation_old="$2"
+
+	_generation_get "$node"
+
+	[ "$generation" != "$generation_old" ]
+}
+
+# uses: generation
+wait_until_generation_has_changed ()
+{
+	local node="$1"
+
+	echo "Wait until generation changes..."
+	wait_until 30 _generation_has_changed "$node" "$generation"
+	echo "Generation changed to ${generation}"
+	echo
+}
+
+#######################################
+
 wait_for_monitor_event ()
 {
     local pnn="$1"


-- 
Samba Shared Repository


