[SCM] CTDB repository - branch master updated - ctdb-2.1-60-g06ac62f

Amitay Isaacs amitay at samba.org
Wed Apr 17 05:17:02 MDT 2013


The branch, master has been updated
       via  06ac62f890299021220214327f1b611c3cf00145 (commit)
       via  b1577a11d548479ff1a05702d106af9465921ad4 (commit)
       via  2438f3a4944f7adbcae4cc1b9d5452714244afe7 (commit)
       via  cad3107b12e8392f786f9a758ee38cf3a3d58538 (commit)
       via  feb1d40b21a160737aead22e398f3c34ff3be8de (commit)
      from  4c0cbfbe8b19f2e6fe17093b52c734bec63dd8b7 (commit)

http://gitweb.samba.org/?p=ctdb.git;a=shortlog;h=master


- Log -----------------------------------------------------------------
commit 06ac62f890299021220214327f1b611c3cf00145
Author: Michael Adam <obnox at samba.org>
Date:   Wed Apr 17 13:08:49 2013 +0200

    tests: add a comment to recovery db corruption test
    
    The comment explains why we use "ctdb stop" and "ctdb continue"
    even though we should really be using "ctdb setrecmasterrole off".
    
    Signed-off-by: Michael Adam <obnox at samba.org>

commit b1577a11d548479ff1a05702d106af9465921ad4
Author: Amitay Isaacs <amitay at gmail.com>
Date:   Thu Apr 11 16:59:36 2013 +1000

    tests: Add a test for subsequent recoveries corrupting databases
    
    Signed-off-by: Amitay Isaacs <amitay at gmail.com>
    Reviewed-by: Michael Adam <obnox at samba.org>

commit 2438f3a4944f7adbcae4cc1b9d5452714244afe7
Author: Amitay Isaacs <amitay at gmail.com>
Date:   Thu Apr 11 16:58:34 2013 +1000

    tests: Support waiting for "recovered" state in tests
    
    Signed-off-by: Amitay Isaacs <amitay at gmail.com>
    Reviewed-by: Michael Adam <obnox at samba.org>

commit cad3107b12e8392f786f9a758ee38cf3a3d58538
Author: Michael Adam <obnox at samba.org>
Date:   Wed Apr 3 12:02:59 2013 +0200

    ctdb_call: don't bump the rsn in ctdb_become_dmaster() any more
    
    This is now done in ctdb_ltdb_store_server(), so this
    extra bump can be spared.
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>

commit feb1d40b21a160737aead22e398f3c34ff3be8de
Author: Michael Adam <obnox at samba.org>
Date:   Wed Apr 3 11:40:25 2013 +0200

    Fix a severe recovery bug that can lead to data corruption for SMB clients.
    
    Problem:
    Recovery can, under certain circumstances, lead to old record copies
    being resurrected: recovery selects the newest record copy purely by
    RSN. At the end of the recovery, the recovery master is the dmaster
    for all records in all (non-persistent) databases, and the other
    nodes locally hold a complete copy of the databases. The bug is that
    the recovery process does not increment the RSN on the recovery
    master at the end of the recovery. Clients acting directly on the
    recovery master will change a record's content on the recmaster
    without migration and hence without an RSN bump. So a subsequent
    recovery cannot tell that the recmaster's copy is newer than the
    copies on the other nodes, since their RSNs are the same. Hence, if
    the recmaster is not node 0 (or, more precisely, not the active node
    with the lowest node number), the recovery will choose copies from
    nodes with a lower number and stick to these.
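
    To illustrate the selection logic, here is a minimal, hypothetical
    sketch (simplified names and structures, not CTDB's actual recovery
    code): for each key the copy with the highest RSN is kept, so two
    copies with equal RSNs are indistinguishable and the stale one from
    the lower-numbered node can win.

        #include <stdint.h>
        #include <stdio.h>

        struct record_copy {
            unsigned int node;   /* node the copy was pulled from */
            uint64_t rsn;        /* record sequence number */
            const char *data;    /* record content */
        };

        /* Keep the copy with the highest RSN; on a tie the copy seen
         * first (here: from the lower-numbered node) is kept. */
        static struct record_copy pick_newest(struct record_copy *c, int n)
        {
            struct record_copy best = c[0];
            int i;

            for (i = 1; i < n; i++) {
                if (c[i].rsn > best.rsn) {
                    best = c[i];
                }
            }
            return best;
        }

        int main(void)
        {
            /* The recmaster (node 1) changed the record locally without
             * a migration, so its newer content still carries the same
             * RSN as the stale copy on node 0. */
            struct record_copy copies[] = {
                { 0, 1, "value1" },   /* node 0: stale copy   */
                { 1, 1, "value2" },   /* node 1: current copy */
            };
            struct record_copy kept = pick_newest(copies, 2);

            printf("kept copy from node %u: %s\n", kept.node, kept.data);
            /* Prints the stale "value1" from node 0. */
            return 0;
        }

    With the fix below, the recmaster's local change bumps the RSN, so a
    selection like pick_newest() above would keep "value2" instead.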
    
    Here is how to reproduce:
    
    - assume we have a cluster with at least 2 nodes
    - ensure that the recmaster is not node 0
      (e.g. with "onnode 0 ctdb setrecmasterrole off");
      say the recmaster is node 1
    - choose a new database name, say "test1.tdb"
      (make sure it is not yet attached as persistent)
    - choose a key name, say "key1"
    - all cluster nodes should be OK and no recovery should be running
    - now do the following on node 1:
    
    1. dbwrap_tool test1.tdb store key1 uint32 1
    2. dbwrap_tool test1.tdb fetch key1 uint32
       ==> 1
    3. ctdb recover
    4. dbwrap_tool test1.tdb store key1 uint32 2
    5. dbwrap_tool test1.tdb fetch key1 uint32
       ==> 2
    6. ctdb recover
    7. dbwrap_tool test1.tdb fetch key1 uint32
       ==> 1
       ==> BUG
    
    This is a very severe bug: applied to Samba's locking.tdb database,
    it means that SMB clients on clustered Samba can lock themselves out
    of previously opened files or, even worse, suffer data corruption:
    
    Case 1: locking out
    
    - client on recmaster opens file
    - recovery propagates open file handle (entry in locking.tdb) to
      other nodes
    - client closes file
    - client opens the same file
    - recovery resurrects old copy of open file record in locking.tdb
      from lower node
    - client closes file but fails to delete entry in locking.tdb
    - client tries to open same file again but fails, since
      the old record locks it out (since the client is still connected)
    
    Case 2: data corruption
    
    - client1 on recmaster opens file
    - recovery propagates open file info to other nodes
    - client1 closes the file and disconnects
    - client2 opens the same file
    - recovery resurrects old copy of locking.tdb record,
      where client2 has no entry, but client1 has.
    - but client2 believes it still has a handle
    - client3 opens the file and succeeds without
      conflicting with client2
      (the detached entry for client1 is discarded because
       the server does not exist any more).
    => both client2 and client3 believe they have exclusive
      access to the file, and writing causes data corruption
    
    Fix:
    
    When storing a record on the dmaster, bump its RSN.
    
    ctdb_ltdb_store_server() is the central function for storing
    a record into a local tdb from the ctdbd server context.
    So this is also the place where the RSN of the record to be stored
    should be incremented when storing on the dmaster.
    
    For the case of record migration, this is currently done in
    ctdb_become_dmaster() in ctdb_call.c, but there are other places,
    such as recovery, where we should bump the RSN but currently don't.
    
    So moving the RSN increment into ctdb_ltdb_store_server() fixes
    the recovery record resurrection bug.
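
    A simplified sketch of the idea with hypothetical names (the real
    change is the server/ctdb_ltdb_server.c hunk in the diff below,
    which additionally skips the bump when read-only delegation flags
    are set):

        #include <stdint.h>

        struct rec_header {
            uint64_t rsn;       /* record sequence number */
            uint32_t dmaster;   /* current data master of the record */
        };

        /* Sketch of a store done from the ctdbd server context. */
        static void store_on_server(struct rec_header *header,
                                    uint32_t my_pnn, int persistent)
        {
            if (!persistent && my_pnn == header->dmaster) {
                /* We are the dmaster: every local content change gets
                 * a strictly higher RSN, so a later recovery can tell
                 * this copy is the newest one. */
                header->rsn++;
            }
            /* ... then write the header and data to the local tdb ... */
        }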
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>

-----------------------------------------------------------------------

Summary of changes:
 server/ctdb_call.c                  |    2 +-
 server/ctdb_ltdb_server.c           |    9 ++-
 tests/scripts/integration.bash      |    5 +-
 tests/simple/77_ctdb_db_recovery.sh |  133 +++++++++++++++++++++++++++++++++++
 4 files changed, 144 insertions(+), 5 deletions(-)
 create mode 100755 tests/simple/77_ctdb_db_recovery.sh


Changeset truncated at 500 lines:

diff --git a/server/ctdb_call.c b/server/ctdb_call.c
index 67219c4..a6c6389 100644
--- a/server/ctdb_call.c
+++ b/server/ctdb_call.c
@@ -343,7 +343,7 @@ static void ctdb_become_dmaster(struct ctdb_db_context *ctdb_db,
 	DEBUG(DEBUG_DEBUG,("pnn %u dmaster response %08x\n", ctdb->pnn, ctdb_hash(&key)));
 
 	ZERO_STRUCT(header);
-	header.rsn = rsn + 1;
+	header.rsn = rsn;
 	header.dmaster = ctdb->pnn;
 	header.flags = record_flags;
 
diff --git a/server/ctdb_ltdb_server.c b/server/ctdb_ltdb_server.c
index 0432e49..c5d9be1 100644
--- a/server/ctdb_ltdb_server.c
+++ b/server/ctdb_ltdb_server.c
@@ -125,12 +125,15 @@ static int ctdb_ltdb_store_server(struct ctdb_db_context *ctdb_db,
 	}
 
 	if (keep) {
-		if ((data.dsize == 0) &&
-		    !ctdb_db->persistent &&
+		if (!ctdb_db->persistent &&
 		    (ctdb_db->ctdb->pnn == header->dmaster) &&
 		    !(header->flags & (CTDB_REC_RO_HAVE_DELEGATIONS|CTDB_REC_RO_HAVE_READONLY|CTDB_REC_RO_REVOKING_READONLY|CTDB_REC_RO_REVOKE_COMPLETE)))
 		{
-			schedule_for_deletion = true;
+			header->rsn++;
+
+			if (data.dsize == 0) {
+				schedule_for_deletion = true;
+			}
 		}
 		remove_from_delete_queue = !schedule_for_deletion;
 	}
diff --git a/tests/scripts/integration.bash b/tests/scripts/integration.bash
index 2e5fb37..2ec827c 100644
--- a/tests/scripts/integration.bash
+++ b/tests/scripts/integration.bash
@@ -370,7 +370,7 @@ node_has_status ()
     local pnn="$1"
     local status="$2"
 
-    local bits fpat mpat
+    local bits fpat mpat rpat
     case "$status" in
 	(unhealthy)    bits="?:?:?:1:*" ;;
 	(healthy)      bits="?:?:?:0:*" ;;
@@ -386,6 +386,7 @@ node_has_status ()
 	(unfrozen)     fpat='^[[:space:]]+frozen[[:space:]]+0$' ;;
 	(monon)        mpat='^Monitoring mode:ACTIVE \(0\)$' ;;
 	(monoff)       mpat='^Monitoring mode:DISABLED \(1\)$' ;;
+	(recovered)    rpat='^Recovery mode:NORMAL \(0\)$' ;;
 	*)
 	    echo "node_has_status: unknown status \"$status\""
 	    return 1
@@ -410,6 +411,8 @@ node_has_status ()
 	$CTDB statistics -n "$pnn" | egrep -q "$fpat"
     elif [ -n "$mpat" ] ; then
 	$CTDB getmonmode -n "$pnn" | egrep -q "$mpat"
+    elif [ -n "$rpat" ] ; then
+        $CTDB status -n "$pnn" | egrep -q "$rpat"
     else
 	echo 'node_has_status: unknown mode, neither $bits nor $fpat is set'
 	return 1
diff --git a/tests/simple/77_ctdb_db_recovery.sh b/tests/simple/77_ctdb_db_recovery.sh
new file mode 100755
index 0000000..00fa096
--- /dev/null
+++ b/tests/simple/77_ctdb_db_recovery.sh
@@ -0,0 +1,133 @@
+#!/bin/bash
+
+test_info()
+{
+    cat <<EOF
+Recovery can under certain circumstances lead to old record copies
+resurrecting: Recovery selects the newest record copy purely by RSN. At
+the end of the recovery, the recovery master is the dmaster for all
+records in all (non-persistent) databases. And the other nodes locally
+hold the complete copy of the databases. The bug is that the recovery
+process does not increment the RSN on the recovery master at the end of
+the recovery. Now clients acting directly on the Recovery master will
+directly change a record's content on the recmaster without migration
+and hence without RSN bump.  So a subsequent recovery can not tell that
+the recmaster's copy is newer than the copies on the other nodes, since
+their RSN is the same. Hence, if the recmaster is not node 0 (or more
+precisely not the active node with the lowest node number), the recovery
+will choose copies from nodes with lower number and stick to these.
+
+Steps:
+
+1. Create a test database
+2. Add a record with value value1 on recovery master
+3. Force a recovery
+4. Update the record with value value2 on recovery master
+5. Force a recovery
+6. Fetch the record
+
+Expected results:
+
+* The record should have value value2 and not value1
+
+EOF
+}
+
+. "${TEST_SCRIPTS_DIR}/integration.bash"
+
+ctdb_test_init "$@"
+
+set -e
+
+cluster_is_healthy
+
+# Reset configuration
+ctdb_restart_when_done
+
+#
+# Main test
+#
+TESTDB="rec_test.tdb"
+
+status=0
+
+# Make sure node 0 is not the recovery master
+echo "find out which node is recmaster"
+try_command_on_node -q any $CTDB_TEST_WRAPPER ctdb recmaster
+recmaster="$out"
+if [ "$recmaster" = "0" ]; then
+    echo "node 0 is recmaster, disable recmasterrole on node 0"
+    #
+    # Note:
+    # It should be sufficient to run "ctdb setrecmasterrole off"
+    # on node 0 and wait for election and recovery to finish.
+    # But there were problems related to this in this automatic
+    # test, so for now use "ctdb stop" and "ctdb continue".
+    #
+    echo "stop node 0"
+    try_command_on_node -q 0 $CTDB_TEST_WRAPPER ctdb stop
+    wait_until_node_has_status 0 stopped
+    echo "continue node 0"
+    try_command_on_node -q 0 $CTDB_TEST_WRAPPER ctdb continue
+    wait_until_node_has_status 0 notstopped
+
+    try_command_on_node -q any $CTDB_TEST_WRAPPER ctdb recmaster
+    recmaster="$out"
+    if [ "$recmaster" = "0" ]; then
+	echo "failed to move recmaster to different node"
+	exit 1
+    fi
+fi
+
+echo "Recmaster:$recmaster"
+
+# Create a temporary non-persistent database to test with
+echo "create test database $TESTDB"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb attach $TESTDB
+
+# Wipe Test database
+echo "wipe test database"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb wipedb $TESTDB
+
+# Add a record   key=test1 data=value1
+echo "store key(test1) data(value1)"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb writekey $TESTDB test1 value1
+
+# Fetch a record   key=test1
+echo "read key(test1)"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb readkey $TESTDB test1
+echo "$out"
+
+# Do a recovery
+echo "force recovery"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb recover
+
+wait_until_node_has_status $recmaster recovered
+
+# Add a record   key=test1 data=value2
+echo "store key(test1) data(value2)"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb writekey $TESTDB test1 value2
+
+# Fetch a record   key=test1
+echo "read key(test1)"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb readkey $TESTDB test1
+echo "$out"
+
+# Do a recovery
+echo "force recovery"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb recover
+
+wait_until_node_has_status $recmaster recovered
+
+# Verify record   key=test1
+echo "read key(test1)"
+try_command_on_node $recmaster $CTDB_TEST_WRAPPER ctdb readkey $TESTDB test1
+echo "$out"
+if [ "$out" = "Data: size:6 ptr:[value2]" ]; then
+	echo "GOOD: Recovery did not corrupt database"
+else
+	echo "BAD: Recovery corrupted database"
+	status=1
+fi
+
+exit $status


-- 
CTDB repository

