[SCM] CTDB repository - branch 1.0.114 updated - ctdb-1.0.114.5-23-g527adf2

Thu May 2 09:40:24 MDT 2013

The branch, 1.0.114 has been updated
       via  527adf2f9a809d1d4ebc5d7c655496a510494098 (commit)
       via  9e67fbbe1cba8f3126897e25b12dfc2c6020b0bf (commit)
       via  dc509a9087b0b03d9755839f93fcff4781618cfe (commit)
       via  8ff41568a0ea666744c72fd772db5c9f704ad61d (commit)
       via  64d75afca94d0a59a8d112f2e8d7130c23e5487c (commit)
       via  0adfa7454fcd1bd17108ef4ec43454f5466a2f19 (commit)
       via  640135d72e08480a433b34d901a2af4c300b7709 (commit)
       via  262935daa73e38d157513dd4351b7fc8caff6405 (commit)
       via  3b53c943a0d7d72068d0eff582964b5f23c22629 (commit)
       via  3c9b6e11b2051270bd02f15d06df36dc6151f2f8 (commit)
       via  734880e16aa33a337c64e3f32e92a114f7ab4196 (commit)
       via  73643aa5878c3bb8dbd171da0d23f71ac3ac897e (commit)
       via  eec5648841efbf68a58109d4f649a2827a7401ff (commit)
       via  2b0d7cb9a7dd0c154339cea71e8a6e23b8cc8fea (commit)
       via  152c23b3891a90e7a608922f41f23a9f2ca55df2 (commit)
       via  697a2711d2086760687ea3d2e4d13957a524da9d (commit)
       via  e1c1ee5a091185e8318c6a095019aaea60b44789 (commit)
       via  d7bc3313e03e180731aaef688d03daae1b027d79 (commit)
       via  8fae01c6fa1f9d8ba2996f5a22df4811052d3a22 (commit)
      from  d85f7f14572924ed45127964723f0924c3c20400 (commit)

http://gitweb.samba.org/?p=ctdb.git;a=shortlog;h=1.0.114


- Log -----------------------------------------------------------------
commit 527adf2f9a809d1d4ebc5d7c655496a510494098
Author: Michael Adam <obnox at samba.org>
Date:   Fri Apr 26 17:22:16 2013 +0200

    New version 1.0.114.6

commit 9e67fbbe1cba8f3126897e25b12dfc2c6020b0bf
Author: Michael Adam <obnox at samba.org>
Date:   Fri Feb 22 16:12:17 2013 +0100

    vacuum: Update (C)
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 61264debba58355b9716ac1637fdedef5ed249c8)

commit dc509a9087b0b03d9755839f93fcff4781618cfe
Author: Michael Adam <obnox at samba.org>
Date:   Sat Dec 29 17:23:27 2012 +0100

    vacuum: extend the header comment for ctdb_process_delete_list()
    
    Describe the (new) process more precisely.
    And mention that it is the last step of the vacuuming process
    that is performed on the lmaster.
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 06de786c786f1cab4c6721adf47c2cb1e8a72adb)

commit 8ff41568a0ea666744c72fd772db5c9f704ad61d
Author: Michael Adam <obnox at samba.org>
Date:   Sat Jan 5 01:20:18 2013 +0100

    vacuum: turn the vacuuming on lmaster into a three-phase process.
    
    More precisely, before locally deleting an empty record, that has been
    migrated with data and that we are dmaster and lmaster for, we now perform
    the deletion on the other nodes in two steps instead of a single step.
    
    - First send out the list of records to be deleted to all
      other nodes with the new RECEIVE_RECORDS control to store
      the lmaster's current empty copy.
    - Then send those records that could be deleted on all nodes
      to all nodes again with the TRY_DELETE_RECORDS control
      as before for deletion.
    - Finally delete those records locally that were successfully
      deleted remotely in the previous step.
    
    This fixes an old race where a recovery that hits the vacuum process
    square between the eyes can create gaps in the record's history and
    hence let the records resurrect. In the case of the locking.tdb,
    that could mean that a file that was already closed, was recorded as
    being open and locked again, so samba clients were locked out of that
    file until samba was restarted.
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit eee23d44b6427be8ab49bbfcee3abb62f37dfcc7)
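
The three phases described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not the code from server/ctdb_vacuum.c: the toy_* names, the fixed cluster size and the simplified accept/reject rules are all hypothetical, and locking, marshalling and the lmaster's own RSN bump are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NODES 3 /* hypothetical cluster size */
#define TOY_KEYS  4 /* hypothetical number of record slots */

/* Toy stand-in for one node's copy of a record (not a ctdb struct). */
struct toy_record {
	bool present;  /* node holds a copy */
	bool empty;    /* copy carries no data, only a header */
	uint64_t rsn;  /* record sequence number */
};

struct toy_node {
	struct toy_record recs[TOY_KEYS];
};

/* Phase 1, remote side: accept the lmaster's empty copy unless ours is newer. */
static bool toy_receive_record(struct toy_node *n, int key, uint64_t rsn)
{
	struct toy_record *r = &n->recs[key];

	if (r->present && r->rsn > rsn) {
		return false;
	}
	r->present = true;
	r->empty = true;
	r->rsn = rsn;
	return true;
}

/* Phase 2, remote side: delete the copy if it is still empty and not newer. */
static bool toy_try_delete(struct toy_node *n, int key, uint64_t rsn)
{
	struct toy_record *r = &n->recs[key];

	if (!r->present) {
		return true;
	}
	if (!r->empty || r->rsn > rsn) {
		return false;
	}
	r->present = false;
	return true;
}

/* Run all three phases for one key; true if the record is gone everywhere. */
static bool toy_vacuum_record(struct toy_node nodes[TOY_NODES], int lmaster,
			      int key)
{
	struct toy_record *local = &nodes[lmaster].recs[key];
	bool ok = true;
	int i;

	/* Only empty records held by the lmaster are candidates. */
	assert(local->present && local->empty);

	/* Phase 1: roll out the empty copy to all other nodes. */
	for (i = 0; i < TOY_NODES; i++) {
		if (i != lmaster &&
		    !toy_receive_record(&nodes[i], key, local->rsn)) {
			ok = false;
		}
	}
	if (!ok) {
		return false; /* some node refused: keep the record */
	}

	/* Phase 2: ask all other nodes to delete their copy. */
	for (i = 0; i < TOY_NODES; i++) {
		if (i != lmaster &&
		    !toy_try_delete(&nodes[i], key, local->rsn)) {
			ok = false;
		}
	}
	if (!ok) {
		return false;
	}

	/* Phase 3: delete locally only after all remote deletes succeeded. */
	local->present = false;
	return true;
}
```

In this model a stale remote copy that still carries data is first overwritten with the empty copy and then deleted, while a remote copy with a higher RSN makes phase 1 fail, so the lmaster keeps its record.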

commit 64d75afca94d0a59a8d112f2e8d7130c23e5487c
Author: Michael Adam <obnox at samba.org>
Date:   Fri Dec 21 00:24:47 2012 +0100

    vacuum: introduce the RECEIVE_RECORDS control
    
    This is in preparation for turning the vacuuming on the lmaster
    into a two-phase process:
    
    - First the node sends the list of records to be vacuumed
      to all other nodes with this new RECEIVE_RECORDS control.
      The remote nodes should store the lmaster's empty current copy.
    - Only those records that could be stored on all other nodes
      are processed further. They are sent to all other nodes with
      the TRY_DELETE_RECORDS control as before for deletion.
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit e397702e271af38204fd99733bbeba7c1db3a999)
    
    Conflicts:
    
    	include/ctdb_protocol.h
    	server/ctdb_control.c

commit 0adfa7454fcd1bd17108ef4ec43454f5466a2f19
Author: Michael Adam <obnox at samba.org>
Date:   Sat Dec 29 18:32:39 2012 +0100

    vacuum: reorder some of ctdb_process_delete_list() more intuitively
    
    Now that the nodemap and its talloc children don't hang off of the
    delete_records_list talloc context, we can build the nodemap
    earlier, and move the construction of the delete_records_list
    to where it is more obvious what it is used for.
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit e3740899c1af6962f93c85ad7d1cb71bddce45c6)

commit 640135d72e08480a433b34d901a2af4c300b7709
Author: Michael Adam <obnox at samba.org>
Date:   Sat Dec 29 17:16:33 2012 +0100

    vacuum: add explicit temporary memory context to ctdb_process_delete_list()
    
    This removes the implicit artificial talloc hierarchy and makes the
    code easier to understand.
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit b7c3b8cdf92c597e621e3dae28b110d321de5ea8)

commit 262935daa73e38d157513dd4351b7fc8caff6405
Author: Michael Adam <obnox at samba.org>
Date:   Sat Jan 5 01:19:06 2013 +0100

    vacuum: fix indentation in ctdb_process_delete_list()
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 59a887e12469266e514ad7d4e34810e7ea888ba3)

commit 3b53c943a0d7d72068d0eff582964b5f23c22629
Author: Michael Adam <obnox at samba.org>
Date:   Mon Dec 17 17:31:55 2012 +0100

    vacuum: free temporary allocated memory correctly in ctdb_process_delete_list().
    
    Add a common exit point for cleanup.
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 11d728465a9c635e1829abaae17e2f7720433b69)

commit 3c9b6e11b2051270bd02f15d06df36dc6151f2f8
Author: Michael Adam <obnox at samba.org>
Date:   Mon Dec 17 17:26:22 2012 +0100

    vacuum: move variable into scope of use in ctdb_process_delete_list()
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 3710dd0f313f551f1b302b4961e0203243e3d661)

commit 734880e16aa33a337c64e3f32e92a114f7ab4196
Author: Michael Adam <obnox at samba.org>
Date:   Mon Dec 17 13:07:21 2012 +0100

    vacuum: move variable into scope of use in ctdb_process_delete_list()
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 4640979b526b6dac69a6a0555bfce75fe0206dac)

commit 73643aa5878c3bb8dbd171da0d23f71ac3ac897e
Author: Michael Adam <obnox at samba.org>
Date:   Mon Dec 17 13:03:42 2012 +0100

    vacuum: simplify ctdb_process_delete_list(): reduce indentation
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit f3e6e7f8ef22bd70dd2f101d818e2e5ab5ed3cd8)
    
    Conflicts:
    
    	server/ctdb_vacuum.c

commit eec5648841efbf68a58109d4f649a2827a7401ff
Author: Michael Adam <obnox at samba.org>
Date:   Wed Apr 3 14:12:27 2013 +0200

    vacuum: add DEBUG to skip conditions in delete_record_traverse()
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 817c77a3d0a3546bf46389cec5f6b54778dd1693)
    
    Conflicts:
    
    	server/ctdb_vacuum.c

commit 2b0d7cb9a7dd0c154339cea71e8a6e23b8cc8fea
Author: Michael Adam <obnox at samba.org>
Date:   Mon Apr 22 10:21:02 2013 -0400

    client: fix ctdb_control() to be able to cope with CTDB_CTRL_FLAG_NOREPLY
    
    This was apparently not used before in this context, and the bug hence
    not detected. It becomes necessary when ctdb_local_schedule_for_deletion()
    is called from a client ctdbd (the vacuuming child), hence needs to send
    the SCHEDULE_FOR_DELETION control to its parent.
    
    Pair-Programmed-With: Stefan Metzmacher <metze at samba.org>
    
    Signed-off-by: Stefan Metzmacher <metze at samba.org>
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit e72a5e11845fe445baaee4730bb0bea8588ee9e3)

commit 152c23b3891a90e7a608922f41f23a9f2ca55df2
Author: Michael Adam <obnox at samba.org>
Date:   Wed Apr 3 12:02:59 2013 +0200

    ctdb_call: don't bump the rsn in ctdb_become_dmaster() any more
    
    This is now done in ctdb_ltdb_store_server(), so this
    extra bump can be spared.
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit cad3107b12e8392f786f9a758ee38cf3a3d58538)

commit 697a2711d2086760687ea3d2e4d13957a524da9d
Author: Michael Adam <obnox at samba.org>
Date:   Wed Apr 3 11:40:25 2013 +0200

    Fix a severe recovery bug that can lead to data corruption for SMB clients.
    
    Problem:
    Recovery can under certain circumstances lead to old record copies
    resurrecting: Recovery selects the newest record copy purely by RSN. At
    the end of the recovery, the recovery master is the dmaster for all
    records in all (non-persistent) databases. And the other nodes locally
    hold the complete copy of the databases. The bug is that the recovery
    process does not increment the RSN on the recovery master at the end of
    the recovery. Now clients acting directly on the Recovery master will
    directly change a record's content on the recmaster without migration
    and hence without RSN bump.  So a subsequent recovery can not tell that
    the recmaster's copy is newer than the copies on the other nodes, since
    their RSN is the same. Hence, if the recmaster is not node 0 (or more
    precisely not the active node with the lowest node number), the recovery
    will choose copies from nodes with lower number and stick to these.
    
    Here is how to reproduce:
    
    - assume we have a cluster with at least 2 nodes
    - ensure that the recmaster is not node 0
      (maybe ensure with "onnode 0 ctdb setrecmasterrole off")
      say recmaster is node 1
    - choose a new database name, say "test1.tdb"
      (make sure it is not yet attached as persistent)
    - choose a key name, say "key1"
    - all cluster nodes should be ok and no recovery should be running
    - now do the following on node 1:
    
    1. dbwrap_tool test1.tdb store key1 uint32 1
    2. dbwrap_tool test1.tdb fetch key1 uint32
       ==> 1
    3. ctdb recover
    4. dbwrap_tool test1.tdb store key1 uint32 2
    5. dbwrap_tool test1.tdb fetch key1 uint32
       ==> 2
    6. ctdb recover
    7. dbwrap_tool test1.tdb fetch key1 uint32
       ==> 1
       ==> BUG
    
    This is a very severe bug, since when applied to Samba's locking.tdb
    database, it means that for SMB clients on clustered Samba there is
    the potential for locking out oneself from previously opened files
    or even worse, data corruption:
    
    Case 1: locking out
    
    - client on recmaster opens file
    - recovery propagates open file handle (entry in locking.tdb) to
      other nodes
    - client closes file
    - client opens the same file
    - recovery resurrects old copy of open file record in locking.tdb
      from lower node
    - client closes file but fails to delete entry in locking.tdb
    - client tries to open same file again but fails, since
      the old record locks it out (since the client is still connected)
    
    Case 2: data corruption
    
    - client1 on recmaster opens file
    - recovery propagates open file info to other nodes
    - client1 closes the file and disconnects
    - client2 opens the same file
    - recovery resurrects old copy of locking.tdb record,
      where client2 has no entry, but client1 has.
    - but client2 believes it still has a handle
    - client3 opens the file and succeeds without
      conflicting with client2
      (the detached entry for client1 is discarded because
       the server does not exist any more).
    => both client2 and client3 believe they have exclusive
      access to the file and writing creates data corruption
    
    Fix:
    
    When storing a record on the dmaster, bump its RSN.
    
    The ctdb_ltdb_store_server() is the central function for storing
    a record to a local tdb from the ctdbd server context.
    So this is also the place where the RSN of the record to be stored
    should be incremented, when storing on the dmaster.
    
    For the case of the record migration, this is currently done in
    ctdb_become_dmaster() in ctdb_call.c, but there are other places
    such as in recovery, where we should bump the RSN, but currently
    don't do it.
    
    So moving the RSN incrementation into ctdb_ltdb_store_server fixes
    the recovery-record-resurrection bug.
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-By: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit feb1d40b21a160737aead22e398f3c34ff3be8de)
    
    Conflicts:
    
    	server/ctdb_ltdb_server.c
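
The reproduction steps above can be replayed in a toy model of RSN-based recovery. This is a minimal sketch under stated assumptions, not the ctdb recovery code: the toy_* names are hypothetical, a single record on a two-node cluster is modelled, and ties between equal RSNs go to the lowest node number, as the commit message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NODES 2 /* hypothetical two-node cluster */

/* Toy stand-in for one node's copy of a single record. */
struct toy_copy {
	uint64_t rsn;     /* record sequence number */
	uint32_t dmaster; /* node that currently owns the record */
	uint32_t value;   /* record payload */
};

/*
 * Recovery: pick the copy with the highest RSN (the lowest node number
 * wins a tie), make the recmaster the dmaster and push that copy to
 * every node.
 */
static void toy_recover(struct toy_copy copies[TOY_NODES], uint32_t recmaster)
{
	struct toy_copy winner;
	int best = 0;
	int i;

	for (i = 1; i < TOY_NODES; i++) {
		if (copies[i].rsn > copies[best].rsn) {
			best = i;
		}
	}
	winner = copies[best];
	winner.dmaster = recmaster;
	for (i = 0; i < TOY_NODES; i++) {
		copies[i] = winner;
	}
}

/* A client writes directly on the dmaster, with or without the RSN bump. */
static void toy_store_on_dmaster(struct toy_copy copies[TOY_NODES],
				 uint32_t node, uint32_t value, bool bump_rsn)
{
	assert(node < TOY_NODES && copies[node].dmaster == node);
	copies[node].value = value;
	if (bump_rsn) {
		copies[node].rsn++;
	}
}

/* Replay store 1, recover, store 2, recover; return what node 1 then sees. */
static uint32_t toy_replay(bool bump_rsn)
{
	struct toy_copy copies[TOY_NODES] = {
		{ 0, 1, 0 }, /* node 0: older copy */
		{ 1, 1, 0 }, /* node 1: recmaster and dmaster */
	};

	toy_store_on_dmaster(copies, 1, 1, bump_rsn);
	toy_recover(copies, 1);
	toy_store_on_dmaster(copies, 1, 2, bump_rsn);
	toy_recover(copies, 1);
	return copies[1].value;
}
```

Calling toy_replay(false) models the old behaviour, where the second recovery sees equal RSNs and falls back to node 0's old copy. toy_replay(true) models the fix: the store on the dmaster raises the RSN, so the recmaster's newer copy wins.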

commit e1c1ee5a091185e8318c6a095019aaea60b44789
Author: Michael Adam <obnox at samba.org>
Date:   Mon Apr 15 12:50:42 2013 +0200

    logging: fix comment typo
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 4c0cbfbe8b19f2e6fe17093b52c734bec63dd8b7)

commit d7bc3313e03e180731aaef688d03daae1b027d79
Author: Michael Adam <obnox at samba.org>
Date:   Wed Apr 3 14:03:32 2013 +0200

    ctdbd: unimplement the unused SET_DMASTER control
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 2e92deef5221ee651028ef87138b3113f1fece91)
    
    Conflicts:
    
    	include/ctdb_protocol.h
    	server/ctdb_recover.c

commit 8fae01c6fa1f9d8ba2996f5a22df4811052d3a22
Author: Michael Adam <obnox at samba.org>
Date:   Fri Mar 22 17:48:00 2013 +0100

    recoverd: remove bogus comment "qqq" from "add prototype new banning code"
    
    Signed-off-by: Michael Adam <obnox at samba.org>
    Reviewed-by: Amitay Isaacs <amitay at gmail.com>
    (cherry picked from commit 9f01b8db72780acf2f88f1392bc0a796dd4c6176)

-----------------------------------------------------------------------

Summary of changes:
 client/ctdb_client.c       |   11 +
 include/ctdb_private.h     |    6 +-
 packaging/RPM/ctdb.spec.in |    9 +-
 server/ctdb_call.c         |    2 +-
 server/ctdb_control.c      |    8 +-
 server/ctdb_logging.c      |    2 +-
 server/ctdb_ltdb_server.c  |    9 +-
 server/ctdb_recover.c      |  235 +++++++++++++++-----
 server/ctdb_recoverd.c     |    1 -
 server/ctdb_vacuum.c       |  552 +++++++++++++++++++++++++++++++++-----------
 10 files changed, 642 insertions(+), 193 deletions(-)


Changeset truncated at 500 lines:

diff --git a/client/ctdb_client.c b/client/ctdb_client.c
index 94fc712..c1b79af 100644
--- a/client/ctdb_client.c
+++ b/client/ctdb_client.c
@@ -926,6 +926,17 @@ int ctdb_control(struct ctdb_context *ctdb, uint32_t destnode, uint64_t srvid,
 	state = ctdb_control_send(ctdb, destnode, srvid, opcode, 
 			flags, data, mem_ctx,
 			timeout, errormsg);
+
+	/* FIXME: Error conditions in ctdb_control_send return NULL without
+	 * setting errormsg.  So, there is no way to distinguish between success
+	 * and failure when CTDB_CTRL_FLAG_NOREPLY is set */
+	if (flags & CTDB_CTRL_FLAG_NOREPLY) {
+		if (status != NULL) {
+			*status = 0;
+		}
+		return 0;
+	}
+
 	return ctdb_control_recv(ctdb, state, mem_ctx, outdata, status, 
 			errormsg);
 }
diff --git a/include/ctdb_private.h b/include/ctdb_private.h
index 9c54e62..a2af9bb 100644
--- a/include/ctdb_private.h
+++ b/include/ctdb_private.h
@@ -545,7 +545,7 @@ enum ctdb_controls {CTDB_CONTROL_PROCESS_EXISTS          = 0,
 		    CTDB_CONTROL_SET_DEBUG               = 8,
 		    CTDB_CONTROL_GET_DBMAP               = 9,
 		    CTDB_CONTROL_GET_NODEMAPv4           = 10, /* obsolete */
-		    CTDB_CONTROL_SET_DMASTER             = 11,
+		    CTDB_CONTROL_SET_DMASTER             = 11, /* obsolete */
 		    /* #12 removed */
 		    CTDB_CONTROL_PULL_DB                 = 13,
 		    CTDB_CONTROL_PUSH_DB                 = 14,
@@ -659,6 +659,8 @@ enum ctdb_controls {CTDB_CONTROL_PROCESS_EXISTS          = 0,
 		    CTDB_CONTROL_SCHEDULE_FOR_DELETION   = 128,
 		    /* 129 & 130: skipped (master) */
 		    CTDB_CONTROL_TRAVERSE_START_EXT	 = 131,
+		    /* 132, 133, 134, 135 skipped (master) */
+		    CTDB_CONTROL_RECEIVE_RECORDS	 = 136,
 };
 
 /*
@@ -1264,7 +1266,6 @@ struct ctdb_rec_data *ctdb_marshall_loop_next(struct ctdb_marshall_buffer *m, st
 
 int32_t ctdb_control_pull_db(struct ctdb_context *ctdb, TDB_DATA indata, TDB_DATA *outdata);
 int32_t ctdb_control_push_db(struct ctdb_context *ctdb, TDB_DATA indata);
-int32_t ctdb_control_set_dmaster(struct ctdb_context *ctdb, TDB_DATA indata);
 
 int32_t ctdb_control_set_recmode(struct ctdb_context *ctdb, 
 				 struct ctdb_req_control *c,
@@ -1461,6 +1462,7 @@ int32_t ctdb_control_get_tunable(struct ctdb_context *ctdb, TDB_DATA indata,
 int32_t ctdb_control_set_tunable(struct ctdb_context *ctdb, TDB_DATA indata);
 int32_t ctdb_control_list_tunables(struct ctdb_context *ctdb, TDB_DATA *outdata);
 int32_t ctdb_control_try_delete_records(struct ctdb_context *ctdb, TDB_DATA indata, TDB_DATA *outdata);
+int32_t ctdb_control_receive_records(struct ctdb_context *ctdb, TDB_DATA indata, TDB_DATA *outdata);
 int32_t ctdb_control_add_public_address(struct ctdb_context *ctdb, TDB_DATA indata);
 int32_t ctdb_control_del_public_address(struct ctdb_context *ctdb, TDB_DATA indata);
 
diff --git a/packaging/RPM/ctdb.spec.in b/packaging/RPM/ctdb.spec.in
index 1c1f00f..3ad669a 100644
--- a/packaging/RPM/ctdb.spec.in
+++ b/packaging/RPM/ctdb.spec.in
@@ -4,7 +4,7 @@ Summary: Clustered TDB
 Vendor: Samba Team
 Packager: Samba Team <samba at samba.org>
 Name: ctdb
-Version: 1.0.114.5
+Version: 1.0.114.6
 Release: 1GITHASH
 Epoch: 0
 License: GNU GPL version 3
@@ -127,6 +127,13 @@ rm -rf $RPM_BUILD_ROOT
 %{_docdir}/ctdb/tests/bin/ctdb_transaction
 
 %changelog
+* Mon Apr 29 2013 : Version 1.0.114.6
+ - Michael Adam: fix data corruption bug (by record resurrection) in
+   recovery code (backported from master)
+ - Michael Adam: fix race condition data corruption bug (by record
+   resurrection) in vacuum code (backported from master)
+ - Michael Adam: some typo fixes (backported from master)
+ - Martin Schwenke: Error propagation fix in ctdb_ltdb_store_server()
 * Mon May 07 2012 : Version 1.0.114.5
  - Rusty Russell: wrap iptables in flock to avoid concurrancy
    (backported from master)
diff --git a/server/ctdb_call.c b/server/ctdb_call.c
index eb8d93c..4812776 100644
--- a/server/ctdb_call.c
+++ b/server/ctdb_call.c
@@ -277,7 +277,7 @@ static void ctdb_become_dmaster(struct ctdb_db_context *ctdb_db,
 	DEBUG(DEBUG_DEBUG,("pnn %u dmaster response %08x\n", ctdb->pnn, ctdb_hash(&key)));
 
 	ZERO_STRUCT(header);
-	header.rsn = rsn + 1;
+	header.rsn = rsn;
 	header.dmaster = ctdb->pnn;
 	header.flags = record_flags;
 
diff --git a/server/ctdb_control.c b/server/ctdb_control.c
index 69d61c1..0a3c761 100644
--- a/server/ctdb_control.c
+++ b/server/ctdb_control.c
@@ -156,7 +156,9 @@ static int32_t ctdb_control_dispatch(struct ctdb_context *ctdb,
 
 	case CTDB_CONTROL_SET_DMASTER: 
 		CHECK_CONTROL_DATA_SIZE(sizeof(struct ctdb_control_set_dmaster));
-		return ctdb_control_set_dmaster(ctdb, indata);
+		DEBUG(DEBUG_ERR, ("The SET_DMASTER control is not implemented "
+				  "any more.\n"));
+		return  -1;
 
 	case CTDB_CONTROL_PUSH_DB:
 		return ctdb_control_push_db(ctdb, indata);
@@ -593,6 +595,10 @@ static int32_t ctdb_control_dispatch(struct ctdb_context *ctdb,
 		CHECK_CONTROL_DATA_SIZE(size);
 		return ctdb_control_schedule_for_deletion(ctdb, indata);
 	}
+
+	case CTDB_CONTROL_RECEIVE_RECORDS:
+		return ctdb_control_receive_records(ctdb, indata, outdata);
+
 	default:
 		DEBUG(DEBUG_CRIT,(__location__ " Unknown CTDB control opcode %u\n", opcode));
 		return -1;
diff --git a/server/ctdb_logging.c b/server/ctdb_logging.c
index a7ca1a1..7881c2f 100644
--- a/server/ctdb_logging.c
+++ b/server/ctdb_logging.c
@@ -66,7 +66,7 @@ static void ctdb_syslog_handler(struct event_context *ev, struct fd_event *fde,
 }
 
 
-/* called when the pipd from the main daemon has closed
+/* called when the pipe from the main daemon has closed
  * this is for the syslog daemon, we can not use DEBUG here
  */
 static void ctdb_syslog_terminate_handler(struct event_context *ev, struct fd_event *fde, 
diff --git a/server/ctdb_ltdb_server.c b/server/ctdb_ltdb_server.c
index 275f6c6..7cc06fc 100644
--- a/server/ctdb_ltdb_server.c
+++ b/server/ctdb_ltdb_server.c
@@ -142,11 +142,14 @@ static int ctdb_ltdb_store_server(struct ctdb_db_context *ctdb_db,
 	}
 
 	if (keep) {
-		if ((data.dsize == 0) &&
-		    !ctdb_db->persistent &&
+		if (!ctdb_db->persistent &&
 		    (ctdb_db->ctdb->pnn == header->dmaster))
 		{
-			schedule_for_deletion = true;
+			header->rsn++;
+
+			if (data.dsize == 0) {
+				schedule_for_deletion = true;
+			}
 		}
 		remove_from_delete_queue = !schedule_for_deletion;
 	}
diff --git a/server/ctdb_recover.c b/server/ctdb_recover.c
index 537c4ea..9d360c7 100644
--- a/server/ctdb_recover.c
+++ b/server/ctdb_recover.c
@@ -485,59 +485,6 @@ failed:
 	return -1;
 }
 
-
-static int traverse_setdmaster(struct tdb_context *tdb, TDB_DATA key, TDB_DATA data, void *p)
-{
-	uint32_t *dmaster = (uint32_t *)p;
-	struct ctdb_ltdb_header *header = (struct ctdb_ltdb_header *)data.dptr;
-	int ret;
-
-	/* skip if already correct */
-	if (header->dmaster == *dmaster) {
-		return 0;
-	}
-
-	header->dmaster = *dmaster;
-
-	ret = tdb_store(tdb, key, data, TDB_REPLACE);
-	if (ret) {
-		DEBUG(DEBUG_CRIT,(__location__ " failed to write tdb data back  ret:%d\n",ret));
-		return ret;
-	}
-
-	/* TODO: add error checking here */
-
-	return 0;
-}
-
-int32_t ctdb_control_set_dmaster(struct ctdb_context *ctdb, TDB_DATA indata)
-{
-	struct ctdb_control_set_dmaster *p = (struct ctdb_control_set_dmaster *)indata.dptr;
-	struct ctdb_db_context *ctdb_db;
-
-	ctdb_db = find_ctdb_db(ctdb, p->db_id);
-	if (!ctdb_db) {
-		DEBUG(DEBUG_ERR,(__location__ " Unknown db 0x%08x\n", p->db_id));
-		return -1;
-	}
-
-	if (ctdb->freeze_mode[ctdb_db->priority] != CTDB_FREEZE_FROZEN) {
-		DEBUG(DEBUG_DEBUG,("rejecting ctdb_control_set_dmaster when not frozen\n"));
-		return -1;
-	}
-
-	if (ctdb_lock_all_databases_mark(ctdb, 	ctdb_db->priority) != 0) {
-		DEBUG(DEBUG_ERR,(__location__ " Failed to get lock on entired db - failing\n"));
-		return -1;
-	}
-
-	tdb_traverse(ctdb_db->ltdb->tdb, traverse_setdmaster, &p->dmaster);
-
-	ctdb_lock_all_databases_unmark(ctdb, ctdb_db->priority);
-	
-	return 0;
-}
-
 struct ctdb_set_recmode_state {
 	struct ctdb_context *ctdb;
 	struct ctdb_req_control *c;
@@ -1131,6 +1078,188 @@ int32_t ctdb_control_try_delete_records(struct ctdb_context *ctdb, TDB_DATA inda
 	return 0;
 }
 
+/**
+ * Store a record as part of the vacuum process:
+ * This is called from the RECEIVE_RECORDS control which
+ * the lmaster uses to send the current empty copy
+ * to all nodes for storing, before it lets the other
+ * nodes delete the records in the second phase with
+ * the TRY_DELETE_RECORDS control.
+ *
+ * Only store if we are not lmaster or dmaster, and our
+ * rsn is <= the provided rsn. Use non-blocking locks.
+ *
+ * return 0 if the record was successfully stored.
+ * return !0 if the record still exists in the tdb after returning.
+ */
+static int store_tdb_record(struct ctdb_context *ctdb,
+			    struct ctdb_db_context *ctdb_db,
+			    struct ctdb_rec_data *rec)
+{
+	TDB_DATA key, data, data2;
+	struct ctdb_ltdb_header *hdr, *hdr2;
+	int ret;
+
+	key.dsize = rec->keylen;
+	key.dptr = &rec->data[0];
+	data.dsize = rec->datalen;
+	data.dptr = &rec->data[rec->keylen];
+
+	if (ctdb_lmaster(ctdb, &key) == ctdb->pnn) {
+		DEBUG(DEBUG_INFO, (__location__ " Called store_tdb_record "
+				   "where we are lmaster\n"));
+		return -1;
+	}
+
+	if (data.dsize != sizeof(struct ctdb_ltdb_header)) {
+		DEBUG(DEBUG_ERR, (__location__ " Bad record size\n"));
+		return -1;
+	}
+
+	hdr = (struct ctdb_ltdb_header *)data.dptr;
+
+	/* use a non-blocking lock */
+	if (tdb_chainlock_nonblock(ctdb_db->ltdb->tdb, key) != 0) {
+		DEBUG(DEBUG_ERR, (__location__ " Failed to lock chain\n"));
+		return -1;
+	}
+
+	data2 = tdb_fetch(ctdb_db->ltdb->tdb, key);
+	if (data2.dptr == NULL || data2.dsize < sizeof(struct ctdb_ltdb_header)) {
+		tdb_store(ctdb_db->ltdb->tdb, key, data, 0);
+		DEBUG(DEBUG_INFO, (__location__ " Stored record\n"));
+		ret = 0;
+		goto done;
+	}
+
+	hdr2 = (struct ctdb_ltdb_header *)data2.dptr;
+
+	if (hdr2->rsn > hdr->rsn) {
+		DEBUG(DEBUG_INFO, (__location__ " Skipping record with "
+				   "rsn=%llu - called with rsn=%llu\n",
+				   (unsigned long long)hdr2->rsn,
+				   (unsigned long long)hdr->rsn));
+		ret = -1;
+		goto done;
+	}
+
+	if (hdr2->dmaster == ctdb->pnn) {
+		DEBUG(DEBUG_INFO, (__location__ " Attempted to store record "
+				   "where we are the dmaster\n"));
+		ret = -1;
+		goto done;
+	}
+
+	if (tdb_store(ctdb_db->ltdb->tdb, key, data, 0) != 0) {
+		DEBUG(DEBUG_INFO,(__location__ " Failed to store record\n"));
+		ret = -1;
+		goto done;
+	}
+
+	ret = 0;
+
+done:
+	tdb_chainunlock(ctdb_db->ltdb->tdb, key);
+	free(data2.dptr);
+	return  ret;
+}
+
+
+
+/**
+ * Try to store all these records as part of the vacuuming process
+ * and return the records we failed to store.
+ */
+int32_t ctdb_control_receive_records(struct ctdb_context *ctdb,
+				     TDB_DATA indata, TDB_DATA *outdata)
+{
+	struct ctdb_marshall_buffer *reply = (struct ctdb_marshall_buffer *)indata.dptr;
+	struct ctdb_db_context *ctdb_db;
+	int i;
+	struct ctdb_rec_data *rec;
+	struct ctdb_marshall_buffer *records;
+
+	if (indata.dsize < offsetof(struct ctdb_marshall_buffer, data)) {
+		DEBUG(DEBUG_ERR,
+		      (__location__ " invalid data in receive_records\n"));
+		return -1;
+	}
+
+	ctdb_db = find_ctdb_db(ctdb, reply->db_id);
+	if (!ctdb_db) {
+		DEBUG(DEBUG_ERR, (__location__ " Unknown db 0x%08x\n",
+				  reply->db_id));
+		return -1;
+	}
+
+	DEBUG(DEBUG_DEBUG, ("starting receive_records of %u records for "
+			    "dbid 0x%x\n", reply->count, reply->db_id));
+
+	/* create a blob to send back the records we could not store */
+	records = (struct ctdb_marshall_buffer *)
+			talloc_zero_size(outdata,
+				offsetof(struct ctdb_marshall_buffer, data));
+	if (records == NULL) {
+		DEBUG(DEBUG_ERR, (__location__ " Out of memory\n"));
+		return -1;
+	}
+	records->db_id = ctdb_db->db_id;
+
+	rec = (struct ctdb_rec_data *)&reply->data[0];
+	for (i=0; i<reply->count; i++) {
+		TDB_DATA key, data;
+
+		key.dptr = &rec->data[0];
+		key.dsize = rec->keylen;
+		data.dptr = &rec->data[key.dsize];
+		data.dsize = rec->datalen;
+
+		if (data.dsize < sizeof(struct ctdb_ltdb_header)) {
+			DEBUG(DEBUG_CRIT, (__location__ " bad ltdb record "
+					   "in indata\n"));
+			return -1;
+		}
+
+		/*
+		 * If we can not store the record we must add it to the reply
+		 * so the lmaster knows it may not purge this record.
+		 */
+		if (store_tdb_record(ctdb, ctdb_db, rec) != 0) {
+			size_t old_size;
+			struct ctdb_ltdb_header *hdr;
+
+			hdr = (struct ctdb_ltdb_header *)data.dptr;
+			data.dptr += sizeof(*hdr);
+			data.dsize -= sizeof(*hdr);
+
+			DEBUG(DEBUG_INFO, (__location__ " Failed to store "
+					   "record with hash 0x%08x in vacuum "
+					   "via RECEIVE_RECORDS\n",
+					   ctdb_hash(&key)));
+
+			old_size = talloc_get_size(records);
+			records = talloc_realloc_size(outdata, records,
+						      old_size + rec->length);
+			if (records == NULL) {
+				DEBUG(DEBUG_ERR, (__location__ " Failed to "
+						  "expand\n"));
+				return -1;
+			}
+			records->count++;
+			memcpy(old_size+(uint8_t *)records, rec, rec->length);
+		}
+
+		rec = (struct ctdb_rec_data *)(rec->length + (uint8_t *)rec);
+	}
+
+
+	outdata->dptr = (uint8_t *)records;
+	outdata->dsize = talloc_get_size(records);
+
+	return 0;
+}
+
+
 /*
   report capabilities
  */
diff --git a/server/ctdb_recoverd.c b/server/ctdb_recoverd.c
index 93af64e..9bd7e95 100644
--- a/server/ctdb_recoverd.c
+++ b/server/ctdb_recoverd.c
@@ -3053,7 +3053,6 @@ again:
 	/* check that we (recovery daemon) and the local ctdb daemon
 	   agrees on whether we are banned or not
 	*/
-//qqq
 
 	/* remember our own node flags */
 	rec->node_flags = nodemap->nodes[pnn].flags;
diff --git a/server/ctdb_vacuum.c b/server/ctdb_vacuum.c
index 89e261a..a81323c 100644
--- a/server/ctdb_vacuum.c
+++ b/server/ctdb_vacuum.c
@@ -2,7 +2,7 @@
    ctdb vacuuming events
 
    Copyright (C) Ronnie Sahlberg  2009
-   Copyright (C) Michael Adam 2010-2011
+   Copyright (C) Michael Adam 2010-2013
    Copyright (C) Stefan Metzmacher 2010-2011
 
    This program is free software; you can redistribute it and/or modify
@@ -98,6 +98,7 @@ struct delete_record_data {
 
 struct delete_records_list {
 	struct ctdb_marshall_buffer *records;
+	struct vacuum_data *vdata;
 };
 
 /**
@@ -305,6 +306,133 @@ static int delete_marshall_traverse(void *param, void *data)
 }
 
 /**
+ * Variant of delete_marshall_traverse() that bumps the
+ * RSN of each traversed record in the database.
+ *
+ * This is needed to ensure that when rolling out our
+ * empty record copy before remote deletion, we as the
+ * record's dmaster keep a higher RSN than the non-dmaster
+ * nodes. This is needed to prevent old copies from
+ * resurrection in recoveries.
+ */
+static int delete_marshall_traverse_first(void *param, void *data)
+{
+	struct delete_record_data *dd = talloc_get_type(data, struct delete_record_data);
+	struct delete_records_list *recs = talloc_get_type(param, struct delete_records_list);
+	struct ctdb_db_context *ctdb_db = dd->ctdb_db;
+	struct ctdb_context *ctdb = ctdb_db->ctdb;
+	struct ctdb_ltdb_header *header;
+	TDB_DATA tdb_data, ctdb_data;
+	uint32_t lmaster;
+	uint32_t hash = ctdb_hash(&(dd->key));
+	int res;
+
+	res = tdb_chainlock(ctdb_db->ltdb->tdb, dd->key);
+	if (res != 0) {
+		DEBUG(DEBUG_ERR,
+		      (__location__ " Error getting chainlock on record with "
+		       "key hash [0x%08x] on database db[%s].\n",
+		       hash, ctdb_db->db_name));
+		recs->vdata->delete_skipped++;
+		talloc_free(dd);
+		return 0;
+	}
+
+	/*
+	 * Verify that the record is still empty, its RSN has not
+	 * changed and that we are still its lmaster and dmaster.
+	 */
+
+	tdb_data = tdb_fetch(ctdb_db->ltdb->tdb, dd->key);
+	if (tdb_data.dsize < sizeof(struct ctdb_ltdb_header)) {
+		DEBUG(DEBUG_INFO, (__location__ ": record with hash [0x%08x] "
+				   "on database db[%s] does not exist or is not"
+				   " a ctdb-record.  skipping.\n",
+				   hash, ctdb_db->db_name));
+		goto skip;
+	}
+
+	if (tdb_data.dsize > sizeof(struct ctdb_ltdb_header)) {
+		DEBUG(DEBUG_INFO, (__location__ ": record with hash [0x%08x] "
+				   "on database db[%s] has been recycled. "


-- 
CTDB repository