RFC: dbwrap_ctdb and empty vs deleted records

Mon Aug 8 10:38:41 UTC 2016

Hi Ralph,

after thinking for quite some time about this, and
looking through the code with increasingly watery
eyes, I have a few comments:

In summary, I think the patch is wrong in this exact
form. But I have a proposal of a modification that
should work always.

See the details below:


> From 12afa6c3febf3fb70ba0a207253b6b4bf77a439b Mon Sep 17 00:00:00 2001
> From: Ralph Boehme <slow at samba.org>
> Date: Fri, 1 Jul 2016 17:21:04 +0200
> Subject: [PATCH 2/3] dbwrap_ctdb: check for ltdb tombstone records
> 
> When fetching records from remote ctdb nodes via ctdbd_parse() or in
> db_ctdb_traverse(), we already checks for tombstone records and skip
> them. Add the same check to db_ctdb_ltdb_parser() when fetching records
> from the ltbd.
> 
> See also bug: https://bugzilla.samba.org/show_bug.cgi?id=10008
> 
> Commit 925625b52886d40b50fc631bad8bdc81970f7598 reverted part of the
> patch in bug 10008 due to a deadlock it introduced.
> 
> This patch avoids the deadlock because it ensures db_ctdb_parse_record()
> returns NT_STATUS_NOT_FOUND if an empty record is found in the ltdb and
> not calling ctdb.
> 
> Bug: https://bugzilla.samba.org/show_bug.cgi?id=12005
> 
> Signed-off-by: Ralph Boehme <slow at samba.org>
> ---
>  source3/lib/dbwrap/dbwrap_ctdb.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/source3/lib/dbwrap/dbwrap_ctdb.c b/source3/lib/dbwrap/dbwrap_ctdb.c
> index c45c257..4244d66 100644
> --- a/source3/lib/dbwrap/dbwrap_ctdb.c
> +++ b/source3/lib/dbwrap/dbwrap_ctdb.c
> @@ -79,6 +79,7 @@ struct db_ctdb_ltdb_parse_state {
>  	void (*parser)(TDB_DATA key, struct ctdb_ltdb_header *header,
>  		       TDB_DATA data, void *private_data);
>  	void *private_data;
> +	bool empty_record;
>  };
>  
>  static int db_ctdb_ltdb_parser(TDB_DATA key, TDB_DATA data,
> @@ -91,6 +92,15 @@ static int db_ctdb_ltdb_parser(TDB_DATA key, TDB_DATA data,
>  		return -1;
>  	}
>  
> +	if (data.dsize == sizeof(struct ctdb_ltdb_header)) {
> +		/*
> +		 * Possible record tombstone or just a fresh and empty
> +		 * record
> +		 */
> +		state->empty_record = true;
> +		return -1;
> +	}

This empty_record setting is to differentiate the case that a record
does not exist in the local db and that exists but is empty. So
that the caller of db_ctdb_ltdb_parser can decide whether to take
the NOT_FOUND reply as authoritative or not.

But for this to be authoritative, we need to know that we are the
dmaster (or have a RO copy). This is only checked in the following
parser call and put into the 'done' variable.

So I think this exit is too early to be authoritative.

>  	state->parser(
>  		key, (struct ctdb_ltdb_header *)data.dptr,
>  		make_tdb_data(data.dptr + sizeof(struct ctdb_ltdb_header),
> @@ -105,6 +115,7 @@ struct db_ctdb_parse_record_state {
>  	uint32_t my_vnn;
>  	bool ask_for_readonly_copy;
>  	bool done;
> +	bool empty_record;
>  };
>  
>  static NTSTATUS db_ctdb_ltdb_parse(
> @@ -113,6 +124,8 @@ static NTSTATUS db_ctdb_ltdb_parse(
>  		       TDB_DATA data, void *private_data),
>  	void *private_data)
>  {
> +	struct db_ctdb_parse_record_state *rec_state =
> +		(struct db_ctdb_parse_record_state *)private_data;
>  	struct db_ctdb_ltdb_parse_state state;
>  	int ret;
>  
> @@ -122,6 +135,9 @@ static NTSTATUS db_ctdb_ltdb_parse(
>  	ret = tdb_parse_record(db->wtdb->tdb, key, db_ctdb_ltdb_parser,
>  			       &state);
>  	if (ret == -1) {
> +		if (state.empty_record) {
> +			rec_state->empty_record = true;
> +		}
>  		return NT_STATUS_NOT_FOUND;
>  	}
>  	return NT_STATUS_OK;
> @@ -1267,6 +1283,7 @@ static NTSTATUS db_ctdb_parse_record(struct db_context *db, TDB_DATA key,
>  	state.parser = parser;
>  	state.private_data = private_data;
>  	state.my_vnn = ctdbd_vnn(ctx->conn);
> +	state.empty_record = false;
>  
>  	if (ctx->transaction != NULL) {
>  		struct db_ctdb_transaction_handle *h = ctx->transaction;
> @@ -1301,6 +1318,18 @@ static NTSTATUS db_ctdb_parse_record(struct db_context *db, TDB_DATA key,
>  		return NT_STATUS_OK;
>  	}
>  
> +	if (NT_STATUS_EQUAL(status, NT_STATUS_NOT_FOUND) && state.empty_record) {
> +		/*
> +		 * This is an empty record. ctdbd does not distinguish
> +		 * between empty and deleted records. Samba right now
> +		 * can live without empty records, so lets treat
> +		 * zero-size (i.e. deleted) records as non-existing.
> +		 *
> +		 * See bugs 10008 and 12005.
> +		 */
> +		return NT_STATUS_NOT_FOUND;
> +	}

The subtle main point here is not that an empty record
in the local db is treated as non-existing, but when
to treat this answer authoritative!

This is the local-db shortcut path:
If we don't exit here, we'll ask ctdbd.

At first, I thought that the 'empty_record' variable
is redundant, but then I understood, that here the
distinction is important because non-existence on
the tdb level in the ltdb might still mean that
the record exists on another node.

But if you think this through to the end, you
see that in order to treat this empty_record
as authoritative, the current node needs to
be the dmaster of the record.

So this can create false NOT FOUND states:
The record can have been migrated to the node, deleted/emptied
and then migrated off again. E.g. first brl on this node,
migrated to other node brl added, then removed again, then
brl removed on this node (empty record again), then migrated
off to some other node. Such an empty record that had been migrated
with data will not be pruned when migrating off, but only by
vacuuming.

Alternative patch attached.



More comments on the test case:

> From a921818c476368b27d6e9ac4d002e9ce86fa36a3 Mon Sep 17 00:00:00 2001
> From: Ralph Boehme <slow at samba.org>
> Date: Sat, 23 Jul 2016 11:08:13 +0200
> Subject: [PATCH 3/3] s4/torture: add a test for ctdb-tombstrone-record
>  deadlock
> 
> This tests for a possible deadlock between smbd and ctdb dealing with
> ctdb tombstone records.
> 
> Commit 925625b52886d40b50fc631bad8bdc81970f7598 explains the deadlock in
> more details and contains the fix. It's a fix for a regression
> introduced by the patch for bug 10008 (1cae59ce112c).
> 
> If you ever want to use this test against that specific commit:
> 
> $ git checkout 925625b52886d40b50fc631bad8bdc81970f7598
> $ git cherry-pick THIS_COMMIT
> 
> This should not deadlock on a ctdb cluster.
> 
> $ git revert 925625b52886d40b50fc631bad8bdc81970f7598
> 
> This will deadlock.
> 
> The patches in this bugreport (12005) avoid the deadlock as well.
> 
> Bug: https://bugzilla.samba.org/show_bug.cgi?id=12005
> 
> Signed-off-by: Ralph Boehme <slow at samba.org>
> ---
>  source4/torture/smb2/lock.c | 46 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
> 
> diff --git a/source4/torture/smb2/lock.c b/source4/torture/smb2/lock.c
> index 68e353d..dc1b91b 100644
> --- a/source4/torture/smb2/lock.c
> +++ b/source4/torture/smb2/lock.c
> @@ -3033,6 +3033,51 @@ done:
>  	return ret;
>  }
>  
> +/**
> + * Test lock interaction between smbd and ctdb with tombstone records
> + */
> +static bool test_deadlock(struct torture_context *torture,
> +			  struct smb2_tree *tree)
> +{
> +	NTSTATUS status;
> +	bool ret = true;
> +	struct smb2_handle h;
> +	uint8_t buf[200];
> +	bool correct = true;
> +	const char *fname = BASEDIR "\\deadlock.txt";
> +
> +	if (!lpcfg_clustering(torture->lp_ctx)) {
> +		torture_skip(torture, "Test must be run on a ctdb cluster\n");
> +		return true;
> +	}

I still think that we should not restrict this to clustering.

> +	status = torture_smb2_testdir(tree, BASEDIR, &h);
> +	CHECK_STATUS(status, NT_STATUS_OK);

Try to avoid the test-specific CHECK_STATUS macros if possible.
Use torture_assert* instead.

> +	smb2_util_close(tree, h);
> +
> +	status = torture_smb2_testfile(tree, fname, &h);
> +	CHECK_STATUS(status, NT_STATUS_OK);
> +
> +	ZERO_STRUCT(buf);
> +	status = smb2_util_write(tree, h, buf, 0, ARRAY_SIZE(buf));
> +	CHECK_STATUS(status, NT_STATUS_OK);
> +
> +	ret = NT_STATUS_IS_OK(test_smb2_lock(tree, h, 0, 1, true));
> +	EXPECTED(ret, true);

can't we use torture_assert here as well and exit the test
once we hit a problem instead of carrying on?

> +	ret = NT_STATUS_IS_OK(test_smb2_unlock(tree, h, 0, 1));
> +	EXPECTED(ret, true);
> +
> +	ret = NT_STATUS_IS_OK(test_smb2_lock(tree, h, 0, 1, true));
> +	EXPECTED(ret, true);
> +	ret = NT_STATUS_IS_OK(test_smb2_unlock(tree, h, 0, 1));
> +	EXPECTED(ret, true);
> +
> +done:
> +	smb2_util_close(tree, h);

may need some guards here to prevent on closing non-opened file.

> +	smb2_deltree(tree, BASEDIR);
> +	return correct;
> +}

Apart from the above comments regarding the use of macros,
rb-by me, and great that you put in a test!

Attached a patch to squash with your patch that
lets it use torture_assert macros.

Cheers - Michael
-------------- next part --------------
From 4f19bd51e729874c0dc0a027ab048d595701f20e Mon Sep 17 00:00:00 2001
From: Ralph Boehme <slow at samba.org>
Date: Fri, 1 Jul 2016 17:21:04 +0200
Subject: [PATCH] dbwrap_ctdb: treat tombstone records in ltdb as non-existing

When fetching records from remote ctdb nodes via ctdbd_parse() or in
db_ctdb_traverse(), we already check for tombstone records and skip
them. Add the same check to db_ctdb_ltdb_parser() when fetching records
from the ltbd.

See also bug: https://bugzilla.samba.org/show_bug.cgi?id=10008

Commit 925625b52886d40b50fc631bad8bdc81970f7598 reverted part of the
patch in bug 10008 due to a deadlock it introduced.

This patch avoids the deadlock because it ensures db_ctdb_parse_record()
returns NT_STATUS_NOT_FOUND if an empty record is found in the ltdb and
not calling ctdb.

Bug: https://bugzilla.samba.org/show_bug.cgi?id=12005

Pair-Programmed-With: Michael Adam <obnox at samba.org>

Signed-off-by: Ralph Boehme <slow at samba.org>
Signed-off-by: Michael Adam <obnox at samba.org>
---
 source3/lib/dbwrap/dbwrap_ctdb.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/source3/lib/dbwrap/dbwrap_ctdb.c b/source3/lib/dbwrap/dbwrap_ctdb.c
index ededa1e..28dcf29 100644
--- a/source3/lib/dbwrap/dbwrap_ctdb.c
+++ b/source3/lib/dbwrap/dbwrap_ctdb.c
@@ -79,6 +79,7 @@ struct db_ctdb_ltdb_parse_state {
 	void (*parser)(TDB_DATA key, struct ctdb_ltdb_header *header,
 		       TDB_DATA data, void *private_data);
 	void *private_data;
+	bool empty_record;
 };
 
 static int db_ctdb_ltdb_parser(TDB_DATA key, TDB_DATA data,
@@ -91,6 +92,16 @@ static int db_ctdb_ltdb_parser(TDB_DATA key, TDB_DATA data,
 		return -1;
 	}
 
+	if (data.dsize == sizeof(struct ctdb_ltdb_header)) {
+		/*
+		 * A record consisting only of the ctdb header
+		 * can be a validly created empty record or a
+		 * tombstone record of a deleted record (not
+		 * vacuumed yet). Mark it.
+		 */
+		state->empty_record = true;
+	}
+
 	state->parser(
 		key, (struct ctdb_ltdb_header *)data.dptr,
 		make_tdb_data(data.dptr + sizeof(struct ctdb_ltdb_header),
@@ -105,6 +116,7 @@ struct db_ctdb_parse_record_state {
 	uint32_t my_vnn;
 	bool ask_for_readonly_copy;
 	bool done;
+	bool empty_record;
 };
 
 static NTSTATUS db_ctdb_ltdb_parse(
@@ -113,6 +125,8 @@ static NTSTATUS db_ctdb_ltdb_parse(
 		       TDB_DATA data, void *private_data),
 	void *private_data)
 {
+	struct db_ctdb_parse_record_state *rec_state =
+		(struct db_ctdb_parse_record_state *)private_data;
 	struct db_ctdb_ltdb_parse_state state;
 	int ret;
 
@@ -122,6 +136,9 @@ static NTSTATUS db_ctdb_ltdb_parse(
 	ret = tdb_parse_record(db->wtdb->tdb, key, db_ctdb_ltdb_parser,
 			       &state);
 	if (ret == -1) {
+		if (state.empty_record) {
+			rec_state->empty_record = true;
+		}
 		return NT_STATUS_NOT_FOUND;
 	}
 	return NT_STATUS_OK;
@@ -1267,6 +1284,7 @@ static NTSTATUS db_ctdb_parse_record(struct db_context *db, TDB_DATA key,
 	state.parser = parser;
 	state.private_data = private_data;
 	state.my_vnn = ctdbd_vnn(ctx->conn);
+	state.empty_record = false;
 
 	if (ctx->transaction != NULL) {
 		struct db_ctdb_transaction_handle *h = ctx->transaction;
@@ -1298,6 +1316,20 @@ static NTSTATUS db_ctdb_parse_record(struct db_context *db, TDB_DATA key,
 	status = db_ctdb_ltdb_parse(
 		ctx, key, db_ctdb_parse_record_parser_nonpersistent, &state);
 	if (NT_STATUS_IS_OK(status) && state.done) {
+		if (state.empty_record) {
+			/*
+			 * We know authoritatively, that this is an empty
+			 * record. Since ctdb does not distinguish between empty
+			 * and deleted records, this can be a record stored as
+			 * empty or a not-yet-vacuumed tombstone record of a
+			 * deleted record. Now Samba right now can live without
+			 * empty records, so we can safely report this record
+			 * as non-existing.
+			 *
+			 * See bugs 10008 and 12005.
+			 */
+			return NT_STATUS_NOT_FOUND;
+		}
 		return NT_STATUS_OK;
 	}
 
-- 
2.5.5

-------------- next part --------------
From de599ff9f8d0e3450f85d4ad9bed94c3e658a0e7 Mon Sep 17 00:00:00 2001
From: Michael Adam <obnox at samba.org>
Date: Mon, 8 Aug 2016 11:32:37 +0200
Subject: [PATCH] SQ: torture:smb2:lock:deadlock: modify test to use
 torture_assert macros

---
 source4/torture/smb2/lock.c | 50 ++++++++++++++++++++++++++++-----------------
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/source4/torture/smb2/lock.c b/source4/torture/smb2/lock.c
index dc1b91b..9c9465f 100644
--- a/source4/torture/smb2/lock.c
+++ b/source4/torture/smb2/lock.c
@@ -3041,9 +3041,9 @@ static bool test_deadlock(struct torture_context *torture,
 {
 	NTSTATUS status;
 	bool ret = true;
-	struct smb2_handle h;
+	struct smb2_handle _h;
+	struct smb2_handle *h = NULL;
 	uint8_t buf[200];
-	bool correct = true;
 	const char *fname = BASEDIR "\\deadlock.txt";
 
 	if (!lpcfg_clustering(torture->lp_ctx)) {
@@ -3051,31 +3051,43 @@ static bool test_deadlock(struct torture_context *torture,
 		return true;
 	}
 
-	status = torture_smb2_testdir(tree, BASEDIR, &h);
-	CHECK_STATUS(status, NT_STATUS_OK);
-	smb2_util_close(tree, h);
+	status = torture_smb2_testdir(tree, BASEDIR, &_h);
+	torture_assert_ntstatus_ok(torture, status,
+				   "torture_smb2_testdir failed");
+	smb2_util_close(tree, _h);
 
-	status = torture_smb2_testfile(tree, fname, &h);
-	CHECK_STATUS(status, NT_STATUS_OK);
+	status = torture_smb2_testfile(tree, fname, &_h);
+	torture_assert_ntstatus_ok_goto(torture, status, ret, done,
+					"torture_smb2_testfile failed");
+	h = &_h;
 
 	ZERO_STRUCT(buf);
-	status = smb2_util_write(tree, h, buf, 0, ARRAY_SIZE(buf));
-	CHECK_STATUS(status, NT_STATUS_OK);
+	status = smb2_util_write(tree, *h, buf, 0, ARRAY_SIZE(buf));
+	torture_assert_ntstatus_ok_goto(torture, status, ret, done,
+					"smb2_util_write failed");
 
-	ret = NT_STATUS_IS_OK(test_smb2_lock(tree, h, 0, 1, true));
-	EXPECTED(ret, true);
-	ret = NT_STATUS_IS_OK(test_smb2_unlock(tree, h, 0, 1));
-	EXPECTED(ret, true);
+	status = test_smb2_lock(tree, *h, 0, 1, true);
+	torture_assert_ntstatus_ok_goto(torture, status, ret, done,
+					"test_smb2_lock failed");
 
-	ret = NT_STATUS_IS_OK(test_smb2_lock(tree, h, 0, 1, true));
-	EXPECTED(ret, true);
-	ret = NT_STATUS_IS_OK(test_smb2_unlock(tree, h, 0, 1));
-	EXPECTED(ret, true);
+	status = test_smb2_unlock(tree, *h, 0, 1);
+	torture_assert_ntstatus_ok_goto(torture, status, ret, done,
+					"test_smb2_unlock failed");
+
+	status = test_smb2_lock(tree, *h, 0, 1, true);
+	torture_assert_ntstatus_ok_goto(torture, status, ret, done,
+					"test_smb2_lock failed");
+
+	status = test_smb2_unlock(tree, *h, 0, 1);
+	torture_assert_ntstatus_ok_goto(torture, status, ret, done,
+					"test_smb2_unlock failed");
 
 done:
-	smb2_util_close(tree, h);
+	if (h != NULL) {
+		smb2_util_close(tree, *h);
+	}
 	smb2_deltree(tree, BASEDIR);
-	return correct;
+	return ret;
 }
 
 /* basic testing of SMB2 locking
-- 
2.5.5

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 181 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20160808/4b0acb70/signature.sig>