[RFC] tdb_traverse_read_lite()

Volker Lendecke Volker.Lendecke at SerNet.DE
Tue Mar 5 05:52:02 MST 2013


On Tue, Mar 05, 2013 at 01:33:18PM +0100, Stefan (metze) Metzmacher wrote:
> Am 05.03.2013 13:04, schrieb Volker Lendecke:
> > On Mon, Mar 04, 2013 at 05:57:15PM +1030, Rusty Russell wrote:
> >> Volker Lendecke <Volker.Lendecke at SerNet.DE> writes:
> >>> On Thu, Feb 28, 2013 at 02:11:18PM +1030, Rusty Russell wrote:
> >>>> Amitay asked me to look at a lightweight traverse for tdb.  This is my
> >>>> first attempt (against his ctdb tree, but it's pretty simple).
> >>>
> >>> What about doing a traverse by chain? Sure, not atomic
> >>> whatsoever, but potentially even less intrusive.
> >>>
> >>> Volker
> >>
> >> I'm not sure what you mean... this traverses one chain at a time, but it
> >> holds the lock for that chain across the traverse function.
> > 
> > Attached find a version of your code that actually applies
> > against master and builds there.
> > 
> > In particular for the vacuuming code in ctdb I think it
> > might be valuable to have a traverse function that is even
> > lighter. The traverse_read_chain might even be optionally
> > non-blocking. This way ctdb could much better control to
> > spend x % of its time in vacuuming or could be more generous
> > to other daemons on the system in a high-load situation. It
> > could walk all chains that are not locked and only if a
> > chain was not taken care of for n seconds it could try the
> > blocking variant.
> 
> I'd also prefer a chain traverse function, that could be used
> in a lot of places to reduce the cleanup costs.

While looking closer: The traverse_read_lite code seems to
not do the fancy locking that the next_hash_chain callback
assumes. See the big comment regarding empty hash chains in
the tdb_next_lock routine. Don't we also have to lock at
least once, like traverse_read_lite does? Also,
traverse_lite does not do the "Detect tight infinite loops"
thing, right?

Next one: Attached find an untested patch that makes
traverse_lite avoid allocating the whole thing. Instead, it
uses tdb_parse_data. malloc/free are not entirely free, they
can easily show up in profiles.

With best regards,

Volker Lendecke

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de

**********************************************************
visit us at CeBIT: March 5th - 9th 2013, hall 6, booth E15
all about SAMBA and verinice, firewalls, Linux and Windows
free tickets available via email here : cebit at sernet.com !
**********************************************************
-------------- next part --------------
From 883764a84d1237d7375b55ad17ed5c1b3d6a40b3 Mon Sep 17 00:00:00 2001
From: Volker Lendecke <vl at samba.org>
Date: Tue, 5 Mar 2013 13:49:01 +0100
Subject: [PATCH] tdb: Avoid malloc/free in traverse_read_lite

---
 lib/tdb/common/traverse.c |   51 ++++++++++++++++++++++++++++++++------------
 1 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/lib/tdb/common/traverse.c b/lib/tdb/common/traverse.c
index 2bac689..e75993a 100644
--- a/lib/tdb/common/traverse.c
+++ b/lib/tdb/common/traverse.c
@@ -208,18 +208,43 @@ out:
 		return count;
 }
 
+struct traverse_read_lite_state {
+	struct tdb_context *tdb;
+	tdb_traverse_func fn;
+	void *private_data;
+	size_t key_length;
+};
+
+static int traverse_read_lite_parser(TDB_DATA key, TDB_DATA data,
+				     void *private_data)
+{
+	struct traverse_read_lite_state *state =
+		(struct traverse_read_lite_state *)private_data;
+
+	key.dptr = data.dptr;
+	key.dsize = state->key_length;
+	data.dptr += state->key_length;
+	data.dsize -= state->key_length;
+
+	return state->fn(state->tdb, key, data, state->private_data);
+}
+
 /* We could do write, but we'd have to stash rec.next and fixup on delete */
 static int do_traverse_read_lite(struct tdb_context *tdb,
 				 tdb_traverse_func fn, void *private_data)
 {
+	struct traverse_read_lite_state state;
 	int count = 0;
 	uint32_t i;
 
+	state.tdb = tdb;
+	state.fn = fn;
+	state.private_data = private_data;
+
 	for (i = 0, tdb->methods->next_hash_chain(tdb, &i);
 	     i < tdb->hash_size;
 	     i++, tdb->methods->next_hash_chain(tdb, &i)) {
 		tdb_off_t rec_ptr;
-		TDB_DATA key, dbuf;
 		struct tdb_record rec;
 
 		if (tdb_lock(tdb, i, F_RDLCK) != 0) {
@@ -237,25 +262,23 @@ static int do_traverse_read_lite(struct tdb_context *tdb,
 				rec_ptr = rec.next;
 				continue;
 			}
-			/* now read the full record */
-			key.dptr = tdb_alloc_read(tdb,
-						  rec_ptr + sizeof(rec),
-						  rec.key_len + rec.data_len);
-			if (!key.dptr) {
-				goto unlock_fail;
+
+			count++;
+
+			if (fn == NULL) {
+				rec_ptr = rec.next;
+				continue;
 			}
 
-			key.dsize = rec.key_len;
-			dbuf.dptr = key.dptr + rec.key_len;
-			dbuf.dsize = rec.data_len;
+			state.key_length = rec.key_len;
 
-			count++;
-			if (fn && fn(tdb, key, dbuf, private_data)) {
-				SAFE_FREE(key.dptr);
+			if (tdb_parse_data(
+				    tdb, tdb_null, rec_ptr + sizeof(rec),
+				    rec.key_len + rec.data_len,
+				    traverse_read_lite_parser, &state)) {
 				tdb_unlock(tdb, i, F_RDLCK);
 				goto out;
 			}
-			SAFE_FREE(key.dptr);
 			rec_ptr = rec.next;
 		}
 		if (tdb_unlock(tdb, i, F_RDLCK) != 0) {
-- 
1.7.3.4



More information about the samba-technical mailing list