[RFC] tdb_traverse_read_lite()

Volker Lendecke Volker.Lendecke at SerNet.DE
Tue Mar 5 05:04:06 MST 2013


On Mon, Mar 04, 2013 at 05:57:15PM +1030, Rusty Russell wrote:
> Volker Lendecke <Volker.Lendecke at SerNet.DE> writes:
> > On Thu, Feb 28, 2013 at 02:11:18PM +1030, Rusty Russell wrote:
> >> Amitay asked me to look at a lightweight traverse for tdb.  This is my
> >> first attempt (against his ctdb tree, but it's pretty simple).
> >
> > What about doing a traverse by chain? Sure, not atomic
> > whatsoever, but potentially even less intrusive.
> >
> > Volker
> 
> I'm not sure what you mean... this traverses one chain at a time, but it
> holds the lock for that chain across the traverse function.

Attached find a version of your code that actually applies
against master and builds there.

In particular for the vacuuming code in ctdb I think it
might be valuable to have a traverse function that is even
lighter. The traverse_read_chain might even be optionally
non-blocking. This way ctdb could much better control to
spend x % of its time in vacuuming or could be more generous
to other daemons on the system in a high-load situation. It
could walk all chains that are not locked and only if a
chain was not taken care of for n seconds it could try the
blocking variant.

With best regards,

Volker Lendecke

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de

**********************************************************
visit us at CeBIT: March 5th - 9th 2013, hall 6, booth E15
all about SAMBA and verinice, firewalls, Linux and Windows
free tickets available via email here : cebit at sernet.com !
**********************************************************
-------------- next part --------------
From ddbf1e130d8af85189136f4c71c4b4287c22862a Mon Sep 17 00:00:00 2001
From: Rusty Russell <rusty at rustcorp.com.au>
Date: Thu, 28 Feb 2013 14:06:53 +1030
Subject: [PATCH] tdb_traverse_read_lite(): lightweight traverse for tdb.

This holds the lock while calling the function, which is much faster
and avoids the "what happens if someone deletes underneath" case.  For
short hash chains this should be a performance win.

Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>
---
 lib/tdb/common/traverse.c  |   81 ++++++++++++++++++++++++++++++++++++++++++++
 lib/tdb/include/tdb.h      |   25 ++++++++++++++
 lib/tdb/tools/tdbtorture.c |    8 +++++
 3 files changed, 114 insertions(+)

diff --git a/lib/tdb/common/traverse.c b/lib/tdb/common/traverse.c
index a843359..2bac689 100644
--- a/lib/tdb/common/traverse.c
+++ b/lib/tdb/common/traverse.c
@@ -208,6 +208,87 @@ out:
 		return count;
 }
 
+/* We could do write, but we'd have to stash rec.next and fixup on delete */
+static int do_traverse_read_lite(struct tdb_context *tdb,
+				 tdb_traverse_func fn, void *private_data)
+{
+	int count = 0;
+	uint32_t i;
+
+	for (i = 0, tdb->methods->next_hash_chain(tdb, &i);
+	     i < tdb->hash_size;
+	     i++, tdb->methods->next_hash_chain(tdb, &i)) {
+		tdb_off_t rec_ptr;
+		TDB_DATA key, dbuf;
+		struct tdb_record rec;
+
+		if (tdb_lock(tdb, i, F_RDLCK) != 0) {
+			goto fail;
+		}
+		if (tdb_ofs_read(tdb, TDB_HASH_TOP(i), &rec_ptr) != 0) {
+			goto unlock_fail;
+		}
+		while (rec_ptr) {
+			if (tdb_rec_read(tdb, rec_ptr, &rec) == -1) {
+				goto unlock_fail;
+			}
+
+			if (TDB_DEAD(&rec)) {
+				rec_ptr = rec.next;
+				continue;
+			}
+			/* now read the full record */
+			key.dptr = tdb_alloc_read(tdb,
+						  rec_ptr + sizeof(rec),
+						  rec.key_len + rec.data_len);
+			if (!key.dptr) {
+				goto unlock_fail;
+			}
+
+			key.dsize = rec.key_len;
+			dbuf.dptr = key.dptr + rec.key_len;
+			dbuf.dsize = rec.data_len;
+
+			count++;
+			if (fn && fn(tdb, key, dbuf, private_data)) {
+				SAFE_FREE(key.dptr);
+				tdb_unlock(tdb, i, F_RDLCK);
+				goto out;
+			}
+			SAFE_FREE(key.dptr);
+			rec_ptr = rec.next;
+		}
+		if (tdb_unlock(tdb, i, F_RDLCK) != 0) {
+			goto fail;
+		}
+	}
+
+out:
+	return count;
+
+unlock_fail:
+	tdb_unlock(tdb, i, F_RDLCK);
+fail:
+	return -1;
+}
+
+/*
+  a read traverse, which doesn't drop the lock calling fn().  Great if
+  fn() is fast, since traverse is simpler and faster, but will block
+  any write accesses to that chain.
+ */
+_PUBLIC_ int tdb_traverse_read_lite(struct tdb_context *tdb,
+				    tdb_traverse_func fn, void *private_data)
+{
+	int ret;
+
+	/* This enables safety checks that they don't write to db */
+	tdb->traverse_read++;
+	tdb_trace(tdb, "tdb_traverse_read_lite");
+	ret = do_traverse_read_lite(tdb, fn, private_data);
+	tdb->traverse_read--;
+	return ret;
+}
 
 /*
   a read style traverse - temporarily marks the db read only
diff --git a/lib/tdb/include/tdb.h b/lib/tdb/include/tdb.h
index e371e33..299f7b1 100644
--- a/lib/tdb/include/tdb.h
+++ b/lib/tdb/include/tdb.h
@@ -441,6 +441,31 @@ int tdb_traverse(struct tdb_context *tdb, tdb_traverse_func fn, void *private_da
 int tdb_traverse_read(struct tdb_context *tdb, tdb_traverse_func fn, void *private_data);
 
 /**
+ * @brief Traverse the entire database.
+ *
+ * While traversing the database the function fn(tdb, key, data, state) is
+ * called on each element, but marking the database read only during the
+ * traversal, so any write operations will fail. This allows tdb to use read
+ * locks, which increases the parallelism possible during the traversal.
+ *
+ * Use tdb_traverse_read_lite() version if the callback function fn is very
+ * light-weight. tdb_traverse_read() drops the chainlock while the callback is
+ * called to allow longer-running callbacks. tdb_traverse_read_lite() does not
+ * drop the chainlock while the callback is called, reducing the locking
+ * overhead but potentially increasing the latency for other tdb users.
+ *
+ * @param[in]  tdb      The database to traverse.
+ *
+ * @param[in]  fn       The function to call on each entry.
+ *
+ * @param[in]  private_data The private data which should be passed to the
+ *                          traversing function.
+ *
+ * @return              The record count traversed, -1 on error.
+ */
+int tdb_traverse_read_lite(struct tdb_context *tdb, tdb_traverse_func fn, void *);
+
+/**
  * @brief Check if an entry in the database exists.
  *
  * @note 1 is returned if the key is found and 0 is returned if not found this
diff --git a/lib/tdb/tools/tdbtorture.c b/lib/tdb/tools/tdbtorture.c
index 760ad59..3e0ac92 100644
--- a/lib/tdb/tools/tdbtorture.c
+++ b/lib/tdb/tools/tdbtorture.c
@@ -22,6 +22,7 @@
 #define LOCKSTORE_PROB 5
 #define TRAVERSE_PROB 20
 #define TRAVERSE_READ_PROB 20
+#define TRAVERSE_READ_LITE_PROB 20
 #define CULL_PROB 100
 #define KEYLEN 3
 #define DATALEN 100
@@ -199,6 +200,13 @@ static void addrec_db(void)
 	}
 #endif
 
+#if TRAVERSE_READ_LITE_PROB
+	if (random() % TRAVERSE_READ_LITE_PROB == 0) {
+		tdb_traverse_read_lite(db, NULL, NULL);
+		goto next;
+	}
+#endif
+
 	data = tdb_fetch(db, key);
 	if (data.dptr) free(data.dptr);
 
-- 
1.7.9.5



More information about the samba-technical mailing list