[TDB] Patches for file and memory usage growth issues

Rusty Russell rusty at samba.org
Mon Apr 11 09:08:00 MDT 2011


On Mon, 11 Apr 2011 20:50:33 +1000, tridge at samba.org wrote:
>  > The second one needs discussion. I noticed that a transaction can
>  > substantially increase the size of a tdb, in my tests I had a 1.2G size
>  > TDB actually grow to a ~1.7G after a backup ... by not using a
>  > transaction to store the backup copy the same 1.2G TDB shrunk down to
>  > ~900MB. I know that the reason we use a transaction is to avoid rsyncs
>  > of partial DBs, so maybe we want to add an option on whether to use a
>  > transaction for the 2nd db ? Comments welcome.
> 
> I initially thought you were removing the transaction on the old db,
> but I see that you're only removing it on the new db. That should be
> OK, but we will need to at least get a write lock across the whole of
> tdb_new, and we should probably also do a fsync at the end on tdb_fd()
> before the close and rename.
> 
> Rusty, what do you think?

That seems like a good workaround to me.  But I wonder if TDB2 should
compress the recovery data.  I'll think about that...
 
>  > Finally I also am going to try to come to a solution for the huge memory
>  > usage caused by tdb_repack(). This function actually needs a lot of
>  > improvement. Firstly because it uses a transaction to copy the database
>  > twice do the repack.
> 
> tdb_repack() was really added for ctdb, and it suited the sizes of
> database and available memory for that application. As you have
> noticed, it is terrible for repacking a large persistent database.
> 
> Given that tdb 2.0 is just around the corner, perhaps you should work
> with Rusty to test to see if tdb_repack() is actually needed for sssd
> with the new tdb 2.0 format?

I didn't implement tdb_repack; I hope we won't get as fragmented as
TDB1 does (we do full coalescing).

Simo, here's my tdb_repack rewrite.  It's lightly tested, but I'd be
interested to see how much it helps you.  It's a bit more complex than
I'd hoped...

I uncovered a bug in the transaction code with TDB_CONVERT databases,
too.

The following changes since commit af45636166c7a0cb87630105d18ce489e7391525:

  Fix bug 8072 - PANIC: create_file_acl_common frees handle two times. (2011-04-09 02:05:15 +0200)

are available in the git repository at:
  ssh://git.samba.org/data/git/rusty/samba.git tdb-repack

Rusty Russell (2):
      tdb: fix transaction recovery area for converted tdbs.
      tdb: tdb_repack in place.

 lib/tdb/common/freelist.c    |   14 ++--
 lib/tdb/common/repack.c      |  208 ++++++++++++++++++++++++++++++++++++++++++
 lib/tdb/common/tdb.c         |   91 ------------------
 lib/tdb/common/tdb_private.h |    6 +
 lib/tdb/common/transaction.c |   50 +++++++----
 5 files changed, 254 insertions(+), 115 deletions(-)
 create mode 100644 lib/tdb/common/repack.c

commit 8412862673cce38ad473876f875174e8e40c53b3
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Tue Apr 12 00:19:56 2011 +0930

    tdb: fix transaction recovery area for converted tdbs.
    
    This is why macros are dangerous; these were converting the pointers, not the
    things pointed to!

diff --git a/lib/tdb/common/transaction.c b/lib/tdb/common/transaction.c
index c49de38..4bc8044 100644
--- a/lib/tdb/common/transaction.c
+++ b/lib/tdb/common/transaction.c
@@ -786,7 +786,7 @@ static int transaction_setup_recovery(struct tdb_context *tdb,
 	rec->data_len = recovery_size;
 	rec->rec_len  = recovery_max_size;
 	rec->key_len  = old_map_size;
-	CONVERT(rec);
+	CONVERT(*rec);
 
 	/* build the recovery data into a single blob to allow us to do a single
 	   large write, which should be more efficient */
@@ -833,7 +833,9 @@ static int transaction_setup_recovery(struct tdb_context *tdb,
 	/* and the tailer */
 	tailer = sizeof(*rec) + recovery_max_size;
 	memcpy(p, &tailer, 4);
-	CONVERT(p);
+	if (DOCONV()) {
+		tdb_convert(p, 4);
+	}
 
 	/* write the recovery data to the recovery area */
 	if (methods->tdb_write(tdb, recovery_offset, data, sizeof(*rec) + recovery_size) == -1) {

commit ce9083a11b5f761ea9383c9b7dc3fc70053350d6
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Tue Apr 12 00:26:09 2011 +0930

    tdb: tdb_repack in place.
    
    The current tdb_repack simply copies everything to an internal tdb, then
    erases the original and replaces everything.  Unfortunately, this requires
    about 3x as much memory as the original database.
    
    This code does a repack-in-place; the only real complication is avoiding
    overwriting the transaction recovery area (which cannot be altered inside
    a transaction).

diff --git a/lib/tdb/common/freelist.c b/lib/tdb/common/freelist.c
index 927078a..67b2e37 100644
--- a/lib/tdb/common/freelist.c
+++ b/lib/tdb/common/freelist.c
@@ -86,8 +86,8 @@ static int remove_from_freelist(struct tdb_context *tdb, tdb_off_t off, tdb_off_
 
 
 /* update a record tailer (must hold allocation lock) */
-static int update_tailer(struct tdb_context *tdb, tdb_off_t offset,
-			 const struct tdb_record *rec)
+int tdb_update_tailer(struct tdb_context *tdb, tdb_off_t offset,
+		      const struct tdb_record *rec)
 {
 	tdb_off_t totalsize;
 
@@ -106,7 +106,7 @@ int tdb_free(struct tdb_context *tdb, tdb_off_t offset, struct tdb_record *rec)
 		return -1;
 
 	/* set an initial tailer, so if we fail we don't leave a bogus record */
-	if (update_tailer(tdb, offset, rec) != 0) {
+	if (tdb_update_tailer(tdb, offset, rec) != 0) {
 		TDB_LOG((tdb, TDB_DEBUG_FATAL, "tdb_free: update_tailer failed!\n"));
 		goto fail;
 	}
@@ -129,7 +129,7 @@ int tdb_free(struct tdb_context *tdb, tdb_off_t offset, struct tdb_record *rec)
 				goto left;
 			}
 			rec->rec_len += sizeof(r) + r.rec_len;
-			if (update_tailer(tdb, offset, rec) == -1) {
+			if (tdb_update_tailer(tdb, offset, rec) == -1) {
 				TDB_LOG((tdb, TDB_DEBUG_FATAL, "tdb_free: update_tailer failed at %u\n", offset));
 				goto fail;
 			}
@@ -178,7 +178,7 @@ left:
 				TDB_LOG((tdb, TDB_DEBUG_FATAL, "tdb_free: update_left failed at %u\n", left));
 				goto fail;
 			}
-			if (update_tailer(tdb, left, &l) == -1) {
+			if (tdb_update_tailer(tdb, left, &l) == -1) {
 				TDB_LOG((tdb, TDB_DEBUG_FATAL, "tdb_free: update_tailer failed at %u\n", offset));
 				goto fail;
 			}
@@ -245,7 +245,7 @@ static tdb_off_t tdb_allocate_ofs(struct tdb_context *tdb,
 	if (tdb_rec_write(tdb, rec_ptr, rec) == -1) {
 		return 0;
 	}
-	if (update_tailer(tdb, rec_ptr, rec) == -1) {
+	if (tdb_update_tailer(tdb, rec_ptr, rec) == -1) {
 		return 0;
 	}
 
@@ -260,7 +260,7 @@ static tdb_off_t tdb_allocate_ofs(struct tdb_context *tdb,
 		return 0;
 	}
 
-	if (update_tailer(tdb, rec_ptr, rec) == -1) {
+	if (tdb_update_tailer(tdb, rec_ptr, rec) == -1) {
 		return 0;
 	}
 
diff --git a/lib/tdb/common/repack.c b/lib/tdb/common/repack.c
new file mode 100644
index 0000000..e5e7b7c
--- /dev/null
+++ b/lib/tdb/common/repack.c
@@ -0,0 +1,208 @@
+ /* 
+   Unix SMB/CIFS implementation.
+
+   trivial database library
+
+   Copyright (C) Rusty Russell			   2011
+
+     ** NOTE! The following LGPL license applies to the tdb
+     ** library. This does NOT imply that all of Samba is released
+     ** under the LGPL
+
+   This library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 3 of the License, or (at your option) any later version.
+
+   This library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with this library; if not, see <http://www.gnu.org/licenses/>.
+*/
+
+#include "tdb_private.h"
+
+static int add_free_filler_record(struct tdb_context *tdb,
+				  tdb_off_t off, tdb_off_t end)
+{
+	struct tdb_record r;
+
+	if (off == end) {
+		return 0;
+	}
+
+	r.magic = TDB_FREE_MAGIC;
+	r.rec_len = end - (off + sizeof(r));
+
+	if (tdb_update_tailer(tdb, off, &r) != 0) {
+		return -1;
+	}
+
+	/* Now, prepend to free list */
+	if (tdb_ofs_read(tdb, FREELIST_TOP, &r.next) == -1 ||
+	    tdb_rec_write(tdb, off, &r) == -1 ||
+	    tdb_ofs_write(tdb, FREELIST_TOP, &off) == -1) {
+		return -1;
+	}
+	return 0;
+}
+
+_PUBLIC_ int tdb_repack(struct tdb_context *tdb)
+{
+	tdb_off_t rec_ptr, off, recoff, prev;
+	tdb_len_t total_size = 0, limit = tdb->map_size, bufsize = 0;
+	unsigned int i;
+	const struct tdb_methods *direct = tdb->methods;
+	struct tdb_record recovery;
+	unsigned char *buf = NULL;
+
+	tdb_trace(tdb, "tdb_repack");
+
+	if (tdb_transaction_start(tdb) != 0) {
+		TDB_LOG((tdb, TDB_DEBUG_FATAL, __location__ " Failed to start transaction\n"));
+		return -1;
+	}
+
+	if (tdb_recovery_area(tdb, direct, &recoff, &recovery) == -1) {
+		goto fail;
+	}
+
+	/* Erase the free list. */
+	off = 0;
+	if (tdb_ofs_write(tdb, FREELIST_TOP, &off) == -1) {
+		goto fail;
+	}
+
+	off = TDB_DATA_START(tdb->header.hash_size);
+	for (i = 0; i < tdb->header.hash_size; i++) {
+		/* read in the hash top */
+		prev = TDB_HASH_TOP(i);
+		if (direct->tdb_read(tdb, prev, &rec_ptr,
+				     sizeof(rec_ptr), DOCONV()) == -1) {
+			goto fail;
+		}
+
+		/* keep looking until we find the right record */
+		while (rec_ptr) {
+			struct tdb_record r;
+			tdb_len_t totlen, min_alloc_len;
+			tdb_off_t end, next;
+
+			if (direct->tdb_read(tdb, rec_ptr, &r, sizeof(r),
+					     DOCONV()) != 0) {
+				goto fail;
+			}
+
+			if (TDB_DEAD(&r)) {
+				continue;
+			}
+
+			totlen = r.key_len + r.data_len;
+			/* See tdb_allocate: this is how it calculates. */
+			min_alloc_len = (totlen) * 1.25;
+			min_alloc_len += sizeof(tdb_off_t);
+			min_alloc_len = TDB_ALIGN(min_alloc_len, TDB_ALIGNMENT);
+
+			if (r.rec_len >= min_alloc_len) {
+				/* It has not expanded, so trim it now. */
+				r.rec_len = TDB_ALIGN(totlen
+						      + sizeof(tdb_off_t),
+						      TDB_ALIGNMENT);
+			}
+
+			end = off + sizeof(r) + r.rec_len;
+			/* Don't overwrite recovery header. */
+			if (recoff && end > recoff) {
+				if (add_free_filler_record(tdb, off, recoff)
+				    == -1) {
+					goto fail;
+				}
+				off = recoff + sizeof(recovery) +
+					recovery.rec_len;
+				end = off + sizeof(r) + r.rec_len;
+				recoff = 0;
+			}
+
+			/* Avoid sub-header-length spaces. */
+			if (recoff && end + sizeof(r) > recoff) {
+				end = recoff;
+				r.rec_len = end - (off + sizeof(r));
+			}
+
+			if (end + sizeof(r) > limit) {
+				end = limit;
+				r.rec_len = end - (off + sizeof(r));
+			}
+
+			/* Write new record as end of the list */
+			next = r.next;
+			r.next = 0;
+
+			/* Update previous to refer to this one. */
+			if (tdb_ofs_write(tdb, prev, &off) == -1) {
+				goto fail;
+			}
+
+			/* Write record header. */
+			if (tdb_rec_write(tdb, off, &r) == -1) {
+				goto fail;
+			}
+
+			if (totlen > bufsize) {
+				unsigned char *newbuf = realloc(buf, totlen);
+				if (!newbuf) {
+					tdb->ecode = TDB_ERR_OOM;
+					TDB_LOG((tdb, TDB_DEBUG_FATAL,
+						 "tdb_repack: failure to alloc"
+						 " %zu bytes", totlen));
+					goto fail;
+				}
+				buf = newbuf;
+				bufsize = totlen;
+			}
+			/* Copy record body. */
+			if (direct->tdb_read(tdb, rec_ptr + sizeof(r),
+					     buf, totlen, 0) != 0) {
+				goto fail;
+			}
+
+			if (tdb->methods->tdb_write(tdb, off + sizeof(r),
+						    buf, totlen) != 0) {
+				goto fail;
+			}
+			/* Set up tailer. */
+			if (tdb_update_tailer(tdb, off, &r) == -1) {
+				goto fail;
+			}
+			prev = off + offsetof(struct tdb_record, next);
+			off = end;
+
+			/* Simple loop detection. */
+			total_size += sizeof(r) + r.rec_len;
+			if (total_size > limit) {
+				tdb->ecode = TDB_ERR_CORRUPT;
+				TDB_LOG((tdb, TDB_DEBUG_FATAL,
+					 "tdb_repack: loop detected.\n"));
+				goto fail;
+			}
+
+			/* Move on to next one... */
+			rec_ptr = next;
+		}
+	}
+
+	if (add_free_filler_record(tdb, off, limit) == -1) {
+		goto fail;
+	}
+
+	free(buf);
+	return tdb_transaction_commit(tdb);
+
+fail:
+	tdb_transaction_cancel(tdb);
+	free(buf);
+	return -1;
+}
diff --git a/lib/tdb/common/tdb.c b/lib/tdb/common/tdb.c
index 66be555..33b6034 100644
--- a/lib/tdb/common/tdb.c
+++ b/lib/tdb/common/tdb.c
@@ -898,97 +898,6 @@ failed:
 	return -1;
 }
 
-struct traverse_state {
-	bool error;
-	struct tdb_context *dest_db;
-};
-
-/*
-  traverse function for repacking
- */
-static int repack_traverse(struct tdb_context *tdb, TDB_DATA key, TDB_DATA data, void *private_data)
-{
-	struct traverse_state *state = (struct traverse_state *)private_data;
-	if (tdb_store(state->dest_db, key, data, TDB_INSERT) != 0) {
-		state->error = true;
-		return -1;
-	}
-	return 0;
-}
-
-/*
-  repack a tdb
- */
-_PUBLIC_ int tdb_repack(struct tdb_context *tdb)
-{
-	struct tdb_context *tmp_db;
-	struct traverse_state state;
-
-	tdb_trace(tdb, "tdb_repack");
-
-	if (tdb_transaction_start(tdb) != 0) {
-		TDB_LOG((tdb, TDB_DEBUG_FATAL, __location__ " Failed to start transaction\n"));
-		return -1;
-	}
-
-	tmp_db = tdb_open("tmpdb", tdb_hash_size(tdb), TDB_INTERNAL, O_RDWR|O_CREAT, 0);
-	if (tmp_db == NULL) {
-		TDB_LOG((tdb, TDB_DEBUG_FATAL, __location__ " Failed to create tmp_db\n"));
-		tdb_transaction_cancel(tdb);
-		return -1;
-	}
-
-	state.error = false;
-	state.dest_db = tmp_db;
-
-	if (tdb_traverse_read(tdb, repack_traverse, &state) == -1) {
-		TDB_LOG((tdb, TDB_DEBUG_FATAL, __location__ " Failed to traverse copying out\n"));
-		tdb_transaction_cancel(tdb);
-		tdb_close(tmp_db);
-		return -1;		
-	}
-
-	if (state.error) {
-		TDB_LOG((tdb, TDB_DEBUG_FATAL, __location__ " Error during traversal\n"));
-		tdb_transaction_cancel(tdb);
-		tdb_close(tmp_db);
-		return -1;
-	}
-
-	if (tdb_wipe_all(tdb) != 0) {
-		TDB_LOG((tdb, TDB_DEBUG_FATAL, __location__ " Failed to wipe database\n"));
-		tdb_transaction_cancel(tdb);
-		tdb_close(tmp_db);
-		return -1;
-	}
-
-	state.error = false;
-	state.dest_db = tdb;
-
-	if (tdb_traverse_read(tmp_db, repack_traverse, &state) == -1) {
-		TDB_LOG((tdb, TDB_DEBUG_FATAL, __location__ " Failed to traverse copying back\n"));
-		tdb_transaction_cancel(tdb);
-		tdb_close(tmp_db);
-		return -1;		
-	}
-
-	if (state.error) {
-		TDB_LOG((tdb, TDB_DEBUG_FATAL, __location__ " Error during second traversal\n"));
-		tdb_transaction_cancel(tdb);
-		tdb_close(tmp_db);
-		return -1;
-	}
-
-	tdb_close(tmp_db);
-
-	if (tdb_transaction_commit(tdb) != 0) {
-		TDB_LOG((tdb, TDB_DEBUG_FATAL, __location__ " Failed to commit\n"));
-		return -1;
-	}
-
-	return 0;
-}
-
 /* Even on files, we can get partial writes due to signals. */
 bool tdb_write_all(int fd, const void *buf, size_t count)
 {
diff --git a/lib/tdb/common/tdb_private.h b/lib/tdb/common/tdb_private.h
index 0186fb9..a5f671a 100644
--- a/lib/tdb/common/tdb_private.h
+++ b/lib/tdb/common/tdb_private.h
@@ -238,6 +238,10 @@ void tdb_release_transaction_locks(struct tdb_context *tdb);
 int tdb_transaction_lock(struct tdb_context *tdb, int ltype,
 			 enum tdb_lock_flags lockflags);
 int tdb_transaction_unlock(struct tdb_context *tdb, int ltype);
+int tdb_recovery_area(struct tdb_context *tdb,
+		      const struct tdb_methods *methods,
+		      tdb_off_t *recovery_offset,
+		      struct tdb_record *rec);
 int tdb_allrecord_lock(struct tdb_context *tdb, int ltype,
 		       enum tdb_lock_flags flags, bool upgradable);
 int tdb_allrecord_unlock(struct tdb_context *tdb, int ltype, bool mark_lock);
@@ -249,6 +253,8 @@ int tdb_ofs_write(struct tdb_context *tdb, tdb_off_t offset, tdb_off_t *d);
 void *tdb_convert(void *buf, uint32_t size);
 int tdb_free(struct tdb_context *tdb, tdb_off_t offset, struct tdb_record *rec);
 tdb_off_t tdb_allocate(struct tdb_context *tdb, tdb_len_t length, struct tdb_record *rec);
+int tdb_update_tailer(struct tdb_context *tdb, tdb_off_t offset,
+		      const struct tdb_record *rec);
 int tdb_ofs_read(struct tdb_context *tdb, tdb_off_t offset, tdb_off_t *d);
 int tdb_ofs_write(struct tdb_context *tdb, tdb_off_t offset, tdb_off_t *d);
 int tdb_lock_record(struct tdb_context *tdb, tdb_off_t off);
diff --git a/lib/tdb/common/transaction.c b/lib/tdb/common/transaction.c
index 4bc8044..b6cf771 100644
--- a/lib/tdb/common/transaction.c
+++ b/lib/tdb/common/transaction.c
@@ -658,6 +658,34 @@ static tdb_len_t tdb_recovery_size(struct tdb_context *tdb)
 	return recovery_size;
 }
 
+int tdb_recovery_area(struct tdb_context *tdb,
+		      const struct tdb_methods *methods,
+		      tdb_off_t *recovery_offset,
+		      struct tdb_record *rec)
+{
+	if (tdb_ofs_read(tdb, TDB_RECOVERY_HEAD, recovery_offset) == -1) {
+		return -1;
+	}
+
+	if (*recovery_offset == 0) {
+		rec->rec_len = 0;
+		return 0;
+	}
+
+	if (methods->tdb_read(tdb, *recovery_offset, rec, sizeof(*rec),
+			      DOCONV()) == -1) {
+		return -1;
+	}
+
+	/* ignore invalid recovery regions: can happen in crash */
+	if (rec->magic != TDB_RECOVERY_MAGIC &&
+	    rec->magic != TDB_RECOVERY_INVALID_MAGIC) {
+		*recovery_offset = 0;
+		rec->rec_len = 0;
+	}
+	return 0;
+}
+
 /*
   allocate the recovery area, or use an existing recovery area if it is
   large enough
@@ -671,25 +699,11 @@ static int tdb_recovery_allocate(struct tdb_context *tdb,
 	const struct tdb_methods *methods = tdb->transaction->io_methods;
 	tdb_off_t recovery_head;
 
-	if (tdb_ofs_read(tdb, TDB_RECOVERY_HEAD, &recovery_head) == -1) {
+	if (tdb_recovery_area(tdb, methods, &recovery_head, &rec) == -1) {
 		TDB_LOG((tdb, TDB_DEBUG_FATAL, "tdb_recovery_allocate: failed to read recovery head\n"));
 		return -1;
 	}
 
-	rec.rec_len = 0;
-
-	if (recovery_head != 0) {
-		if (methods->tdb_read(tdb, recovery_head, &rec, sizeof(rec), DOCONV()) == -1) {
-			TDB_LOG((tdb, TDB_DEBUG_FATAL, "tdb_recovery_allocate: failed to read recovery record\n"));
-			return -1;
-		}
-		/* ignore invalid recovery regions: can happen in crash */
-		if (rec.magic != TDB_RECOVERY_MAGIC &&
-		    rec.magic != TDB_RECOVERY_INVALID_MAGIC) {
-			recovery_head = 0;
-		}
-	}
-
 	*recovery_size = tdb_recovery_size(tdb);
 
 	if (recovery_head != 0 && *recovery_size <= rec.rec_len) {


More information about the samba-technical mailing list