[SCM] Samba Shared Repository - branch master updated

Rusty Russell rusty at samba.org
Tue Feb 23 22:57:52 MST 2010


The branch, master has been updated
       via  ec96ea6... tdb: handle processes dying during transaction commit.
       via  1bf482b... patch tdb-refactor-tdb_lock-and-tdb_lock_nonblock.patch
       via  ececeff... tdb: add -k option to tdbtorture
       via  8c3fda4... tdb: don't truncate tdb on recovery
       via  9f295ee... tdb: remove lock ops
       via  a84222b... tdb: rename tdb_release_extra_locks() to tdb_release_transaction_locks()
       via  dd1b508... tdb: cleanup: remove ltype argument from _tdb_transaction_cancel.
       via  fca1621... tdb: tdb_allrecord_lock/tdb_allrecord_unlock/tdb_allrecord_upgrade
       via  caaf5c6... tdb: suppress record write locks when allrecord lock is taken.
       via  9341f23... tdb: cleanup: always grab allrecord lock to infinity.
       via  1ab8776... tdb: remove num_locks
       via  d48c3e4... tdb: use tdb_nest_lock() for seqnum lock.
       via  4738d47... tdb: use tdb_nest_lock() for active lock.
       via  9136818... tdb: use tdb_nest_lock() for open lock.
       via  e8fa70a... tdb: use tdb_nest_lock() for transaction lock.
       via  ce41411... tdb: cleanup: find_nestlock() helper.
       via  db27073... tdb: cleanup: tdb_release_extra_locks() helper
       via  fba42f1... tdb: cleanup: tdb_have_extra_locks() helper
       via  b754f61... tdb: don't suppress the transaction lock because of the allrecord lock.
       via  5d9de60... tdb: cleanup: tdb_nest_lock/tdb_nest_unlock
       via  e9114a7... tdb: cleanup: rename global_lock to allrecord_lock.
       via  7ab422d... tdb: cleanup: rename GLOBAL_LOCK to OPEN_LOCK.
       via  a6e0ef8... tdb: make _tdb_transaction_cancel static.
       via  452b4a5... tdb: cleanup: split brlock and brunlock methods.
      from  fffdce6... s4/schema: Move msDS-IntId implementation to samldb.c module

http://gitweb.samba.org/?p=samba.git;a=shortlog;h=master


- Log -----------------------------------------------------------------
commit ec96ea690edbe3398d690b4a953d487ca1773f1c
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 13:23:58 2010 +1030

    tdb: handle processes dying during transaction commit.
    
    tdb transactions were designed to be robust against the machine
    powering off, but interestingly were never designed to handle the case
    where an administrator kill -9's a process during commit.  Because
    recovery is only done on tdb_open, processes with the tdb already
    mapped will simply use it despite it being corrupt and needing
    recovery.
    
    The solution to this is to check for recovery every time we grab a
    data lock: we could have gained the lock because a process just died.
    This has no measurable cost: here is the time for tdbtorture -s 0 -n 1
    -l 10000:
    
    Before:
    	2.75 2.50 2.81 3.19 2.91 2.53 2.72 2.50 2.78 2.77 = Avg 2.75
    
    After:
    	2.81 2.57 3.42 2.49 3.02 2.49 2.84 2.48 2.80 2.43 = Avg 2.74
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 1bf482b9ef9ec73dd7ee4387d7087aa3955503dd
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 13:18:06 2010 +1030

    patch tdb-refactor-tdb_lock-and-tdb_lock_nonblock.patch

commit ececeffd85db1b27c07cdf91a921fd203006daf6
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 10:53:05 2010 +1030

    tdb: add -k option to tdbtorture
    
    To test the case of death of a process during transaction commit, add
    a -k (kill random) option to tdbtorture.  The easiest way to do this
    is to make every worker a child (unless there's only one child), which
    is why this patch is bigger than you might expect.
    
    Using -k without -t (always use transactions), you expect corruption,
    though it doesn't happen every time.  With -t, we currently get
    corruption, but the next patch fixes that.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 8c3fda4318adc71899bc41486d5616da3a91a688
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 10:50:41 2010 +1030

    tdb: don't truncate tdb on recovery
    
    The current recovery code truncates the tdb file on recovery.  This is
    fine if recovery is only done on first open, but is a really bad idea
    as we move to allowing recovery on "live" databases.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 9f295eecffd92e55584fc36539cd85cd32c832de
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 10:49:22 2010 +1030

    tdb: remove lock ops
    
    Now that the transaction code uses the standard allrecord lock, which
    stops us from trying to grab any per-record locks anyway, we don't
    need special no-op lock ops for transactions.
    
    This is a nice simplification: if you see brlock, you know it's really
    going to grab a lock.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit a84222bbaf9ed2c7b9c61b8157b2e3c85f17fa32
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 11:02:55 2010 +1030

    tdb: rename tdb_release_extra_locks() to tdb_release_transaction_locks()
    
    tdb_release_extra_locks() is too general: it carefully skips over the
    transaction lock, even though the only caller then drops it.  Change
    this, and rename it to show it's clearly transaction-specific.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit dd1b508c63034452673dbfee9956f52a1b6c90a5
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 12:42:24 2010 +1030

    tdb: cleanup: remove ltype argument from _tdb_transaction_cancel.
    
    Now that the transaction allrecord lock is the standard one, and thus
    is cleaned up in tdb_release_extra_locks(), _tdb_transaction_cancel()
    doesn't need to know what type it is.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit fca1621965c547e2d076eca2a2599e9629f91266
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 15:42:15 2010 +1030

    tdb: tdb_allrecord_lock/tdb_allrecord_unlock/tdb_allrecord_upgrade
    
    Centralize locking of all chains of the tdb; rename _tdb_lockall to
    tdb_allrecord_lock and _tdb_unlockall to tdb_allrecord_unlock, and
    tdb_brlock_upgrade to tdb_allrecord_upgrade.
    
    Then we use this in the transaction code.  Unfortunately, if the transaction
    code records that it has grabbed the allrecord lock read-only, write locks
    will fail, so we treat this upgradable lock as a write lock, and mark it
    as upgradable using the otherwise-unused offset field.
    
    One subtlety: now that the transaction code is using the allrecord_lock,
    tdb_release_extra_locks() function drops it for us, so we no longer need
    to do it manually in _tdb_transaction_cancel.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit caaf5c6baa1a4f340c1f38edd99b3a8b56621b8b
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 10:45:26 2010 +1030

    tdb: suppress record write locks when allrecord lock is taken.
    
    Records themselves get (read) locked by the traversal code against delete.
    Interestingly, this locking isn't done when the allrecord lock has been
    taken, though the allrecord lock until recently didn't cover the actual
    records (it now goes to end of file).
    
    The write record lock, grabbed by the delete code, is not suppressed
    by the allrecord lock.  This is now bad: it causes us to punch a hole
    in the allrecord lock when we release the write record lock.  Make this
    consistent: *no* record locks of any kind when the allrecord lock is
    taken.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 9341f230f8968b4b18e451d15dda5ccbe7787768
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 10:45:14 2010 +1030

    tdb: cleanup: always grab allrecord lock to infinity.
    
    We were previously inconsistent with our "global" lock: the
    transaction code grabbed it from FREELIST_TOP to end of file, and the
    rest of the code grabbed it from FREELIST_TOP to end of the hash
    chains.  Change it to always grab to end of file for simplicity and
    so we can merge the two.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 1ab8776247f89b143b6e58f4b038ab4bcea20d3a
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 15:01:07 2010 +1030

    tdb: remove num_locks
    
    This was redundant before this patch series: it mirrored num_lockrecs
    exactly.  It still does.
    
    Also, skip the useless branch when locks == 1: an unconditional
    assignment is cheaper anyway.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit d48c3e4982a38fb6b568ed3903e55e07a0fe5ca6
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 12:40:57 2010 +1030

    tdb: use tdb_nest_lock() for seqnum lock.
    
    This is pure overhead, but it centralizes the locking.  Realloc (esp. as
    most implementations are lazy) is fast compared to the fcntl anyway.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 4738d474c412cc59d26fcea64007e99094e8b675
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 10:44:40 2010 +1030

    tdb: use tdb_nest_lock() for active lock.
    
    Use our newly-generic nested lock tracking for the active lock.
    
    Note that the tdb_have_extra_locks() and tdb_release_extra_locks()
    functions have to skip over this lock now that it is tracked.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 9136818df30c7179e1cffa18201cdfc990ebd7b7
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Mon Feb 22 13:58:07 2010 +1030

    tdb: use tdb_nest_lock() for open lock.
    
    This never nests, so it's overkill, but it centralizes the locking into
    lock.c and removes the ugly flag in the transaction code to track whether
    we have the lock or not.
    
    Note that we have a temporary hack so this places a real lock, despite
    the fact that we are in a transaction.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit e8fa70a321d489b454b07bd65e9b0d95084168de
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 12:37:34 2010 +1030

    tdb: use tdb_nest_lock() for transaction lock.
    
    Rather than a boutique lock and a separate nest count, use our
    newly-generic nested lock tracking for the transaction lock.
    
    Note that the tdb_have_extra_locks() and tdb_release_extra_locks()
    functions have to skip over this lock now that it is tracked.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit ce41411c84760684ce539b6a302a0623a6a78a72
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 12:35:54 2010 +1030

    tdb: cleanup: find_nestlock() helper.
    
    Factor out two loops which find locks; we are going to introduce a couple
    more so a helper makes sense.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit db270734d8b4208e00ce9de5af1af7ee11823f6d
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 10:41:15 2010 +1030

    tdb: cleanup: tdb_release_extra_locks() helper
    
    Move locking intelligence back into lock.c, rather than open-coding the
    lock release in transaction.c.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit fba42f1fb4f81b8913cce5a23ca5350ba45f40e1
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 12:34:26 2010 +1030

    tdb: cleanup: tdb_have_extra_locks() helper
    
    In many places we check whether locks are held: add a helper to do this.
    
    The _tdb_lockall() case has already checked for the allrecord lock, so
    the extra work done by tdb_have_extra_locks() is merely redundant.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit b754f61d235bdc3e410b60014d6be4072645e16f
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 12:31:49 2010 +1030

    tdb: don't suppress the transaction lock because of the allrecord lock.
    
    tdb_transaction_lock() and tdb_transaction_unlock() do nothing if we
    hold the allrecord lock.  However, the two locks don't overlap, so
    this is wrong.
    
    This simplification makes the transaction lock a straightforward nested
    lock.
    
    There are two callers for these functions:
    1) The transaction code, which already makes sure the allrecord_lock
       isn't held.
    2) The traverse code, which wants to stop transactions whether it has the
       allrecord lock or not.  There have been deadlocks here before;
       however, this should not bring them back (I hope!)
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 5d9de604d92d227899e9b861c6beafb2e4fa61e0
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 12:26:13 2010 +1030

    tdb: cleanup: tdb_nest_lock/tdb_nest_unlock
    
    Because fcntl locks don't nest, we track them in the tdb->lockrecs array
    and only place/release them when the count goes to 1/0.  We only do this
    for record locks, so we simply place the list number (or -1 for the free
    list) in the structure.
    
    To generalize this:
    
    1) Put the offset rather than list number in struct tdb_lock_type.
    2) Rename _tdb_lock() to tdb_nest_lock, make it non-static and move the
       allrecord check out to the callers (except the mark case which doesn't
       care).
    3) Rename _tdb_unlock() to tdb_nest_unlock(), make it non-static and
       move the allrecord out to the callers (except mark again).
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit e9114a758538d460d4f9deae5ce631bf44b1eff8
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 12:19:47 2010 +1030

    tdb: cleanup: rename global_lock to allrecord_lock.
    
    The word global is overloaded in tdb.  The global_lock inside struct
    tdb_context is used to indicate we hold a lock across all the chains.
    
    Rename it to allrecord_lock.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 7ab422d6fbd4f8be02838089a41f872d538ee7a7
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 12:18:33 2010 +1030

    tdb: cleanup: rename GLOBAL_LOCK to OPEN_LOCK.
    
    The word global is overloaded in tdb.  The GLOBAL_LOCK offset is used at
    open time to serialize initialization (and by the transaction code to block
    open).
    
    Rename it to OPEN_LOCK.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit a6e0ef87d25734760fe77b87a9fd11db56760955
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 24 10:39:59 2010 +1030

    tdb: make _tdb_transaction_cancel static.
    
    Now tdb_open() calls tdb_transaction_cancel() instead of
    _tdb_transaction_cancel, we can make it static.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

commit 452b4a5a6efeecfb5c83475f1375ddc25bcddfbe
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Wed Feb 17 12:17:19 2010 +1030

    tdb: cleanup: split brlock and brunlock methods.
    
    This is taken from the CCAN code base: rather than using tdb_brlock for
    locking and unlocking, we split it into brlock and brunlock functions.
    
    For extra debugging information, brunlock says what kind of lock it is
    unlocking (even though fcntl locks don't need this).  This requires an
    extra argument to tdb_transaction_unlock() so we know whether the
    lock was upgraded to a write lock or not.
    
    We also use a "flags" argument to tdb_brlock:
    1) TDB_LOCK_NOWAIT replaces lck_type = F_SETLK (vs F_SETLKW).
    2) TDB_LOCK_MARK_ONLY replaces setting TDB_MARK_LOCK bit in ltype.
    3) TDB_LOCK_PROBE replaces the "probe" argument.
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>

-----------------------------------------------------------------------

Summary of changes:
 lib/tdb/common/io.c          |    1 -
 lib/tdb/common/lock.c        |  578 +++++++++++++++++++++++++++++-------------
 lib/tdb/common/open.c        |   32 ++-
 lib/tdb/common/tdb.c         |    7 +-
 lib/tdb/common/tdb_private.h |   39 ++-
 lib/tdb/common/transaction.c |  107 +++-----
 lib/tdb/common/traverse.c    |    4 +-
 lib/tdb/tools/tdbtorture.c   |  199 +++++++++++----
 8 files changed, 636 insertions(+), 331 deletions(-)


Changeset truncated at 500 lines:

diff --git a/lib/tdb/common/io.c b/lib/tdb/common/io.c
index d549715..5b20fa1 100644
--- a/lib/tdb/common/io.c
+++ b/lib/tdb/common/io.c
@@ -461,7 +461,6 @@ static const struct tdb_methods io_methods = {
 	tdb_next_hash_chain,
 	tdb_oob,
 	tdb_expand_file,
-	tdb_brlock
 };
 
 /*
diff --git a/lib/tdb/common/lock.c b/lib/tdb/common/lock.c
index 0984e51..65d6843 100644
--- a/lib/tdb/common/lock.c
+++ b/lib/tdb/common/lock.c
@@ -27,13 +27,104 @@
 
 #include "tdb_private.h"
 
-#define TDB_MARK_LOCK 0x80000000
-
 void tdb_setalarm_sigptr(struct tdb_context *tdb, volatile sig_atomic_t *ptr)
 {
 	tdb->interrupt_sig_ptr = ptr;
 }
 
+static int fcntl_lock(struct tdb_context *tdb,
+		      int rw, off_t off, off_t len, bool waitflag)
+{
+	struct flock fl;
+
+	fl.l_type = rw;
+	fl.l_whence = SEEK_SET;
+	fl.l_start = off;
+	fl.l_len = len;
+	fl.l_pid = 0;
+
+	if (waitflag)
+		return fcntl(tdb->fd, F_SETLKW, &fl);
+	else
+		return fcntl(tdb->fd, F_SETLK, &fl);
+}
+
+static int fcntl_unlock(struct tdb_context *tdb, int rw, off_t off, off_t len)
+{
+	struct flock fl;
+#if 0 /* Check they matched up locks and unlocks correctly. */
+	char line[80];
+	FILE *locks;
+	bool found = false;
+
+	locks = fopen("/proc/locks", "r");
+
+	while (fgets(line, 80, locks)) {
+		char *p;
+		int type, start, l;
+
+		/* eg. 1: FLOCK  ADVISORY  WRITE 2440 08:01:2180826 0 EOF */
+		p = strchr(line, ':') + 1;
+		if (strncmp(p, " POSIX  ADVISORY  ", strlen(" POSIX  ADVISORY  ")))
+			continue;
+		p += strlen(" FLOCK  ADVISORY  ");
+		if (strncmp(p, "READ  ", strlen("READ  ")) == 0)
+			type = F_RDLCK;
+		else if (strncmp(p, "WRITE ", strlen("WRITE ")) == 0)
+			type = F_WRLCK;
+		else
+			abort();
+		p += 6;
+		if (atoi(p) != getpid())
+			continue;
+		p = strchr(strchr(p, ' ') + 1, ' ') + 1;
+		start = atoi(p);
+		p = strchr(p, ' ') + 1;
+		if (strncmp(p, "EOF", 3) == 0)
+			l = 0;
+		else
+			l = atoi(p) - start + 1;
+
+		if (off == start) {
+			if (len != l) {
+				fprintf(stderr, "Len %u should be %u: %s",
+					(int)len, l, line);
+				abort();
+			}
+			if (type != rw) {
+				fprintf(stderr, "Type %s wrong: %s",
+					rw == F_RDLCK ? "READ" : "WRITE", line);
+				abort();
+			}
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		fprintf(stderr, "Unlock on %u@%u not found!\n",
+			(int)off, (int)len);
+		abort();
+	}
+
+	fclose(locks);
+#endif
+
+	fl.l_type = F_UNLCK;
+	fl.l_whence = SEEK_SET;
+	fl.l_start = off;
+	fl.l_len = len;
+	fl.l_pid = 0;
+
+	return fcntl(tdb->fd, F_SETLKW, &fl);
+}
+
+/* list -1 is the alloc list, otherwise a hash chain. */
+static tdb_off_t lock_offset(int list)
+{
+	return FREELIST_TOP + 4*list;
+}
+
 /* a byte range locking function - return 0 on success
    this functions locks/unlocks 1 byte at the specified offset.
 
@@ -42,30 +133,36 @@ void tdb_setalarm_sigptr(struct tdb_context *tdb, volatile sig_atomic_t *ptr)
 
    note that a len of zero means lock to end of file
 */
-int tdb_brlock(struct tdb_context *tdb, tdb_off_t offset, 
-	       int rw_type, int lck_type, int probe, size_t len)
+int tdb_brlock(struct tdb_context *tdb,
+	       int rw_type, tdb_off_t offset, size_t len,
+	       enum tdb_lock_flags flags)
 {
-	struct flock fl;
 	int ret;
 
 	if (tdb->flags & TDB_NOLOCK) {
 		return 0;
 	}
 
+	if (flags & TDB_LOCK_MARK_ONLY) {
+		return 0;
+	}
+
 	if ((rw_type == F_WRLCK) && (tdb->read_only || tdb->traverse_read)) {
 		tdb->ecode = TDB_ERR_RDONLY;
 		return -1;
 	}
 
-	fl.l_type = rw_type;
-	fl.l_whence = SEEK_SET;
-	fl.l_start = offset;
-	fl.l_len = len;
-	fl.l_pid = 0;
+	/* Sanity check */
+	if (tdb->transaction && offset >= lock_offset(-1) && len != 0) {
+		tdb->ecode = TDB_ERR_RDONLY;
+		TDB_LOG((tdb, TDB_DEBUG_TRACE, "tdb_brlock attempted in transaction at offset %d rw_type=%d flags=%d len=%d\n",
+			 offset, rw_type, flags, (int)len));
+		return -1;
+	}
 
 	do {
-		ret = fcntl(tdb->fd,lck_type,&fl);
-
+		ret = fcntl_lock(tdb, rw_type, offset, len,
+				 flags & TDB_LOCK_WAIT);
 		/* Check for a sigalarm break. */
 		if (ret == -1 && errno == EINTR &&
 				tdb->interrupt_sig_ptr &&
@@ -79,15 +176,34 @@ int tdb_brlock(struct tdb_context *tdb, tdb_off_t offset,
 		/* Generic lock error. errno set by fcntl.
 		 * EAGAIN is an expected return from non-blocking
 		 * locks. */
-		if (!probe && lck_type != F_SETLK) {
-			TDB_LOG((tdb, TDB_DEBUG_TRACE,"tdb_brlock failed (fd=%d) at offset %d rw_type=%d lck_type=%d len=%d\n", 
-				 tdb->fd, offset, rw_type, lck_type, (int)len));
+		if (!(flags & TDB_LOCK_PROBE) && errno != EAGAIN) {
+			TDB_LOG((tdb, TDB_DEBUG_TRACE,"tdb_brlock failed (fd=%d) at offset %d rw_type=%d flags=%d len=%d\n",
+				 tdb->fd, offset, rw_type, flags, (int)len));
 		}
 		return -1;
 	}
 	return 0;
 }
 
+int tdb_brunlock(struct tdb_context *tdb,
+		 int rw_type, tdb_off_t offset, size_t len)
+{
+	int ret;
+
+	if (tdb->flags & TDB_NOLOCK) {
+		return 0;
+	}
+
+	do {
+		ret = fcntl_unlock(tdb, rw_type, offset, len);
+	} while (ret == -1 && errno == EINTR);
+
+	if (ret == -1) {
+		TDB_LOG((tdb, TDB_DEBUG_TRACE,"tdb_brunlock failed (fd=%d) at offset %d rw_type=%d len=%d\n",
+			 tdb->fd, offset, rw_type, (int)len));
+	}
+	return ret;
+}
 
 /*
   upgrade a read lock to a write lock. This needs to be handled in a
@@ -95,12 +211,29 @@ int tdb_brlock(struct tdb_context *tdb, tdb_off_t offset,
   deadlock detection and claim a deadlock when progress can be
   made. For those OSes we may loop for a while.  
 */
-int tdb_brlock_upgrade(struct tdb_context *tdb, tdb_off_t offset, size_t len)
+int tdb_allrecord_upgrade(struct tdb_context *tdb)
 {
 	int count = 1000;
+
+	if (tdb->allrecord_lock.count != 1) {
+		TDB_LOG((tdb, TDB_DEBUG_ERROR,
+			 "tdb_allrecord_upgrade failed: count %u too high\n",
+			 tdb->allrecord_lock.count));
+		return -1;
+	}
+
+	if (tdb->allrecord_lock.off != 1) {
+		TDB_LOG((tdb, TDB_DEBUG_ERROR,
+			 "tdb_allrecord_upgrade failed: already upgraded?\n"));
+		return -1;
+	}
+
 	while (count--) {
 		struct timeval tv;
-		if (tdb_brlock(tdb, offset, F_WRLCK, F_SETLKW, 1, len) == 0) {
+		if (tdb_brlock(tdb, F_WRLCK, FREELIST_TOP, 0,
+			       TDB_LOCK_WAIT|TDB_LOCK_PROBE) == 0) {
+			tdb->allrecord_lock.ltype = F_WRLCK;
+			tdb->allrecord_lock.off = 0;
 			return 0;
 		}
 		if (errno != EDEADLK) {
@@ -111,57 +244,46 @@ int tdb_brlock_upgrade(struct tdb_context *tdb, tdb_off_t offset, size_t len)
 		tv.tv_usec = 1;
 		select(0, NULL, NULL, NULL, &tv);
 	}
-	TDB_LOG((tdb, TDB_DEBUG_TRACE,"tdb_brlock_upgrade failed at offset %d\n", offset));
+	TDB_LOG((tdb, TDB_DEBUG_TRACE,"tdb_allrecord_upgrade failed\n"));
 	return -1;
 }
 
-
-/* lock a list in the database. list -1 is the alloc list */
-static int _tdb_lock(struct tdb_context *tdb, int list, int ltype, int op)
+static struct tdb_lock_type *find_nestlock(struct tdb_context *tdb,
+					   tdb_off_t offset)
 {
-	struct tdb_lock_type *new_lck;
-	int i;
-	bool mark_lock = ((ltype & TDB_MARK_LOCK) == TDB_MARK_LOCK);
-
-	ltype &= ~TDB_MARK_LOCK;
+	unsigned int i;
 
-	/* a global lock allows us to avoid per chain locks */
-	if (tdb->global_lock.count && 
-	    (ltype == tdb->global_lock.ltype || ltype == F_RDLCK)) {
-		return 0;
+	for (i=0; i<tdb->num_lockrecs; i++) {
+		if (tdb->lockrecs[i].off == offset) {
+			return &tdb->lockrecs[i];
+		}
 	}
+	return NULL;
+}
 
-	if (tdb->global_lock.count) {
-		tdb->ecode = TDB_ERR_LOCK;
-		return -1;
-	}
+/* lock an offset in the database. */
+int tdb_nest_lock(struct tdb_context *tdb, uint32_t offset, int ltype,
+		  enum tdb_lock_flags flags)
+{
+	struct tdb_lock_type *new_lck;
 
-	if (list < -1 || list >= (int)tdb->header.hash_size) {
+	if (offset >= lock_offset(tdb->header.hash_size)) {
 		tdb->ecode = TDB_ERR_LOCK;
-		TDB_LOG((tdb, TDB_DEBUG_ERROR,"tdb_lock: invalid list %d for ltype=%d\n", 
-			   list, ltype));
+		TDB_LOG((tdb, TDB_DEBUG_ERROR,"tdb_lock: invalid offset %u for ltype=%d\n",
+			 offset, ltype));
 		return -1;
 	}
 	if (tdb->flags & TDB_NOLOCK)
 		return 0;
 
-	for (i=0; i<tdb->num_lockrecs; i++) {
-		if (tdb->lockrecs[i].list == list) {
-			if (tdb->lockrecs[i].count == 0) {
-				/*
-				 * Can't happen, see tdb_unlock(). It should
-				 * be an assert.
-				 */
-				TDB_LOG((tdb, TDB_DEBUG_ERROR, "tdb_lock: "
-					 "lck->count == 0 for list %d", list));
-			}
-			/*
-			 * Just increment the in-memory struct, posix locks
-			 * don't stack.
-			 */
-			tdb->lockrecs[i].count++;
-			return 0;
-		}
+	new_lck = find_nestlock(tdb, offset);
+	if (new_lck) {
+		/*
+		 * Just increment the in-memory struct, posix locks
+		 * don't stack.
+		 */
+		new_lck->count++;
+		return 0;
 	}
 
 	new_lck = (struct tdb_lock_type *)realloc(
@@ -175,27 +297,89 @@ static int _tdb_lock(struct tdb_context *tdb, int list, int ltype, int op)
 
 	/* Since fcntl locks don't nest, we do a lock for the first one,
 	   and simply bump the count for future ones */
-	if (!mark_lock &&
-	    tdb->methods->tdb_brlock(tdb,FREELIST_TOP+4*list, ltype, op,
-				     0, 1)) {
+	if (tdb_brlock(tdb, ltype, offset, 1, flags)) {
 		return -1;
 	}
 
-	tdb->num_locks++;
-
-	tdb->lockrecs[tdb->num_lockrecs].list = list;
+	tdb->lockrecs[tdb->num_lockrecs].off = offset;
 	tdb->lockrecs[tdb->num_lockrecs].count = 1;
 	tdb->lockrecs[tdb->num_lockrecs].ltype = ltype;
-	tdb->num_lockrecs += 1;
+	tdb->num_lockrecs++;
 
 	return 0;
 }
 
+static int tdb_lock_and_recover(struct tdb_context *tdb)
+{
+	int ret;
+
+	/* We need to match locking order in transaction commit. */
+	if (tdb_brlock(tdb, F_WRLCK, FREELIST_TOP, 0, TDB_LOCK_WAIT)) {
+		return -1;
+	}
+
+	if (tdb_brlock(tdb, F_WRLCK, OPEN_LOCK, 1, TDB_LOCK_WAIT)) {
+		tdb_brunlock(tdb, F_WRLCK, FREELIST_TOP, 0);
+		return -1;
+	}
+
+	ret = tdb_transaction_recover(tdb);
+
+	tdb_brunlock(tdb, F_WRLCK, OPEN_LOCK, 1);
+	tdb_brunlock(tdb, F_WRLCK, FREELIST_TOP, 0);
+
+	return ret;
+}
+
+static bool have_data_locks(const struct tdb_context *tdb)
+{
+	unsigned int i;
+
+	for (i = 0; i < tdb->num_lockrecs; i++) {
+		if (tdb->lockrecs[i].off >= lock_offset(-1))
+			return true;
+	}
+	return false;
+}
+
+static int tdb_lock_list(struct tdb_context *tdb, int list, int ltype,
+			 enum tdb_lock_flags waitflag)
+{
+	int ret;
+	bool check = false;
+
+	/* a allrecord lock allows us to avoid per chain locks */
+	if (tdb->allrecord_lock.count &&
+	    (ltype == tdb->allrecord_lock.ltype || ltype == F_RDLCK)) {
+		return 0;
+	}
+
+	if (tdb->allrecord_lock.count) {
+		tdb->ecode = TDB_ERR_LOCK;
+		ret = -1;
+	} else {
+		/* Only check when we grab first data lock. */
+		check = !have_data_locks(tdb);
+		ret = tdb_nest_lock(tdb, lock_offset(list), ltype, waitflag);
+
+		if (ret == 0 && check && tdb_needs_recovery(tdb)) {
+			tdb_nest_unlock(tdb, lock_offset(list), ltype, false);
+
+			if (tdb_lock_and_recover(tdb) == -1) {
+				return -1;
+			}
+			return tdb_lock_list(tdb, list, ltype, waitflag);
+		}
+	}
+	return ret;
+}
+
 /* lock a list in the database. list -1 is the alloc list */
 int tdb_lock(struct tdb_context *tdb, int list, int ltype)
 {
 	int ret;
-	ret = _tdb_lock(tdb, list, ltype, F_SETLKW);
+
+	ret = tdb_lock_list(tdb, list, ltype, TDB_LOCK_WAIT);
 	if (ret) {
 		TDB_LOG((tdb, TDB_DEBUG_ERROR, "tdb_lock failed on list %d "
 			 "ltype=%d (%s)\n",  list, ltype, strerror(errno)));
@@ -206,49 +390,26 @@ int tdb_lock(struct tdb_context *tdb, int list, int ltype)
 /* lock a list in the database. list -1 is the alloc list. non-blocking lock */
 int tdb_lock_nonblock(struct tdb_context *tdb, int list, int ltype)
 {
-	return _tdb_lock(tdb, list, ltype, F_SETLK);
+	return tdb_lock_list(tdb, list, ltype, TDB_LOCK_NOWAIT);
 }
 
 
-/* unlock the database: returns void because it's too late for errors. */
-	/* changed to return int it may be interesting to know there
-	   has been an error  --simo */
-int tdb_unlock(struct tdb_context *tdb, int list, int ltype)
+int tdb_nest_unlock(struct tdb_context *tdb, uint32_t offset, int ltype,
+		    bool mark_lock)
 {
 	int ret = -1;
-	int i;
-	struct tdb_lock_type *lck = NULL;
-	bool mark_lock = ((ltype & TDB_MARK_LOCK) == TDB_MARK_LOCK);
-
-	ltype &= ~TDB_MARK_LOCK;
-
-	/* a global lock allows us to avoid per chain locks */
-	if (tdb->global_lock.count && 
-	    (ltype == tdb->global_lock.ltype || ltype == F_RDLCK)) {
-		return 0;
-	}
-
-	if (tdb->global_lock.count) {
-		tdb->ecode = TDB_ERR_LOCK;
-		return -1;
-	}
+	struct tdb_lock_type *lck;
 
 	if (tdb->flags & TDB_NOLOCK)
 		return 0;
 
 	/* Sanity checks */
-	if (list < -1 || list >= (int)tdb->header.hash_size) {
-		TDB_LOG((tdb, TDB_DEBUG_ERROR, "tdb_unlock: list %d invalid (%d)\n", list, tdb->header.hash_size));
+	if (offset >= lock_offset(tdb->header.hash_size)) {
+		TDB_LOG((tdb, TDB_DEBUG_ERROR, "tdb_unlock: offset %u invalid (%d)\n", offset, tdb->header.hash_size));
 		return ret;
 	}
 
-	for (i=0; i<tdb->num_lockrecs; i++) {
-		if (tdb->lockrecs[i].list == list) {
-			lck = &tdb->lockrecs[i];
-			break;
-		}
-	}
-
+	lck = find_nestlock(tdb, offset);
 	if ((lck == NULL) || (lck->count == 0)) {
 		TDB_LOG((tdb, TDB_DEBUG_ERROR, "tdb_unlock: count is 0\n"));
 		return -1;
@@ -269,20 +430,14 @@ int tdb_unlock(struct tdb_context *tdb, int list, int ltype)
 	if (mark_lock) {
 		ret = 0;
 	} else {
-		ret = tdb->methods->tdb_brlock(tdb, FREELIST_TOP+4*list, F_UNLCK,
-					       F_SETLKW, 0, 1);
+		ret = tdb_brunlock(tdb, ltype, offset, 1);
 	}
-	tdb->num_locks--;
 
 	/*
 	 * Shrink the array by overwriting the element just unlocked with the
 	 * last array element.
 	 */
-
-	if (tdb->num_lockrecs > 1) {
-		*lck = tdb->lockrecs[tdb->num_lockrecs-1];
-	}
-	tdb->num_lockrecs -= 1;
+	*lck = tdb->lockrecs[--tdb->num_lockrecs];
 


-- 
Samba Shared Repository


More information about the samba-cvs mailing list