[PATCH] [BUG# 12858] Re: [WIP][PATCH] Safe LDB locking for better replication

Andrew Bartlett abartlet at samba.org
Sun Jun 25 22:29:54 UTC 2017


On Fri, 2017-06-23 at 12:45 +0200, Stefan Metzmacher wrote:
> Am 23.06.2017 um 04:44 schrieb Andrew Bartlett via samba-technical:
> > On Fri, 2017-06-23 at 09:25 +1200, Andrew Bartlett via samba-technical
> > wrote:
> > > On Thu, 2017-06-22 at 11:08 +0200, Stefan Metzmacher wrote:
> > > > Am 22.06.2017 um 10:36 schrieb Andrew Bartlett:
> > > > > 
> > > > > Without wishing to derail this, I would note that this does change the
> > > > > ldb module ABI (which we never promised anyway),
> > > > 
> > > > I already changed the commit to be ldb-1.2.0.
> > > > 
> > > > > and also is not
> > > > > backwards compatible with Samba < 4.7.
> > > > 
> > > > Why and what to you mean exactly?
> > > > Building Samba 4.6 with the new ldb version?
> > > 
> > > Yes.
> > > 
> > > For the list archives: If you use an old Samba (or old anything,
> > > without a hack forcing an override) and don't rebuild, they will fail
> > > to operate as we have a check.  If you override, then you will supply a
> > > too short modules structure.  But you know all this.
> > > 
> > > So the answer is yes, Samba 4.6 will build fine, but then fail as
> > > below:
> > > 
> > > > > It is likely to trigger deadlock detected "failed to upgrade hash
> > > > > locks", LDB_ERR_BUSY on earlier Samba versions, as they came to not
> > > > > expect locking (and the required lock ordering) to be enforced on
> > > > > ldb_tdb.
> > > > 
> > > > Can you give an example commit that is missing in earlier Samba
> > > > versions?
> > > 
> > > commit 10e7c749e7b7a4155669c6ecf97a9ac52b13110a
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date:   Wed May 31 12:22:28 2017 +1200
> > > 
> > >     dsdb: Correctly call ldb_module_done in dsdb_notification
> > > 
> > > For the 1.1.30 ldb version, the even context changes require this in
> > > Samba (less important, only hits -M single):
> > > 
> > > commit 7259661467776a76c4fa3aabaf1ae8a3d531e506
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date:   Fri May 12 01:55:45 2017 +0200
> > > 
> > >     dsdb: Use ldb_handle_use_global_event_context for rootdse modifies
> > >     
> > >     The modify operations on the rootDSE turn into IRPC messages, and
> > > these need
> > >     to be handled on the global event context, not the per-operation
> > > context
> > > 
> > > 
> > > And this probably only fails in make test and causes noise (tried to do
> > > transaction writes during a search):
> > > 
> > > commit 5f0e53f1b90369c649688122c0a8742352f13877
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date:   Wed May 3 22:53:14 2017 +0200
> > > 
> > >     dsdb: Do not write the @INDEXLIST or @ATTRIBUTES records during
> > > schema refresh
> > >     
> > >     Instead, write it once in the module init, if required, and after a
> > >     modify to the schema partition is detected
> > > 
> > > 
> > > But this will fail in normal operation:
> > > 
> > > commit b8ba0103bf45670c31384c56d6cd63bbef760a0c
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date:   Wed May 3 22:51:30 2017 +0200
> > > 
> > >     dsdb: Take out the transaction and prepare_commit locks in the same
> > > order
> > >     
> > >     We must, when starting the transaction and preparing to commit the
> > > transaction, take
> > >     out the locks in the same order across all the databases, otherwise
> > > we may deadlock.
> > 
> > Giving this reasonable question, I do wish I had tagged the bugs, but I
> > didn't.  However I've filed a clear bug now, so please add 
> > 
> > BUG: https://bugzilla.samba.org/show_bug.cgi?id=12858
> > 
> > to any commits not yet in master.
> > 
> > > > > Naturally, this is balanced by the fact that we absolutely have to
> > > > > merge this, as the read and replication corruption it caused has been
> > > > > observed in the real world, not just our selftest.
> > > > 
> > > > Yes!
> > > > 
> > > > I've looked through the patches and found two things:
> > > > 
> > > > 1)
> > > >  "ldb: Add test encoding current locking behaviour during ldb_search()"
> > > >  should include the lib/ldb/tests/ldb_mod_op_test.c changes
> > > >  from "ldb: Lock the whole backend database for the duration of a search"
> > > >  Or I'm I missing something?
> > > 
> > > No, what I tried to do is to show what was the behaviour, then fix the
> > > test for the new behaviour when fixing the code.
> > 
> > To be clear, the idea is that the tests do pass in each commit.
> 
> I changed it down to just the basic change that
> 
> assert_int_equal(res_count, 1);
> =>
> assert_int_equal(res_count, 0);
> 
> everything else is squashed to the initial commit.
> 
> > > I would have done the knownfail trick but we don't have that for ldb.
> > > 
> > > > 2)
> > > >  "dsdb: Teach the Samba partition module how to lock all the DB backends"
> > > >  got the unlocking order wrong, which must match
> > > >  {end,del}_trans() instead of read_lock().
> > > 
> > > Well spotted, I agree we want to unlock sam.ldb last.
> > > 
> > > >  The key is that the read locks on the individual partitions
> > > >  are purely a performance optimization to avoid
> > > >  fcntl syscalls, as these looks are never contented,
> > > >  because if the "BIG-Lock" on the metadata (sam.ldb).
> > > > 
> > > > I'm starting private autobuilds with the attached patchset.
> 
> 3 autobuilds passed fine.
> 
> However I looked closer at this commit
> "dsdb: Add test showing that the whole DB (including partitions) is
> locked during a search"
> https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=07afa3c48a1f
> 
> And I noticed that the code is not testing what the commit message says.
> It's testing that a search can't start while a prepared transaction is
> in progress.
> 
> I changed the test_db_lock1() to use a full SamDB in the child
> instead of going directly to the raw sam.ldb backend.
> As this simulates the real world.
> 
> I added tests which really test if the prepare_commit is blocked by
> a search. And that test pass with the full patchset.
> 
> But it fails unexpected without the "dsdb: Teach the Samba partition
> module" commit (or if I remove the per partition locks from it).
> 
> UNEXPECTED(failure):
> samba.tests.dsdb.samba.tests.dsdb.DsdbTests.test_db_lock2(ad_dc_ntvfs:local)
> REASON: Exception: Exception: Traceback (most recent call last):
>   File
> "/home/metze/devel/samba/4.0/master4-drsuapi/bin/python/samba/tests/dsdb.py",
> line 240, in test_db_lock2
>     self.assertGreater(end - start, 1.9)
> AssertionError: 0.0003879070281982422 not greater than 1.9
> 
> So it seems that the read_lock doesn't reach the tdb backend for sam.ldb
> 
> Can you please debug this?
> We need to understand this behavior and fix it.

I've taken some more time than I had on Saturday to look over this more
carefully:

I think the issue is that the test isn't able to create a write lock on
sam.ldb because it only makes a modification to
DC=SAMBA,DC=EXAMPLE,DC=COM.ldb and this means that 

_tdb_transaction_prepare_commit() will just do:

	/* check for a null transaction */
	if (tdb->transaction->blocks == NULL) {
		return 0;
	}

To test what you want to test, we should add two more additional tests,
that confirm that adding an @ record to sam.ldb operates correctly
(these go into the top partition), and adding a record to the
CN=CONFIGURUATION backend db blocks a search, showing that it isn't
just the traverse search read lock that is blocking it.  (It isn't, but
proof would be nice). 

My other comments around the assert() calls in the child still hold,
but these were not what was breaking it.  

I expect to get back to you later today with these improved tests.

Thanks!

Andrew Bartlett

-- 
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba




More information about the samba-technical mailing list