[PATCH] [BUG# 12858] Re: [WIP][PATCH] Safe LDB locking for better replication

Andrew Bartlett abartlet at samba.org
Fri Jun 23 20:21:14 UTC 2017


On Fri, 2017-06-23 at 12:45 +0200, Stefan Metzmacher wrote:
> Am 23.06.2017 um 04:44 schrieb Andrew Bartlett via samba-technical:
> > On Fri, 2017-06-23 at 09:25 +1200, Andrew Bartlett via samba-technical
> > wrote:
> > > On Thu, 2017-06-22 at 11:08 +0200, Stefan Metzmacher wrote:
> > > > Am 22.06.2017 um 10:36 schrieb Andrew Bartlett:
> > > > > 
> > > > > Without wishing to derail this, I would note that this does change the
> > > > > ldb module ABI (which we never promised anyway),
> > > > 
> > > > I already changed the commit to be ldb-1.2.0.
> > > > 
> > > > > and also is not
> > > > > backwards compatible with Samba < 4.7.
> > > > 
> > > > Why and what to you mean exactly?
> > > > Building Samba 4.6 with the new ldb version?
> > > 
> > > Yes.
> > > 
> > > For the list archives: If you use an old Samba (or old anything,
> > > without a hack forcing an override) and don't rebuild, they will fail
> > > to operate as we have a check.  If you override, then you will supply a
> > > too short modules structure.  But you know all this.
> > > 
> > > So the answer is yes, Samba 4.6 will build fine, but then fail as
> > > below:
> > > 
> > > > > It is likely to trigger deadlock detected "failed to upgrade hash
> > > > > locks", LDB_ERR_BUSY on earlier Samba versions, as they came to not
> > > > > expect locking (and the required lock ordering) to be enforced on
> > > > > ldb_tdb.
> > > > 
> > > > Can you give an example commit that is missing in earlier Samba
> > > > versions?
> > > 
> > > commit 10e7c749e7b7a4155669c6ecf97a9ac52b13110a
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date:   Wed May 31 12:22:28 2017 +1200
> > > 
> > >     dsdb: Correctly call ldb_module_done in dsdb_notification
> > > 
> > > For the 1.1.30 ldb version, the even context changes require this in
> > > Samba (less important, only hits -M single):
> > > 
> > > commit 7259661467776a76c4fa3aabaf1ae8a3d531e506
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date:   Fri May 12 01:55:45 2017 +0200
> > > 
> > >     dsdb: Use ldb_handle_use_global_event_context for rootdse modifies
> > >     
> > >     The modify operations on the rootDSE turn into IRPC messages, and
> > > these need
> > >     to be handled on the global event context, not the per-operation
> > > context
> > > 
> > > 
> > > And this probably only fails in make test and causes noise (tried to do
> > > transaction writes during a search):
> > > 
> > > commit 5f0e53f1b90369c649688122c0a8742352f13877
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date:   Wed May 3 22:53:14 2017 +0200
> > > 
> > >     dsdb: Do not write the @INDEXLIST or @ATTRIBUTES records during
> > > schema refresh
> > >     
> > >     Instead, write it once in the module init, if required, and after a
> > >     modify to the schema partition is detected
> > > 
> > > 
> > > But this will fail in normal operation:
> > > 
> > > commit b8ba0103bf45670c31384c56d6cd63bbef760a0c
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date:   Wed May 3 22:51:30 2017 +0200
> > > 
> > >     dsdb: Take out the transaction and prepare_commit locks in the same
> > > order
> > >     
> > >     We must, when starting the transaction and preparing to commit the
> > > transaction, take
> > >     out the locks in the same order across all the databases, otherwise
> > > we may deadlock.
> > 
> > Giving this reasonable question, I do wish I had tagged the bugs, but I
> > didn't.  However I've filed a clear bug now, so please add 
> > 
> > BUG: https://bugzilla.samba.org/show_bug.cgi?id=12858
> > 
> > to any commits not yet in master.
> > 
> > > > > Naturally, this is balanced by the fact that we absolutely have to
> > > > > merge this, as the read and replication corruption it caused has been
> > > > > observed in the real world, not just our selftest.
> > > > 
> > > > Yes!
> > > > 
> > > > I've looked through the patches and found two things:
> > > > 
> > > > 1)
> > > >  "ldb: Add test encoding current locking behaviour during ldb_search()"
> > > >  should include the lib/ldb/tests/ldb_mod_op_test.c changes
> > > >  from "ldb: Lock the whole backend database for the duration of a search"
> > > >  Or I'm I missing something?
> > > 
> > > No, what I tried to do is to show what was the behaviour, then fix the
> > > test for the new behaviour when fixing the code.
> > 
> > To be clear, the idea is that the tests do pass in each commit.
> 
> I changed it down to just the basic change that
> 
> assert_int_equal(res_count, 1);
> =>
> assert_int_equal(res_count, 0);
> 
> everything else is squashed to the initial commit.

As long as you tested both, I'm happy.

> > > I would have done the knownfail trick but we don't have that for ldb.
> > > 
> > > > 2)
> > > >  "dsdb: Teach the Samba partition module how to lock all the DB backends"
> > > >  got the unlocking order wrong, which must match
> > > >  {end,del}_trans() instead of read_lock().
> > > 
> > > Well spotted, I agree we want to unlock sam.ldb last.
> > > 
> > > >  The key is that the read locks on the individual partitions
> > > >  are purely a performance optimization to avoid
> > > >  fcntl syscalls, as these looks are never contented,
> > > >  because if the "BIG-Lock" on the metadata (sam.ldb).
> > > > 
> > > > I'm starting private autobuilds with the attached patchset.
> 
> 3 autobuilds passed fine.
> 
> However I looked closer at this commit
> "dsdb: Add test showing that the whole DB (including partitions) is
> locked during a search"
> https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=07afa3c48a1f
> 
> And I noticed that the code is not testing what the commit message says.
> It's testing that a search can't start while a prepared transaction is
> in progress.

That was deliberate.  I need a write lock on a backend DB.  That will
block a read lock on the backend DB, and a search, if chained to the
backend db.

> I changed the test_db_lock1() to use a full SamDB in the child
> instead of going directly to the raw sam.ldb backend.
> As this simulates the real world.

Sure, but simulating the real world wasn't what I was after: if you
lock the whole DB then missing the lock checking down to the backend
won't be noticed.

Otherwise, it is really hard to tell if we really do get a read lock on
the backend DB.

> I added tests which really test if the prepare_commit is blocked by
> a search. And that test pass with the full patchset.
> 
> But it fails unexpected without the "dsdb: Teach the Samba partition
> module" commit (or if I remove the per partition locks from it).
> 
> UNEXPECTED(failure):
> samba.tests.dsdb.samba.tests.dsdb.DsdbTests.test_db_lock2(ad_dc_ntvfs:local)
> REASON: Exception: Exception: Traceback (most recent call last):
>   File
> "/home/metze/devel/samba/4.0/master4-drsuapi/bin/python/samba/tests/dsdb.py",
> line 240, in test_db_lock2
>     self.assertGreater(end - start, 1.9)
> AssertionError: 0.0003879070281982422 not greater than 1.9
> 
> So it seems that the read_lock doesn't reach the tdb backend for sam.ldb

No, I think it means you wrote a test that doesn't prove what you think
it was proving.

Looking at the patch, I think you have your timing code in the wrong
process.  In any case, calling anything that gets between the fork == 0
and the os._exit(0) will cause pain, you have to do all that in the
parent, otherwise you end up with two tests running in parallel :-)

Sorry,

Andrew Bartlett
-- 
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba




More information about the samba-technical mailing list