[PATCH] [BUG# 12858] Re: [WIP][PATCH] Safe LDB locking for better replication
Andrew Bartlett
abartlet at samba.org
Fri Jun 23 20:21:14 UTC 2017
On Fri, 2017-06-23 at 12:45 +0200, Stefan Metzmacher wrote:
> Am 23.06.2017 um 04:44 schrieb Andrew Bartlett via samba-technical:
> > On Fri, 2017-06-23 at 09:25 +1200, Andrew Bartlett via samba-technical
> > wrote:
> > > On Thu, 2017-06-22 at 11:08 +0200, Stefan Metzmacher wrote:
> > > > Am 22.06.2017 um 10:36 schrieb Andrew Bartlett:
> > > > >
> > > > > Without wishing to derail this, I would note that this does change the
> > > > > ldb module ABI (which we never promised anyway),
> > > >
> > > > I already changed the commit to be ldb-1.2.0.
> > > >
> > > > > and also is not
> > > > > backwards compatible with Samba < 4.7.
> > > >
> > > > Why and what to you mean exactly?
> > > > Building Samba 4.6 with the new ldb version?
> > >
> > > Yes.
> > >
> > > For the list archives: If you use an old Samba (or old anything,
> > > without a hack forcing an override) and don't rebuild, they will fail
> > > to operate as we have a check. If you override, then you will supply a
> > > too short modules structure. But you know all this.
> > >
> > > So the answer is yes, Samba 4.6 will build fine, but then fail as
> > > below:
> > >
> > > > > It is likely to trigger deadlock detected "failed to upgrade hash
> > > > > locks", LDB_ERR_BUSY on earlier Samba versions, as they came to not
> > > > > expect locking (and the required lock ordering) to be enforced on
> > > > > ldb_tdb.
> > > >
> > > > Can you give an example commit that is missing in earlier Samba
> > > > versions?
> > >
> > > commit 10e7c749e7b7a4155669c6ecf97a9ac52b13110a
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date: Wed May 31 12:22:28 2017 +1200
> > >
> > > dsdb: Correctly call ldb_module_done in dsdb_notification
> > >
> > > For the 1.1.30 ldb version, the even context changes require this in
> > > Samba (less important, only hits -M single):
> > >
> > > commit 7259661467776a76c4fa3aabaf1ae8a3d531e506
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date: Fri May 12 01:55:45 2017 +0200
> > >
> > > dsdb: Use ldb_handle_use_global_event_context for rootdse modifies
> > >
> > > The modify operations on the rootDSE turn into IRPC messages, and
> > > these need
> > > to be handled on the global event context, not the per-operation
> > > context
> > >
> > >
> > > And this probably only fails in make test and causes noise (tried to do
> > > transaction writes during a search):
> > >
> > > commit 5f0e53f1b90369c649688122c0a8742352f13877
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date: Wed May 3 22:53:14 2017 +0200
> > >
> > > dsdb: Do not write the @INDEXLIST or @ATTRIBUTES records during
> > > schema refresh
> > >
> > > Instead, write it once in the module init, if required, and after a
> > > modify to the schema partition is detected
> > >
> > >
> > > But this will fail in normal operation:
> > >
> > > commit b8ba0103bf45670c31384c56d6cd63bbef760a0c
> > > Author: Andrew Bartlett <abartlet at samba.org>
> > > Date: Wed May 3 22:51:30 2017 +0200
> > >
> > > dsdb: Take out the transaction and prepare_commit locks in the same
> > > order
> > >
> > > We must, when starting the transaction and preparing to commit the
> > > transaction, take
> > > out the locks in the same order across all the databases, otherwise
> > > we may deadlock.
> >
> > Giving this reasonable question, I do wish I had tagged the bugs, but I
> > didn't. However I've filed a clear bug now, so please add
> >
> > BUG: https://bugzilla.samba.org/show_bug.cgi?id=12858
> >
> > to any commits not yet in master.
> >
> > > > > Naturally, this is balanced by the fact that we absolutely have to
> > > > > merge this, as the read and replication corruption it caused has been
> > > > > observed in the real world, not just our selftest.
> > > >
> > > > Yes!
> > > >
> > > > I've looked through the patches and found two things:
> > > >
> > > > 1)
> > > > "ldb: Add test encoding current locking behaviour during ldb_search()"
> > > > should include the lib/ldb/tests/ldb_mod_op_test.c changes
> > > > from "ldb: Lock the whole backend database for the duration of a search"
> > > > Or I'm I missing something?
> > >
> > > No, what I tried to do is to show what was the behaviour, then fix the
> > > test for the new behaviour when fixing the code.
> >
> > To be clear, the idea is that the tests do pass in each commit.
>
> I changed it down to just the basic change that
>
> assert_int_equal(res_count, 1);
> =>
> assert_int_equal(res_count, 0);
>
> everything else is squashed to the initial commit.
As long as you tested both, I'm happy.
> > > I would have done the knownfail trick but we don't have that for ldb.
> > >
> > > > 2)
> > > > "dsdb: Teach the Samba partition module how to lock all the DB backends"
> > > > got the unlocking order wrong, which must match
> > > > {end,del}_trans() instead of read_lock().
> > >
> > > Well spotted, I agree we want to unlock sam.ldb last.
> > >
> > > > The key is that the read locks on the individual partitions
> > > > are purely a performance optimization to avoid
> > > > fcntl syscalls, as these looks are never contented,
> > > > because if the "BIG-Lock" on the metadata (sam.ldb).
> > > >
> > > > I'm starting private autobuilds with the attached patchset.
>
> 3 autobuilds passed fine.
>
> However I looked closer at this commit
> "dsdb: Add test showing that the whole DB (including partitions) is
> locked during a search"
> https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=07afa3c48a1f
>
> And I noticed that the code is not testing what the commit message says.
> It's testing that a search can't start while a prepared transaction is
> in progress.
That was deliberate. I need a write lock on a backend DB. That will
block a read lock on the backend DB, and a search, if chained to the
backend db.
> I changed the test_db_lock1() to use a full SamDB in the child
> instead of going directly to the raw sam.ldb backend.
> As this simulates the real world.
Sure, but simulating the real world wasn't what I was after: if you
lock the whole DB then missing the lock checking down to the backend
won't be noticed.
Otherwise, it is really hard to tell if we really do get a read lock on
the backend DB.
> I added tests which really test if the prepare_commit is blocked by
> a search. And that test pass with the full patchset.
>
> But it fails unexpected without the "dsdb: Teach the Samba partition
> module" commit (or if I remove the per partition locks from it).
>
> UNEXPECTED(failure):
> samba.tests.dsdb.samba.tests.dsdb.DsdbTests.test_db_lock2(ad_dc_ntvfs:local)
> REASON: Exception: Exception: Traceback (most recent call last):
> File
> "/home/metze/devel/samba/4.0/master4-drsuapi/bin/python/samba/tests/dsdb.py",
> line 240, in test_db_lock2
> self.assertGreater(end - start, 1.9)
> AssertionError: 0.0003879070281982422 not greater than 1.9
>
> So it seems that the read_lock doesn't reach the tdb backend for sam.ldb
No, I think it means you wrote a test that doesn't prove what you think
it was proving.
Looking at the patch, I think you have your timing code in the wrong
process. In any case, calling anything that gets between the fork == 0
and the os._exit(0) will cause pain, you have to do all that in the
parent, otherwise you end up with two tests running in parallel :-)
Sorry,
Andrew Bartlett
--
Andrew Bartlett http://samba.org/~abartlet/
Authentication Developer, Samba Team http://samba.org
Samba Developer, Catalyst IT http://catalyst.net.nz/services/samba
More information about the samba-technical
mailing list