[Samba] CTDB Question: external locking tool

Thu Oct 29 13:26:05 UTC 2020

Hi Martin, excellent write-up. This is a fun topic to discuss, and
important to our use-cases and architecture.

I've been re-reading the thread. Giving it more thought.

I agree, the consistency check really makes no sense to me. I think you
were suggesting similar thoughts?

I wanted to share a couple points before returning to the topic...

*About a better architecture, the topic we spoke of in private...*

We've spoken about CTDB/Samba adopting more modern clustering techniques (for
me this means RAFT); see [Canonical RAFT](https://github.com/canonical/raft).
In such a model I would think we would have two classes of nodes: electable
and non-electable. With regard to recovery and data safety, only the
leader-electable nodes would be required to be consistent.

From the point of view of integrity and availability, consensus would be
defined as a majority of electable nodes being reachable; non-electable
nodes do not count. Should any node find itself in the minority (due to
process failures or network partitions), it should shut down immediately
(again to ensure data integrity); this applies to non-electable and
electable nodes alike.
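
To make the majority rule concrete, here is a minimal sketch in Go. The `Node` type and `hasQuorum` function are names I made up for illustration; nothing like them exists in CTDB today.

```go
package main

import "fmt"

// Node is a hypothetical cluster member. Electable marks the nodes that
// may vote and lead; Reachable is this node's current view of the peer.
type Node struct {
	Name      string
	Electable bool
	Reachable bool
}

// hasQuorum reports whether a strict majority of the electable nodes is
// reachable. Non-electable nodes are ignored entirely.
func hasQuorum(nodes []Node) bool {
	electable, reachable := 0, 0
	for _, n := range nodes {
		if !n.Electable {
			continue
		}
		electable++
		if n.Reachable {
			reachable++
		}
	}
	return electable > 0 && reachable > electable/2
}

func main() {
	nodes := []Node{
		{"a", true, true},
		{"b", true, true},
		{"c", true, false},
		{"d", false, true},
		{"e", false, false},
	}
	if hasQuorum(nodes) {
		fmt.Println("majority of electable nodes reachable: keep serving")
	} else {
		fmt.Println("in the minority: shut down to protect data integrity")
	}
}
```

Every node, electable or not, would run the same check and stop serving as soon as it returns false.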

With this model in mind, the only nodes that should attempt to get a lock
are the electable nodes. Non-electable nodes certainly interact with
electable ones, but any distributed cluster state they hold is purely
ephemeral: they still ask the leaders for record locations, and whatever
they cache can be discarded. Possibly a non-electable node running the
locator code could, when it resolves a record, register for change
notifications on that record until the record ages out of its cache (LRU).
Electable nodes follow standard RAFT semantics: they vie for leadership,
perform recovery, etc. Nuances abound, such as what to do when one leader
has more current state than another: either disregard it and favor the most
recent data from a consensus point of view, or accept the leader with the
most up-to-date state as the true authority, let it proceed with the leader
lock, perform the recovery, and read-repair the other leaders.
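
For reference, standard Raft settles the "who has the most current state" question at election time: a node only grants its vote to a candidate whose log is at least as up to date as its own, comparing the term of the last entry first and the index second. A minimal sketch of that comparison follows; the type and function names are mine and are not the canonical/raft API.

```go
package main

import "fmt"

// LogPosition identifies the last entry in a node's log.
type LogPosition struct {
	Term  uint64
	Index uint64
}

// atLeastAsUpToDate applies Raft's voting rule: the later last term wins;
// if the last terms are equal, the longer log wins.
func atLeastAsUpToDate(candidate, voter LogPosition) bool {
	if candidate.Term != voter.Term {
		return candidate.Term > voter.Term
	}
	return candidate.Index >= voter.Index
}

func main() {
	voter := LogPosition{Term: 3, Index: 10}

	// Same term, longer log: the voter may grant its vote.
	fmt.Println(atLeastAsUpToDate(LogPosition{Term: 3, Index: 12}, voter)) // true

	// Older term, even with a much longer log: the vote is refused.
	fmt.Println(atLeastAsUpToDate(LogPosition{Term: 2, Index: 40}, voter)) // false
}
```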

*Returning to the current architecture...*

I was hoping that leadership semantics (lmaster + recmaster) would mean
that the nodes participating as such (there must be an odd number of them,
greater than 1) are the only ones EVER permitted to take a lock. That is,
lmaster+recmaster == "leader electable", otherwise not. On every
non-electable node the recovery lock code should not be called at all;
instead, the startup procedure for a "follower" is suspended (sleeps) until
a leader informs it that recovery has completed. This would really speak to
the ability to add new nodes for increased data safety and service
availability, but it requires the leadership semantics we've spoken of
privately.
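
In code, the behavior I was hoping for looks roughly like the sketch below. It describes the proposal, not what CTDB currently does, and `Capabilities` and `startupAction` are hypothetical names.

```go
package main

import "fmt"

// Capabilities mirrors the two per-node settings discussed in this thread.
type Capabilities struct {
	LMaster   bool
	RecMaster bool
}

// Electable is the proposed rule: lmaster+recmaster == "leader electable".
func (c Capabilities) Electable() bool {
	return c.LMaster && c.RecMaster
}

// startupAction returns what a node should do at startup under the proposal.
// Only electable nodes ever reach the recovery lock code.
func startupAction(c Capabilities, recoveryDone bool) string {
	switch {
	case c.Electable():
		return "contend for recovery lock"
	case !recoveryDone:
		return "sleep until a leader reports recovery complete"
	default:
		return "join cluster"
	}
}

func main() {
	leader := Capabilities{LMaster: true, RecMaster: true}
	follower := Capabilities{}

	fmt.Println(startupAction(leader, false))
	fmt.Println(startupAction(follower, false))
	fmt.Println(startupAction(follower, true))
}
```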

In your writeup you say,

*"At the beginning of each recovery each node checks that its recovery lock
setting is consistent with that of the recovery master"*

I take this to mean that the recovery lock entry in ctdb.conf must be
consistent. This is true, and I ran into it. I would be even more explicit
and say that "the recovery lock lines in ctdb.conf MUST MATCH on each
participating server."

I actually don't like this aspect of CTDB; I disagree with the choice.
Two strings may have different syntax but identical semantics, as is
frequently the case with configuration. In the case of CTDB, the command
line should be permitted to differ; for example, when using etcd, the
address of the nearest etcd node should be specified, especially if etcd is
run in a geo-distributed fashion. The work-around we chose was to put all
the configuration in a separate config.yaml file for our tool, where the
addresses can differ, and still satisfy CTDB's constraint that the lines be
identical. This check really has no value; for the naive user it makes
things idiot-proof, but for the seasoned team that has this all automated,
such errors rarely if ever occur and are easily identifiable.
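
To illustrate the syntax-versus-semantics point, here is a sketch of a comparison that ignores node-local arguments. The helper path, the `--key` and `--endpoint` flags, and the host names are all invented for the example; they are not our tool's real options, and CTDB has no such comparison today.

```go
package main

import (
	"fmt"
	"strings"
)

// lockIdentity drops node-local arguments from a lock setting so that two
// settings can be compared by meaning rather than byte for byte.
func lockIdentity(setting string, localFlags map[string]bool) string {
	var kept []string
	for _, arg := range strings.Fields(setting) {
		name := strings.SplitN(arg, "=", 2)[0]
		if localFlags[name] {
			continue
		}
		kept = append(kept, arg)
	}
	return strings.Join(kept, " ")
}

// sameLock treats the (hypothetical) --endpoint flag as node-local.
func sameLock(a, b string) bool {
	local := map[string]bool{"--endpoint": true}
	return lockIdentity(a, local) == lockIdentity(b, local)
}

func main() {
	nyc := "!/usr/local/bin/etcd-lock --key=/ctdb/reclock --endpoint=https://etcd-nyc:2379"
	lon := "!/usr/local/bin/etcd-lock --key=/ctdb/reclock --endpoint=https://etcd-lon:2379"

	fmt.Println(nyc == lon)         // false: the strings differ
	fmt.Println(sameLock(nyc, lon)) // true: same helper, same lock key
}
```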

You also said,

*"At the end of each recovery each node attempts to take the recovery lock
and expects this to fail (because the lock is already held by another
process). These checks are done unconditionally when the recovery lock is
set, without regard to node capabilities (see CAPABILITIES below)."*

Is this necessary? What would be the issue if non-electable nodes simply
waited until the end of recovery? I am guessing CTDB communicates with its
peers, and the recovering node, once done, informs all other nodes that it
is complete, up, and that status is normal (so they can continue with
joining the cluster). I don't understand the goal of this check when lmaster
and recmaster (leadership) are turned off.
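
As I understand the write-up, the end-of-recovery check amounts to "try to take the lock; succeeding is the error case". Below is a sketch of that logic with the skip I am asking about. The in-memory lock stands in for etcd, and all names are hypothetical rather than CTDB code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errHeld = errors.New("lock already held")

// memLock is an in-memory stand-in for the external (etcd-backed) lock.
type memLock struct {
	mu    sync.Mutex
	owner string
}

// TryLock takes the lock for who, or returns errHeld if someone has it.
func (l *memLock) TryLock(who string) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.owner != "" {
		return errHeld
	}
	l.owner = who
	return nil
}

// endOfRecoveryCheck attempts to take the lock and expects that to fail.
// Non-electable nodes skip the attempt, which is the proposed change.
func endOfRecoveryCheck(l *memLock, node string, electable bool) error {
	if !electable {
		return nil
	}
	if err := l.TryLock(node); err == nil {
		return fmt.Errorf("%s took the recovery lock, so nobody was holding it", node)
	}
	return nil
}

func main() {
	lock := &memLock{}
	_ = lock.TryLock("leader") // the leader holds the lock during recovery

	fmt.Println(endOfRecoveryCheck(lock, "node2", true))  // <nil>: lock is held, as expected
	fmt.Println(endOfRecoveryCheck(lock, "node3", false)) // <nil>: lock tool never invoked
}
```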

Would love to hear your thoughts, and we can do another Zoom to discuss
with Amitay if you wish.

Bob

On Thu, Oct 29, 2020 at 5:31 AM Martin Schwenke <martin at meltin.net> wrote:

> Hi Bob,
>
> On Tue, 27 Oct 2020 15:09:34 +1100, Martin Schwenke via samba
> <samba at lists.samba.org> wrote:
>
> > On Sun, 25 Oct 2020 20:44:07 -0400, Robert Buck <robert.buck at som.com>
> > wrote:
> >
> > > We use a Golang-based lock tool that we wrote for CTDB. That tool
> > > interacts with our 3.4 etcd cluster, and follows the requirements
> > > specified in the project.
> > >
> > > Question, does the external command line tool get called when LMASTER
> > > and RECMASTER are false? Given a scenario where we have a set of
> > > processes that have it set to false, then others that have it set to
> > > true, does the locking tool get called when they're set to false?
> >
> > Indeed it does.  There are 2 current things conspiring against you:
> >
> > * At the start of each recovery a recovery lock consistency check is
> >   done. Unfortunately, this means the recovery lock can't be left unset
> >   on nodes that do not have the recmaster capability because then the
> >   consistency check would fail.
> >
> > * At the end of recovery, if the recovery lock is set, all nodes will
> >   attempt to take the recovery lock and will expect to fail (on the
> >   leader/master too, since it is being taken from a different process on
> >   the leader/master).
> >
> >   This is meant to be a sanity check but, to be honest, I'm not sure
> >   whether it really adds any value.  A better option might be to only
> >   accept recovery-related controls from the current leader/master node,
> >   banning any other node that is stupid enough to send such a control.
> >
> > I need to think about this more...
> >
> > One of the problems is that the ideas of recovery master and recovery
> > lock are historical and they are somewhat dated compared to current
> > clustering concepts. Recovery master should really be "cluster leader"
> > and the lock should be "cluster lock".  If we clearly change our
> > approach in that direction then it makes no sense to check a cluster
> > lock at recovery time.
> >
> > I have a branch that does the technical (but not documentation) parts
> > of switching to cluster leader and lock... but more work is needed
> > before this is ready to merge.
> >
> > > IF you say the lock tool still gets called in both cases, then the docs
> > > need to be updated, and we on our end need to add a special config file
> > > option to reject lock acquisitions from those nodes that have the CTDB
> > > options set to false, permitting only those nodes set to true to
> > > acquire etcd locks.
> >
> > Well, the documentation (ctdb(7) manual page) does say:
> >
> >   CTDB does sanity checks to ensure that the recovery lock is held as
> >   expected.
> >
> > ;-)
> >
> > OK, that's pretty weak!
> >
> > I'll try to get some of Amitay's time to discuss what we should do
> > here...
>
> There are a few possible changes but none of them would really fix
> things properly.  We have a situation where the recovery lock (which used
> to be released at the end of each recovery but is now released on
> election loss) is almost a cluster lock, so we really shouldn't be
> sanity checking it at the end of recovery.  However, there's currently
> no other sane place to check it.
>
> So, as you say, the docs need to be updated.  Do you think it would be
> enough to add to the single sentence above from ctdb(7)?  Something
> like the following:
>
>   CTDB does some recovery lock sanity checks. At the beginning of
>   each recovery each node checks that its recovery lock setting is
>   consistent with that of the recovery master.  At the end of each
>   recovery each node attempts to take the recovery lock and expects
>   this to fail (because the lock is already held by another process).
>   These checks are done unconditionally when the recovery lock is
>   set, without regard to node capabilities (see CAPABILITIES below).
>
> How's that?
>
> Thanks...
>
> peace & happiness,
> martin
>
>

-- 

BOB BUCK
SENIOR PLATFORM SOFTWARE ENGINEER

SKIDMORE, OWINGS & MERRILL
7 WORLD TRADE CENTER
250 GREENWICH STREET
NEW YORK, NY 10007
T  (212) 298-9624
ROBERT.BUCK at SOM.COM