[Samba] CTDB Question: external locking tool

Martin Schwenke martin at meltin.net
Tue Feb 9 02:12:55 UTC 2021


Hi Bob,

Once again, sorry to have taken so long to reply to this...

What you say makes a lot of sense.

As I think I've mentioned before, the current version of CTDB has
evolved to where it is now and could obviously benefit from some
restructuring and redesign.  We have descriptions of most of this out
there somewhere, mostly in conference presentations.

Comments:

* We would need to think carefully about bundling the thing we
  currently call the "lmaster capability" with the notion of a node
  being electable ("recmaster capability").  I know we mentioned this
  when we spoke.  It might make sense from a logical/performance
  viewpoint but I'd like to think about it carefully.  :-)

  Electability is a clustering attribute and the lmaster capability
  is a database attribute.  Something like RAFT does tend to tie those
  types of things together via consensus, and that may well be what we
  want in the long term for persistent/replicated databases, rather
  than the simple sequence numbers we use for those now.  The lmaster
  capability is only used for distributed databases, so we need to
  consider how separate to keep that.  This becomes more interesting
  when we start adding newer database models, such as partially
  replicated distributed databases that might be needed for the witness
  protocol.

* Right now the "electable" flag (recmaster capability) is per-node
  configuration.  If we start doing some type of quorum, and if we base
  this quorum on electable nodes, then that flag needs to go into the
  nodes file so that all nodes know which nodes are electable (a
  hypothetical sketch follows this list).  Obviously not a problem but
  something I hadn't fully realised until now... and important to
  remember when designing new things.

* I think we agree that the recovery lock consistency check is
  unnecessary.  We added this feature after some users wasted a lot of
  time with inconsistent configuration, so it isn't useless... just
  unnecessary.  Right now it can be worked around by using a wrapper
  around the cluster mutex helper (sketched below), so I'm not sure it
  is a priority.

* You are correct that only electable nodes should try to take the
  recovery lock.  In the longer term, when any such lock in use is
  considered to be a cluster lock, database recovery would have no
  business at all interacting with the lock.  However, the problem of
  non-electable nodes testing whether they can take the lock at the end
  of recovery can also be worked around by using a wrapper around the
  cluster mutex helper (i.e. the wrapper always fails; again, see the
  sketch below).
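
As an illustration of the second point above, the electability flag
could move from per-node configuration into the shared nodes file.
To be clear, the attribute syntax below is purely hypothetical; today
the nodes file only lists addresses:

    # nodes file -- hypothetical per-node attribute syntax
    192.168.1.1
    192.168.1.2
    192.168.1.3 electable=false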

For the last 2 points above...

I think the longer-term solution to these types of issues for new
users would be something like a "ctdb config check" command.  This
could do a lot of things, including checking the basic and nodes
configuration for consistency across the cluster.  If a
cluster/recovery lock is in use then this could also check that the
lock works.  It could fail with errors for things that are obviously
broken and warn/notice/info for other things.  A question is whether
we should spend time now removing these "features", or just encourage
working around them with wrappers.  Given the history, I'm somewhat
nervous about removing them without another mechanism (such as a
checker) in place.
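
For example, here is a rough sketch of such a wrapper, written in Go
since that is what Bob's lock tool uses.  This is an untested
illustration, not a supported implementation.  It relies on the
cluster mutex helper protocol (the helper writes '0' to stdout once
it holds the lock, and keeps running to hold it; it writes '1' if the
lock cannot be taken); the marker file and helper path are invented:

    // reclock-wrapper: sketch of a cluster mutex helper wrapper.
    package main

    import (
            "fmt"
            "os"
            "syscall"
    )

    const electableMarker = "/etc/ctdb/electable" // hypothetical

    func main() {
            if _, err := os.Stat(electableMarker); err != nil {
                    // Non-electable node: always report contention,
                    // so this node can never take the lock and the
                    // end-of-recovery check "fails" as expected.
                    fmt.Print("1")
                    return
            }
            // Electable node: replace this process with the real
            // helper (e.g. an etcd-based lock tool), passing all
            // arguments through.  Using exec keeps the pid that
            // CTDB signals to release the lock.
            helper := "/usr/local/bin/ctdb_etcd_lock" // hypothetical
            argv := append([]string{helper}, os.Args[1:]...)
            if err := syscall.Exec(helper, argv, os.Environ()); err != nil {
                    fmt.Fprintln(os.Stderr, "exec failed:", err)
                    os.Exit(1)
            }
    }

Every node then configures an identical recovery lock line, which
keeps the consistency check happy, while per-node differences stay
inside the wrapper:

    [cluster]
            recovery lock = !/usr/local/bin/reclock-wrapper

(The "!" prefix tells CTDB to run the rest as a mutex helper command
instead of doing fcntl locking on a file.)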

peace & happiness,
martin


On Thu, 29 Oct 2020 09:26:05 -0400, Robert Buck via samba
<samba at lists.samba.org> wrote:

> Hi Martin, excellent write-up. This is a fun topic to discuss, and
> important to our use-cases and architecture.
> 
> 
> 
> I've been re-reading the thread. Giving it more thought.
> 
> I agree, the consistency check really makes no sense to me. I think
> you were suggesting something similar?
> 
> I wanted to share a couple points before returning to the topic...
> 
> *About a better architecture, the topic we spoke of in private...*
> 
> We've spoken about CTDB/Samba adopting more modern clustering
> techniques (for me this means RAFT); see
> [Canonical RAFT](https://github.com/canonical/raft). In such a model
> I would think we would have two classes of nodes: electable and
> non-electable. With regard to recovery and data safety, only the
> leader-electable nodes would be required to be consistent.
> 
> From the point of view of integrity and availability, consensus would
> be defined as a majority of electable nodes being reachable, not of
> non-electable nodes. Should any node find itself in the minority (due
> to process failures or network partitions), it should shut down
> immediately (again, to ensure data integrity); this applies to
> electable and non-electable nodes alike.
> 
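> A minimal sketch of that quorum rule, in Go (the names here are
> illustrative, not any real CTDB or RAFT API):
> 
>     package quorum
> 
>     // Only electable nodes count toward the majority; with 2e+1
>     // electable nodes, up to e may fail or be partitioned away.
>     // Non-electable nodes never affect the count.
>     func hasQuorum(reachableElectable, totalElectable int) bool {
>             return 2*reachableElectable > totalElectable
>     }
> 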
> With this model in mind, the only nodes that should attempt to take a
> lock are the electable nodes. Non-electable nodes certainly interact
> with electable ones, but their distributed cluster state is purely
> ephemeral (they still interact with leaders to determine record
> locations, but any caching is purely ephemeral). Possibly, a
> non-electable node running locator code could, when a record is
> resolved, register itself for notifications of record changes until
> the records age out (LRU). Electable nodes follow standard RAFT
> semantics: they vie for leadership, perform recovery, etc. Nuances
> abound, such as when one leader actually has more current state than
> another (one can either disregard that and favor the most recent data
> from a consensus point of view, or accept the leader with the most
> up-to-date state as the true authority, letting it proceed with the
> leader lock, perform the recovery, and read-repair the other
> leaders).
> 
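> Roughly, the locator/cache idea in Go (all types are invented for
> illustration; this is not any existing CTDB interface):
> 
>     package locator
> 
>     import "container/list"
> 
>     type RecordKey string
>     type Location string
> 
>     // Leader is whatever lets a non-electable node resolve record
>     // locations and subscribe to change notifications.
>     type Leader interface {
>             Resolve(k RecordKey) Location
>             Subscribe(k RecordKey)
>             Unsubscribe(k RecordKey)
>     }
> 
>     // Cache holds ephemeral record locations with LRU aging.
>     type Cache struct {
>             leader Leader
>             max    int
>             lru    *list.List // front = most recently used
>             elem   map[RecordKey]*list.Element
>             loc    map[RecordKey]Location
>     }
> 
>     func New(l Leader, max int) *Cache {
>             return &Cache{leader: l, max: max, lru: list.New(),
>                     elem: map[RecordKey]*list.Element{},
>                     loc:  map[RecordKey]Location{}}
>     }
> 
>     func (c *Cache) Lookup(k RecordKey) Location {
>             if e, ok := c.elem[k]; ok {
>                     c.lru.MoveToFront(e)
>                     return c.loc[k]
>             }
>             // Miss: ask a leader, cache ephemerally, and stay
>             // subscribed for changes until the entry ages out.
>             l := c.leader.Resolve(k)
>             c.elem[k] = c.lru.PushFront(k)
>             c.loc[k] = l
>             c.leader.Subscribe(k)
>             if c.lru.Len() > c.max {
>                     old := c.lru.Back()
>                     evicted := old.Value.(RecordKey)
>                     c.lru.Remove(old)
>                     delete(c.elem, evicted)
>                     delete(c.loc, evicted)
>                     c.leader.Unsubscribe(evicted)
>             }
>             return l
>     }
> 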
> *Returning to the current architecture...*
> 
> I was hoping that leadership semantics (lmaster + recmaster) would
> mean that the nodes participating as such (there must be an odd
> number of them, greater than 1) are the only ones that would EVER be
> permitted to take a lock. That, in fact, lmaster+recmaster == "leader
> electable", and otherwise not. And that on every non-electable node
> the recovery lock code should never be called; in fact, until a
> leader informs it that recovery has completed, the startup procedure
> for a "follower" is suspended (sleeps). This would really speak to
> the ability to add new nodes for increased data safety and service
> availability, but that requires the leadership semantics we've spoken
> of privately.
> 
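> As a sketch of that follower gate (an entirely invented API):
> 
>     package follower
> 
>     import "time"
> 
>     // Leader is whatever tells a follower that recovery is done.
>     type Leader interface {
>             RecoveryComplete() bool
>     }
> 
>     // AwaitRecovery suspends a follower's startup until a leader
>     // reports that recovery has completed.
>     func AwaitRecovery(l Leader) {
>             for !l.RecoveryComplete() {
>                     time.Sleep(time.Second)
>             }
>     }
> 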
> In your writeup you say,
> 
> *"At the beginning of each recovery each node checks that its recovery lock
> setting is consistent with that of the recovery master"*
> 
> I take this to mean that the recovery lock entry in ctdb.conf must be
> consistent. This is true, and I ran into it. I would say even more
> explicitly that "the recovery lock lines in ctdb.conf MUST MATCH on
> each participating server."
> 
> I actually don't like this aspect of CTDB; that is, I disagree with
> the choice to do it this way. Two strings may have different syntax
> but identical semantics; such is frequently the case with
> configuration. In the case of CTDB, the command line should be
> permitted to differ; for example, when using etcd, the address of the
> nearest etcd node should be specified, especially if etcd is run in a
> geo-distributed fashion. The work-around we chose was to put all the
> configuration in a separate config.yaml file for our tool, where the
> addresses can differ, and still satisfy the constraint in CTDB that
> the lines be identical (see the sketch below). This check really has
> no value; for the naive user it makes things idiot-proof, but for a
> seasoned team that has this all automated, such errors rarely occur
> and are easily identified.
> 
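> Concretely, the layout looks something like this (the file names,
> tool name, and YAML keys here are invented for illustration):
> 
>     # ctdb.conf -- identical on every node
>     [cluster]
>             recovery lock = !/usr/local/bin/etcd-lock --config /etc/ctdb/lock.yaml
> 
>     # /etc/ctdb/lock.yaml -- may differ per node
>     endpoints:
>       - https://etcd-nearest.example.com:2379
>     lock_key: /samba/ctdb/reclock
> 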
> You also said,
> 
> *"At the end of each recovery each node attempts to take the recovery lock
> and expects this to fail (because the lock is already held by another
> process). These checks are done unconditionally when the recovery lock is
> set, without regard to node capabilities (see CAPABILITIES below)."*
> 
> Is this necessary? What would be the issue if non-electable nodes
> simply waited until the end of recovery? I am guessing CTDB
> communicates with its peers, and the recovering node, once done,
> informs all the other nodes that it is complete, up, and in normal
> status (so they can continue joining the cluster). I don't understand
> the goal of this check if lmaster and recmaster (leadership) are
> turned off.
> 
> Would love to hear your thoughts, and we can do another Zoom to discuss
> with Amitay if you wish.
> 
> Bob
> 
> On Thu, Oct 29, 2020 at 5:31 AM Martin Schwenke <martin at meltin.net> wrote:
> 
> > Hi Bob,
> >
> > On Tue, 27 Oct 2020 15:09:34 +1100, Martin Schwenke via samba
> > <samba at lists.samba.org> wrote:
> >  
> > > On Sun, 25 Oct 2020 20:44:07 -0400, Robert Buck <robert.buck at som.com>
> > > wrote:
> > >  
> > > > We use a Golang-based lock tool that we wrote for CTDB. That tool
> > > > interacts with our 3.4 etcd cluster, and follows the requirements
> > > > specified in the project.
> > > >
> > > > Question, does the external command line tool get called when LMASTER
> > > > and RECMASTER are false? Given a scenario where we have a set of
> > > > processes that have it set to false, then others that have it set to
> > > > true, does the locking tool get called when they're set to false?
> > >
> > > Indeed it does.  There are currently two things conspiring against you:
> > >
> > > * At the start of each recovery a recovery lock consistency check is
> > >   done. Unfortunately, this means the recovery lock can't be left unset
> > >   on nodes that do not have the recmaster capability because then the
> > >   consistency check would fail.
> > >
> > > * At the end of recovery, if the recovery lock is set, all nodes will
> > >   attempt to take the recovery lock and will expect to fail (on the
> > >   leader/master too, since it is being taken from a different process on
> > >   the leader/master).
> > >
> > >   This is meant to be a sanity check but, to be honest, I'm not sure
> > >   whether it really adds any value.  A better option might be to only
> > >   accept recovery-related controls from the current leader/master node,
> > >   banning any other node that is stupid enough to send such a control.
> > >
> > > I need to think about this more...
> > >
> > > One of the problems is that the ideas of recovery master and recovery
> > > lock are historical and they are somewhat dated compared to current
> > > clustering concepts. Recovery master should really be "cluster leader"
> > > and the lock should be "cluster lock".  If we clearly change our
> > > approach in that direction then it makes no sense to check a cluster
> > > lock at recovery time.
> > >
> > > I have a branch that does the technical (but not documentation) parts
> > > of switching to cluster leader and lock... but more work is needed
> > > before this is ready to merge.
> > >  
> > > > IF you say the lock tool still gets called in both cases, then the docs
> > > > need to be updated, and we on our end need to add a special config file
> > > > option to reject lock acquisitions from those nodes that have the CTDB
> > > > options set to false, permitting only those nodes set to true to
> > > > acquire etcd locks.
> > >
> > > Well, the documentation (ctdb(7) manual page) does say:
> > >
> > >   CTDB does sanity checks to ensure that the recovery lock is held as
> > >   expected.
> > >
> > > ;-)
> > >
> > > OK, that's pretty weak!
> > >
> > > I'll try to get some of Amitay's time to discuss what we should do
> > > here...  
> >
> > There are a few possible changes but none of them would really fix
> > things properly.  We have a situation where the recovery lock (which
> > used to be released at the end of each recovery but is now released
> > on election loss) is almost a cluster lock, so we really shouldn't be
> > sanity-checking it at the end of recovery.  However, there's
> > currently no other sane place to check it.
> >
> > So, as you say, the docs need to be updated.  Do you think it would be
> > enough to add to the single sentence above from ctdb(7)?  Something
> > like the following:
> >
> >   CTDB does some recovery lock sanity checks. At the beginning of
> >   each recovery each node checks that its recovery lock setting is
> >   consistent with that of the recovery master.  At the end of each
> >   recovery each node attempts to take the recovery lock and expects
> >   this to fail (because the lock is already held by another process).
> >   These checks are done unconditionally when the recovery lock is
> >   set, without regard to node capabilities (see CAPABILITIES below).
> >
> > How's that?
> >
> > Thanks...
> >
> > peace & happiness,
> > martin
> >
> >  
> 
> -- 
> 
> BOB BUCK
> SENIOR PLATFORM SOFTWARE ENGINEER
> 
> SKIDMORE, OWINGS & MERRILL
> 7 WORLD TRADE CENTER
> 250 GREENWICH STREET
> NEW YORK, NY 10007
> T  (212) 298-9624
> ROBERT.BUCK at SOM.COM


