[PATCH] ctdb: try to fix ctdb endless banning loop

Wed Jun 1 02:28:18 UTC 2016

On 2016-06-01 at 12:15 +1000, Amitay Isaacs wrote:
> On Wed, Jun 1, 2016 at 12:03 PM, Michael Adam <obnox at samba.org> wrote:
> 
> > On 2016-06-01 at 11:05 +1000, Amitay Isaacs wrote:
> > > Hi Michael,
> > >
> > > On Wed, Jun 1, 2016 at 9:39 AM, Michael Adam <obnox at samba.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > We are experiencing indefinite banning of nodes in ctdb.
> > > > This is the pattern:
> > > >
> > > > When a inter-node-nic is brought down on a non-recmaster node,
> > > > the node goes to banned state. But since 4.4, this node never
> > > > comes back in our tests. The reason is that the db's don't
> > > > get frozen.
> > > >
> > >
> > > Can you provide the logs when this is happening?  If the databases are
> > not
> > > getting frozen, then there is something else going wrong.  Once the
> > > controls are sent to freeze the databases, you don't need to re-send the
> > > freeze controls.
> >
> > In our case, the DBs were not frozen since the node was already
> > in RECOVERY_ACTIVE state (by election code) when getting banned.
> > Will try to come up with logs.
> >
> > > Since you are breaking the inter-node connectivity, recmaster cannot tell
> > > the node to go into recovery and freeze the databases.  That's the real
> > > problem.  Hmm, looks like we need to add freezing of databases back in
> > the
> > > banning code.
> > >
> > > The  main reason for removing the freeze from banning was due to very
> > > subtle interaction between recovery and banning.  I am going to clean the
> > > freeze code to remove database priorities.  That should simplify
> > re-adding
> > > freeze in the banning code.
> > >
> > >
> > > > Attached find my first attempt to fix this. See the commit
> > > > message for further explanations and analysis.
> > > >
> > > > I still need to test this more, but wanted to share the patch
> > > > early to get feed-back.
> > > >
> > > > Comments/review/push appreciated...
> > > >
> > >
> > > This is definitely wrong.  The function ctdb_db_all_frozen() should only
> > be
> > > called from ctdb daemon and not from recovery daemon.  The database
> > frozen
> > > state is only stored in ctdb daemon.
> >
> > Right ... thanks!
> > While we're discussing a better patch,
> > I'll make tests with the attached version
> > that just sends the freeze unconditionally,
> > omitting the check that would not work.
> 
> 
> Here's a slightly better patch.

That looks good! Thanks... I was still looking for the right
place to store the 'alrady-did-freeze' info. :-)

Feel free to push with my r-b.
Will also run tests and provide feed-back.

> Also, I have reverted d8f3b490bbb691c9916eed0df5b980c1aef23c85.

As discussed I disagree with the revert (for now).
This is a possible further (re)optimization but
breaks separation that the patches had introduced.

Cheers - Michael
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20160601/85730b83/signature.sig>