[PATCH] ctdb: try to fix ctdb endless banning loop

Amitay Isaacs amitay at gmail.com
Wed Jun 1 02:15:19 UTC 2016


On Wed, Jun 1, 2016 at 12:03 PM, Michael Adam <obnox at samba.org> wrote:

> On 2016-06-01 at 11:05 +1000, Amitay Isaacs wrote:
> > Hi Michael,
> >
> > On Wed, Jun 1, 2016 at 9:39 AM, Michael Adam <obnox at samba.org> wrote:
> >
> > > Hi,
> > >
> > > We are experiencing indefinite banning of nodes in ctdb.
> > > This is the pattern:
> > >
> > > When a inter-node-nic is brought down on a non-recmaster node,
> > > the node goes to banned state. But since 4.4, this node never
> > > comes back in our tests. The reason is that the db's don't
> > > get frozen.
> > >
> >
> > Can you provide the logs when this is happening?  If the databases are
> not
> > getting frozen, then there is something else going wrong.  Once the
> > controls are sent to freeze the databases, you don't need to re-send the
> > freeze controls.
>
> In our case, the DBs were not frozen since the node was already
> in RECOVERY_ACTIVE state (by election code) when getting banned.
> Will try to come up with logs.
>
> > Since you are breaking the inter-node connectivity, recmaster cannot tell
> > the node to go into recovery and freeze the databases.  That's the real
> > problem.  Hmm, looks like we need to add freezing of databases back in
> the
> > banning code.
> >
> > The  main reason for removing the freeze from banning was due to very
> > subtle interaction between recovery and banning.  I am going to clean the
> > freeze code to remove database priorities.  That should simplify
> re-adding
> > freeze in the banning code.
> >
> >
> > > Attached find my first attempt to fix this. See the commit
> > > message for further explanations and analysis.
> > >
> > > I still need to test this more, but wanted to share the patch
> > > early to get feed-back.
> > >
> > > Comments/review/push appreciated...
> > >
> >
> > This is definitely wrong.  The function ctdb_db_all_frozen() should only
> be
> > called from ctdb daemon and not from recovery daemon.  The database
> frozen
> > state is only stored in ctdb daemon.
>
> Right ... thanks!
> While we're discussing a better patch,
> I'll make tests with the attached version
> that just sends the freeze unconditionally,
> omitting the check that would not work.


Here's a slightly better patch.

Also, I have reverted d8f3b490bbb691c9916eed0df5b980c1aef23c85.

Amitay.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ctdb.patches
Type: application/octet-stream
Size: 3295 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20160601/dba1ae6c/ctdb.obj>


More information about the samba-technical mailing list