CTDB_RECOVERY_ACTIVE while in CTDB_RUNSTATE_STARTUP

Tue May 24 07:57:38 UTC 2016

Hi Kenny,

On Tue, May 24, 2016 at 8:38 AM, Kenny Dinh <kdinh at peaxy.net> wrote:

> I found out my previous change was not correct.
>
> I traced the history of the file server/eventscript.c.  At commit
> "fd06167caa2c194e74c651e1374047213c6cd9d5", we updated the function
> ctdb_event_script_callback_v() to allow CTDB_EVENT_INIT to be called while
> ctdb->recovery_mode is ACTIVE.  I believe that CTDB_EVENT_STARTUP should
> also be allowed to be executed while ctdb->recovery_mode is ACTIVE.
>
> Attached is what I think should be the correct fix.
>
> Thanks,
> Kenny
>
>
> On Mon, May 23, 2016 at 1:10 PM, Kenny Dinh <kdinh at peaxy.net> wrote:
>
> > Hello,
> >
> > I saw a one-off error in CTDB that I was not able to reproduce.  My
> > setup has 3 CTDB nodes.  Attached is the ctdb log from the failed node.
> > In the failed CTDB node, the ctdb process (17202) was starting up and
> > its ctdb->runstate was CTDB_RUNSTATE_STARTUP.
> >
> >
> >    1. At 18:42:55, it tried to invoke "ctdb_run_startup" but the
> >    "49.winbind startup" script timed out.  If you look at the attached
> >    log, the recovery mode has been set to ACTIVE just a few seconds
> >    after "ctdb_run_startup" was invoked.
> >    2. At 18:44:48, the "ctdb_run_startup" was rescheduled but the script
> >    was not allowed to run while in recovery mode.  The error was
> >    "*Refusing to run event scripts call 'startup' while in recovery*"
> >    3. From then on, this error kept on repeating.
> >
> > As for the first issue, I don't know why winbind failed to start because
> > I lost the winbind log from that time.
> >
> > For the second issue, it occurs to me that we should not allow recovery
> > mode to be set to ACTIVE if the run state is still in
> > CTDB_RUNSTATE_STARTUP.  Does anyone see why the following would have any
> > issue?
> >
> >
> > diff --git a/server/ctdb_recover.c b/server/ctdb_recover.c
> > index 21e0427..4c5030f 100644
> > --- a/server/ctdb_recover.c
> > +++ b/server/ctdb_recover.c
> > @@ -595,6 +595,12 @@ int32_t ctdb_control_set_recmode(struct ctdb_context *ctdb,
> >         struct ctdb_set_recmode_state *state;
> >         pid_t parent = getpid();
> >
> > +       if (ctdb->runstate < CTDB_STATE_RUNNING &&
> > +           recmode == CTDB_RECOVERY_ACTIVE) {
> > +               DEBUG(DEBUG_ERR, (__location__ " Not setting state to ACTIVE when runstate (%d) is < CTDB_STATE_RUNNING\n"));
> > +               return -1;
> > +       }
> > +
> >         /* if we enter recovery but stay in recovery for too long
> >            we will eventually drop all our ip addresses
> >         */
> >
> >
> >
> >
>

The change you have suggested is not correct.  Run states were introduced
in CTDB to serialize the boot sequence inside CTDB.  The STARTUP run state
is entered after FIRST_RECOVERY, and if the STARTUP run state does not
complete successfully, CTDB will never go to the RUNNING state.
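
To illustrate the ordering, here is a rough sketch in C.  It is a
simplified model, not the CTDB source: the enum and next_runstate() are
hypothetical names, and the numbering is an assumption chosen so that
RUNNING comes out as 5, matching "Set runstate to RUNNING (5)" in your log.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the boot sequence (hypothetical names; the
 * numbering is an assumption that makes RUNNING equal to 5). */
enum runstate {
    RUNSTATE_UNKNOWN = 0,
    RUNSTATE_INIT,
    RUNSTATE_SETUP,
    RUNSTATE_FIRST_RECOVERY,
    RUNSTATE_STARTUP,
    RUNSTATE_RUNNING,
    RUNSTATE_SHUTDOWN
};

/* Advance one step in the boot sequence.  STARTUP only moves on to
 * RUNNING once the "startup" event has succeeded; RUNNING and SHUTDOWN
 * are left unchanged by this helper. */
static enum runstate next_runstate(enum runstate cur, bool startup_ok)
{
    assert(cur <= RUNSTATE_SHUTDOWN);

    if (cur == RUNSTATE_STARTUP && !startup_ok) {
        return RUNSTATE_STARTUP;    /* stay here until startup succeeds */
    }
    if (cur >= RUNSTATE_RUNNING) {
        return cur;
    }
    return (enum runstate)(cur + 1);
}
```

In this model a node in FIRST_RECOVERY always proceeds to STARTUP, and it
can only leave STARTUP for RUNNING when startup_ok is true.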

The other issue is that the "startup" event cannot be run while recovery
is active.  Smbd and winbindd try to attach to various databases when
starting up, and those attaches will fail if recovery is active.  If the
"startup" event fails for some reason, it is retried as long as recovery
is not active.  In your case, the node seems to be stuck in recovery for
15 minutes, but eventually startup succeeds.

    2016/05/04 18:43:25.123261 [17202]: Event script '49.winbind startup ' timed out after 29.4s, count: 0, pid: 17658
    2016/05/04 18:43:25.123288 [17202]: startup event failed
    [...]
    2016/05/04 18:58:13.169013 [17202]: startup event OK - enabling monitoring
    2016/05/04 18:58:13.169036 [17202]: Set runstate to RUNNING (5)

As far as I can see there is nothing wrong with the startup event or the
STARTUP run state processing.
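
The retry behaviour described above can be modelled roughly as follows.
This is only a sketch under my own assumptions, not the code in
server/eventscript.c: struct node, try_startup() and the refused counter
are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the recovery mode of a node. */
enum recmode { RECOVERY_NORMAL = 0, RECOVERY_ACTIVE = 1 };

struct node {
    enum recmode recovery_mode;
    bool running;   /* true once the "startup" event has succeeded */
    int refused;    /* attempts refused because recovery was active */
};

/* One retry of the "startup" event.  The attempt is refused outright
 * while recovery is active; otherwise the outcome depends on whether
 * the scripts succeed.  Returns true once the node is running. */
static bool try_startup(struct node *n, bool scripts_ok)
{
    assert(n != NULL);

    if (n->running) {
        return true;
    }
    if (n->recovery_mode == RECOVERY_ACTIVE) {
        n->refused++;       /* "Refusing to run event scripts ..." */
        return false;       /* the caller reschedules the attempt */
    }
    if (!scripts_ok) {
        return false;       /* e.g. a script timed out; retry later */
    }
    n->running = true;
    return true;
}
```

In this model a node that stays in recovery keeps logging refusals on
every retry, and the first attempt after recovery ends is what gets it to
the running state, which is the pattern your log shows.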

However, there is a very nasty deadlock bug in CTDB 2.5.x which can
prevent recovery from completing or make it take a really long time.
Maybe you are seeing that issue, and that is why database recovery is
taking 15 minutes to complete.  I would recommend switching to CTDB 4.4.x,
where this deadlock bug has been resolved.

Amitay.