CTDB_RECOVERY_ACTIVE while in CTDB_RUNSTATE_STARTUP

Kenny Dinh kdinh at peaxy.net
Tue May 24 16:20:56 UTC 2016


Amitay,

Based on the change history for Samba 4.3.x and 4.4.x, it looks like the
fix for the deadlock issue you mentioned is only available in the 4.4.x
branch, not in the 4.3.x branch.

Thank you for the insight!
~Kenny

On Tue, May 24, 2016 at 12:57 AM, Amitay Isaacs <amitay at gmail.com> wrote:

> Hi Kenny,
>
> On Tue, May 24, 2016 at 8:38 AM, Kenny Dinh <kdinh at peaxy.net> wrote:
>
>> I found out my previous change was not correct.
>>
>> I traced the history of the file server/eventscript.c.  At commit
>> "fd06167caa2c194e74c651e1374047213c6cd9d5", we updated the function
>> ctdb_event_script_callback_v() to allow CTDB_EVENT_INIT to be called while
>> ctdb->recovery_mode is ACTIVE.  I believe that CTDB_EVENT_STARTUP should
>> also be allowed to run while ctdb->recovery_mode is ACTIVE.
>>
>> Attached is what I think should be the correct fix.
>>
>> Thanks,
>> Kenny
>>
>>
>> On Mon, May 23, 2016 at 1:10 PM, Kenny Dinh <kdinh at peaxy.net> wrote:
>>
>> > Hello,
>> >
>> > I saw a one-off error in CTDB that I was not able to reproduce.  My
>> > setup has 3 CTDB nodes.  Attached is the ctdb log from the failed node.
>> > On the failed node, the ctdb process (17202) was starting up and its
>> > ctdb->runstate was CTDB_RUNSTATE_STARTUP.
>> >
>> >
>> >    1. At 18:42:55, it tried to invoke "ctdb_run_startup" but the
>> >    "49.winbind startup" script timed out.  If you look at the attached
>> >    log, the recovery mode was set to ACTIVE just a few seconds after
>> >    "ctdb_run_startup" was invoked.
>> >    2. At 18:44:48, "ctdb_run_startup" was rescheduled but the script
>> >    was not allowed to run while in recovery mode.  The error was
>> >    "*Refusing to run event scripts call 'startup' while in recovery*"
>> >    3. From then on, this error kept repeating.
>> >
>> > As for the first issue, I don't know why winbind failed to start
>> > because I lost the winbind log from that time.
>> >
>> > For the second issue, it occurs to me that we should not allow recovery
>> > mode to be set to ACTIVE while the run state is still
>> > CTDB_RUNSTATE_STARTUP.  Does anyone see a problem with the following
>> > change?
>> >
>> >
>> > diff --git a/server/ctdb_recover.c b/server/ctdb_recover.c
>> > index 21e0427..4c5030f 100644
>> > --- a/server/ctdb_recover.c
>> > +++ b/server/ctdb_recover.c
>> > @@ -595,6 +595,12 @@ int32_t ctdb_control_set_recmode(struct ctdb_context *ctdb,
>> >         struct ctdb_set_recmode_state *state;
>> >         pid_t parent = getpid();
>> >
>> > +       if (ctdb->runstate < CTDB_RUNSTATE_RUNNING &&
>> > +           recmode == CTDB_RECOVERY_ACTIVE) {
>> > +               DEBUG(DEBUG_ERR, (__location__ " Not setting state to ACTIVE when runstate (%d) is < CTDB_RUNSTATE_RUNNING\n", ctdb->runstate));
>> > +               return -1;
>> > +       }
>> > +
>> >         /* if we enter recovery but stay in recovery for too long
>> >            we will eventually drop all our ip addresses
>> >         */
>> >
>> >
>> >
>> >
>>
>
> The change you have suggested is not correct.  Run states were introduced
> in CTDB to serialize the boot sequence inside CTDB.  The STARTUP run state
> comes after FIRST_RECOVERY.  If STARTUP does not complete successfully,
> CTDB will never go to the RUNNING state.
>
> The other issue is that the "startup" event cannot be run when recovery
> is active.  Smbd and winbindd will try to attach to various databases when
> starting up, which will fail if recovery is active.  If for some reason
> the "startup" event fails, it will be retried as long as recovery is not
> active.  In your case, the node seems to be stuck in recovery for 15
> minutes, but eventually startup succeeds.
>
>     2016/05/04 18:43:25.123261 [17202]: Event script '49.winbind startup ' timed out after 29.4s, count: 0, pid: 17658
>     2016/05/04 18:43:25.123288 [17202]: startup event failed
>     [...]
>     2016/05/04 18:58:13.169013 [17202]: startup event OK - enabling monitoring
>     2016/05/04 18:58:13.169036 [17202]: Set runstate to RUNNING (5)
>
> As far as I can see, there is nothing wrong with the startup event or the
> STARTUP run state processing.
>
> However, there is a very nasty deadlock bug in CTDB 2.5.x, which can
> prevent recovery from completing or make it take a really long time.
> Maybe you are seeing that issue, and that's why database recovery is
> taking 15 minutes to complete.  I would recommend switching to CTDB 4.4.x,
> where this deadlock bug has been resolved.
>
> Amitay.
>
>
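
The behaviour Amitay describes above (STARTUP follows FIRST_RECOVERY, the
"startup" event is refused while recovery is active, and a failed or refused
attempt is retried later) can be illustrated with a small C model.  This is
only a sketch under stated assumptions: every type and function name below
is hypothetical and chosen for illustration, and none of it is taken from
the CTDB source.

```c
#include <stdbool.h>

/* Hypothetical model only: these names are NOT the CTDB identifiers. */
enum model_runstate {
	MODEL_FIRST_RECOVERY,   /* must complete before STARTUP */
	MODEL_STARTUP,          /* "startup" event still pending */
	MODEL_RUNNING           /* reached only after "startup" succeeds */
};

enum model_recmode { MODEL_REC_NORMAL, MODEL_REC_ACTIVE };

enum model_event { MODEL_EV_INIT, MODEL_EV_STARTUP };

struct model_node {
	enum model_runstate runstate;
	enum model_recmode recmode;
};

/*
 * Gate for event scripts.  Per the thread, "init" may run while recovery
 * is active but "startup" may not, because the services it starts attach
 * to databases and those attaches fail during recovery.
 */
static bool model_event_allowed(const struct model_node *n,
				enum model_event ev)
{
	if (n->recmode == MODEL_REC_NORMAL) {
		return true;
	}
	return ev == MODEL_EV_INIT;
}

/*
 * One attempt at the "startup" event.  script_ok stands in for the event
 * script's result (false models a failure or timeout).  Returns true only
 * when the node advances to RUNNING; on false the caller is expected to
 * reschedule another attempt.
 */
static bool model_try_startup(struct model_node *n, bool script_ok)
{
	if (n->runstate != MODEL_STARTUP) {
		return false;   /* wrong phase of the boot sequence */
	}
	if (!model_event_allowed(n, MODEL_EV_STARTUP)) {
		return false;   /* refused while in recovery */
	}
	if (!script_ok) {
		return false;   /* script failed or timed out */
	}
	n->runstate = MODEL_RUNNING;
	return true;
}
```

In this model a node in MODEL_STARTUP with recovery active stays in
MODEL_STARTUP no matter how often model_try_startup() is called, which
matches the repeated "Refusing to run event scripts" messages in the log.
Once recovery mode returns to normal and the script succeeds, the next
attempt moves the node to MODEL_RUNNING.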

