[PATCH]: Inconsistent recmaster during election.

Martin Schwenke martin at meltin.net
Thu Dec 31 01:07:55 UTC 2015

On Tue, 29 Dec 2015 10:39:26 -0800, Kenny Dinh <kdinh at peaxy.net> wrote:

> [PATCH] updated.
> I have a more targeted fix in the new version of my patch.  Please
> disregard my previous patch.  In the new patch, we should let the recovery
> master win the election if the recmaster sent the election message and we
> agreed that it is the recovery master.
> Please review and push if it looks good.

Just a note to say that this isn't being ignored...  :-)

I don't think I've seen this particular problem before but I've
probably seen similar ones. I assume that this is an inconsistency
between the main daemon and recovery daemon about which node is the
recovery master.  I need to get some clear time to take a closer look at
your logs to understand this correctly. That might not happen until next

Can you recreate this problem?  Every time?

rec->ctdb->recovery_master shouldn't really be used in the recovery
daemon.  I know that it is already used.  :-(  However, it is only set
in the monitor handler, so there are corner cases where it will be
incorrect (e.g. election called due to capability change).

I think rec->recmaster is a better choice but in current versions it
might suffer from similar problems.  However, I'm still not sure if
your patch is the best way of approaching the problem... since I don't
yet understand the problem.  ;-)

In Samba master I have made quite a few changes to the handling of the
recovery master.  The recovery daemon now tracks election results
internally and remembers who is master.  It only ever retrieves the
recovery master from remote nodes for consistency checking.  This
should eliminate a whole class of bugs.

The longer term plan is that elections, tracking of the master node,
cluster membership and keep-alives will all be done in a cluster
management daemon.

More once I've had time to understand the logs...

peace & happiness,

