[PATCH]: Inconsistent recmaster during election.

Martin Schwenke martin at meltin.net
Thu Jan 21 06:21:26 UTC 2016


Hi Kenny,

On Wed, 20 Jan 2016 19:03:40 -0800, Kenny Dinh <kdinh at peaxy.net> wrote:

> > "We need to figure out if some changes to winbind mean that this
> > particular smbcontrol is no longer required."
> 
> Do you mean recent patches to winbind?  Do you want me to test with samba
> 4.2.x or any newer version of samba, or would samba 4.2.3-10 and ctdb 2.5.4
> be fine?

I thought I saw some changes go by where winbind tracks IP addresses
itself, or similar.  I don't have enough context or time to be able to
dig into this right now.  But I am wondering if the smbcontrol is
needed at all.

> Secondly, I was under the impression that you were going to work toward
> what Volker suggested.
> 
> [Volker] - "So maybe it's time to implement reading from the registry
> without messing with ctdb"

That's tough.  As Volker mentioned, in clustered Samba the registry
is managed by CTDB.  There are potentially some registry entries that
will not usually change at run-time, so they could be retrieved from
the local copy without involving CTDB.  However, that's not something I
have time or expertise to dig into right now.

> From your reply, I am getting the impression that reading from the registry
> is not an issue.  The issue is that smbcontrol needs to connect to CTDB as a
> client and it could not do so while ctdb is in recovery mode.  Sorry for
> repeating your words but I wanted to clarify.

That's OK.  The reason that smbcontrol is connecting to CTDB is
probably to try to read the registry.

> If that is the case, one option is to do what you suggested.  Another
> option is to allow the recovery to go ahead without nodes that are in the
> process of shutting down.  Once the recovery is completed, "smbcontrol
> winbindd ip-dropped" will get unblocked.  To omit nodes during recovery,
> nodes that are going down could respond to CTDB_CONTROL_GET_RECMASTER with
> (-1) and the RECMASTER could ignore (-1) in verify_recmaster_callback().  I
> tried this workaround and it worked but I did not mention it because I
> thought the correct fix is what Volker suggested above.

In the past, I have thought about adding an extra inactive state for
use during shutdown.  That would automatically exclude nodes that are
shutting down from recoveries.  It would be a fairly small change.

However, it only avoids the problem in one case.  If, instead of
shutting down, a node in the cluster becomes unhealthy and, while the IP
takeover run is in progress, a node causes a recovery by becoming
inactive, the same thing can happen.

At this point, my interest is in determining how necessary/important
the smbcontrol is and in considering whether to allow it to time out.
In the best case we could remove it altogether.

Then we need to go through the scripts and see where else this type of
issue can occur...

> I will test out your suggestion as soon as I get a hold of test hardware.

Thanks.  I just noticed that you probably don't need to use the external
timeout command.  smbcontrol takes a -t option, so you could just try:

  smbcontrol -t 5 winbindd ip-dropped $ip >/dev/null 2>/dev/null
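
If it helps with testing, here is a rough sketch of how that call might
be wrapped so that a failure or timeout is logged but never fails the
caller.  The function name notify_ip_dropped and the "always return 0"
policy are assumptions for illustration only, not something in the
current event scripts, and whether -t actually bounds the time spent
connecting to CTDB during recovery is exactly what needs to be tested:

```shell
# Hypothetical helper (not part of the CTDB event scripts).
# Tell winbindd that an IP address has been dropped, giving up after
# a few seconds instead of blocking for the whole recovery.
notify_ip_dropped ()
{
    _ip="$1"
    _timeout="${2:-5}"  # seconds; 5 is just the value suggested above

    # Use smbcontrol's own -t option rather than an external timeout
    # command.  Output is discarded, as in the one-liner above.
    if ! smbcontrol -t "$_timeout" winbindd ip-dropped "$_ip" \
            >/dev/null 2>&1 ; then
        echo "WARNING: smbcontrol winbindd ip-dropped $_ip failed or timed out" >&2
    fi

    # Assumption: failing to notify winbindd should not fail the event.
    return 0
}
```

You would call it as "notify_ip_dropped $ip", or with a second argument
to try a different timeout.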

peace & happiness,
martin


