[PATCH]: Inconsistent recmaster during election.

Kenny Dinh kdinh at peaxy.net
Tue Jan 5 18:51:32 UTC 2016


Hi Martin,

I tried to send you a set of new logs yesterday.  The email was rejected
by the samba mailing list, but the rejection notice didn't arrive until
this morning.  I'm attaching the zipped version of the logs.  By the way,
my old logs were captured at NOTICE, not ERR.  During shutdown, ctdb seems
to be dropping debug messages.  I'm running ctdb version 4.1.13.

I have a new set of logs.  The first ban occurred while ctdb was running
at NOTICE level and the second ban occurred while ctdb was running at INFO
level.  Hopefully, there is better debug info.  This time I didn't run
into the slow locking issue.

In the first set, the ban occurred at 10:19.

In the second set, the ban occurred at 11:13:51.  This time, ctdb is
running at INFO level.

From the attached winbindd log, it looks like "smbcontrol winbindd
ip-dropped" completed successfully and the reply was sent to the client in
both cases.

I found the issue to be the following.  The Recmaster was restarted.  The
Banned node forced an election because the Recmaster was disconnected.
Forcing the election also set ctdb->recovery_mode to CTDB_RECOVERY_ACTIVE.
At the same time, the Banned node also called ctdb_deferred_drop_all_ips(ctdb).

When the Recmaster was restarted, it claimed the recovery master role and
triggered an election.  Because the Banned node kept replying that it
thought it was the recovery master, it got banned.

  I think "ctdb_deferred_drop_all_ips(ctdb)" is what hung the "smbcontrol
winbindd ip-dropped".  When ctdb_control_set_recmode() is called to set
recovery_mode to "NORMAL", the ips were properly released.
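
To make sure I'm describing the ordering clearly, here is a small
standalone model of what I think is happening.  This is not the real ctdb
code; every name below is a made-up stand-in for ctdb->recovery_mode,
ctdb_deferred_drop_all_ips() and ctdb_control_set_recmode():

    /* Toy model only -- none of these names are the real ctdb ones. */
    #include <stdbool.h>
    #include <stdio.h>

    enum recmode { RECOVERY_NORMAL, RECOVERY_ACTIVE };

    struct node {
            enum recmode recmode;        /* stands in for ctdb->recovery_mode */
            bool drop_all_ips_deferred;  /* queued by the deferred IP drop */
            bool ips_released;           /* what shutdown is waiting for */
    };

    /* Losing the Recmaster forces an election: recovery goes ACTIVE and the
     * IP drop is deferred (ctdb_deferred_drop_all_ips() in the real code). */
    static void force_election(struct node *n)
    {
            n->recmode = RECOVERY_ACTIVE;
            n->drop_all_ips_deferred = true;
    }

    /* The deferred drop only runs once recovery mode goes back to NORMAL
     * (ctdb_control_set_recmode() in the real code). */
    static void set_recmode(struct node *n, enum recmode m)
    {
            n->recmode = m;
            if (m == RECOVERY_NORMAL && n->drop_all_ips_deferred) {
                    n->drop_all_ips_deferred = false;
                    n->ips_released = true;
            }
    }

    int main(void)
    {
            struct node banned = { RECOVERY_NORMAL, false, false };

            force_election(&banned);
            /* Shutdown now waits for ips_released, but nothing on this node
             * ever calls set_recmode(&banned, RECOVERY_NORMAL), so the IP
             * release (and the winbindd notification) never completes. */
            printf("ips_released=%d\n", banned.ips_released);
            return 0;
    }

In other words, the deferred drop is gated on a recovery-mode change that
never arrives on the Banned node, so the shutdown-time IP release never
finishes.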


I tried out a quick fix: on the Banned node, return (-1) when responding to
CTDB_CONTROL_GET_RECMASTER, and on the Recmaster, discard the (-1) reply.
This allows the Recmaster to continue the recovery process, which involves
sending CTDB_RECOVERY_NORMAL to all nodes.  Upon receiving the
CTDB_RECOVERY_NORMAL control, the Banned node resumed the IP release,
completed the shutdown and restarted.
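
The attached patch is the actual change; the idea, as a standalone sketch
with simplified stand-ins rather than the real ctdb structures and
handlers, is roughly:

    /* Sketch of the hack -- simplified stand-ins, not the real ctdb code. */
    #include <stdbool.h>
    #include <stdint.h>

    struct node_state {
            bool banned;
            uint32_t recmaster;    /* who this node thinks is the recmaster */
    };

    /* Banned node: answer CTDB_CONTROL_GET_RECMASTER with -1 instead of
     * reporting a stale recmaster. */
    static int32_t reply_get_recmaster(const struct node_state *n)
    {
            if (n->banned) {
                    return -1;
            }
            return (int32_t)n->recmaster;
    }

    /* Recmaster's recovery daemon: when checking that every node agrees on
     * the recmaster, skip the -1 replies instead of treating them as a
     * disagreement (which is what currently leads to the ban). */
    static bool recmaster_consistent(const int32_t *replies, int count,
                                     uint32_t expected)
    {
            int i;

            for (i = 0; i < count; i++) {
                    if (replies[i] == -1) {
                            continue;       /* banned / shutting-down node */
                    }
                    if ((uint32_t)replies[i] != expected) {
                            return false;
                    }
            }
            return true;
    }

    int main(void)
    {
            struct node_state banned = { true, 1 };  /* still claims node 1 */
            int32_t replies[2];

            replies[0] = 0;                            /* the new Recmaster */
            replies[1] = reply_get_recmaster(&banned); /* -1 with the hack */

            /* The -1 is ignored, so the check passes and no ban is issued. */
            return recmaster_consistent(replies, 2, 0) ? 0 : 1;
    }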

However, the 'hack' has an issue.  If the Recmaster somehow dies in the
middle of recovery, the Banned node will stay stuck in the "hung" state
until it receives a CTDB_RECOVERY_NORMAL control.

The correct fix is for the Banned node to send itself a
CTDB_RECOVERY_NORMAL during shutdown if its recovery_mode is "ACTIVE", to
thaw its databases and resume the IP release operation.  That sounds like a
few more changes, so I wanted to check with you first.
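
In terms of the toy model from earlier in this mail, the fix would look
something like this (again only a sketch with made-up names, not a patch
against ctdb):

    /* Sketch of the proposed fix, reusing struct node / set_recmode() from
     * the toy model above.  On shutdown, a node that is still in ACTIVE
     * recovery puts itself back into NORMAL mode first, which thaws the
     * databases and lets the deferred IP drop run. */
    static void node_shutdown(struct node *n)
    {
            if (n->recmode == RECOVERY_ACTIVE) {
                    /* equivalent of sending ourselves CTDB_RECOVERY_NORMAL */
                    set_recmode(n, RECOVERY_NORMAL);
            }
            /* ...now the wait for ips_released can complete and the rest of
             * the shutdown proceeds as usual... */
    }

That way the Banned node no longer depends on the Recmaster surviving long
enough to finish the recovery.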

I'm attaching the "hack" and the set of log files.  The winbindd log from
the Banned node is also included.

Thanks again for your time,
~Kenny


On Mon, Jan 4, 2016 at 12:07 PM, Kenny Dinh <kdinh at peaxy.net> wrote:

> Hi Martin,
>
> Sorry for the time sync issue.  My old logs were running at NOTICE not
> ERR.  I have a new set of logs.  The first ban occurred while ctdb was
> running at NOTICE level and the second ban occurred while ctdb was running
> at INFO level.  Hopefully, there is better debug info.  This time I don't
> have any issue with slow locking issue.
>
> In the first set, the ban occurred at 10:19.
>
> In the second set, the ban occurred at 11:13:51.  This time, ctdb is
> running at INFO level.
>
> From the attached winbindd log, it looks like "smbcontrol  winbindd
> ip-dropped" completed successfully and the reply was sent to client, in
> both cases.
>
> It looks to me like the banned node was in the process of shutting down
> but it was still responding to election messages.
>
> Can I bother you to confirm this?
>
> Thanks again for your time,
> ~Kenny
>
>
> On Sun, Jan 3, 2016 at 6:49 PM, Martin Schwenke <martin at meltin.net> wrote:
>
>> Hi Kenny,
>>
>> On Wed, 30 Dec 2015 17:34:22 -0800, Kenny Dinh <kdinh at peaxy.net> wrote:
>>
>> > The timing has to be right in order to reproduce the issue.  I'm using a
>> > script that restarts the RECMASTER, then immediately restarts the other
>> > node.  With the test script, I can consistently reproduce the issue.
>>
>> * There looks to be a time difference of about 6 seconds between the 2
>>   nodes.  That makes comparing the logs slightly non-trivial.
>>
>> * For the 1st of the 2 bans, there are no logs from the recmaster
>>   node.  However, I think I can make a reasonable guess about what
>>   happened.
>>
>>     2015/12/28 11:58:04.298224 [recoverd: 3175]: Recovery lock taken successfully
>>     2015/12/28 11:58:04.298452 [ 2990]: High RECLOCK latency 50.017622s for operation recd reclock
>>
>>   So node 0 thinks it is master and the above means it took 50 seconds
>>   to be able to take an fcntl(2) lock on the cluster filesystem.
>>   During this time the recovery daemon will block and does not respond
>>   to requests. When node 1 comes up it causes an election, which it
>>   wins because it gets no responses.  Node 1 then probably
>>   notices the inconsistency of recmaster and bans node 0.
>>
>>   I'm doing some work at the moment to move the handling of the
>>   recovery master into a helper process.  One side effect of this
>>   should be that the recovery daemon no longer blocks if it takes a long
>>   time to take the lock.
>>
>>   However, you need to take a look and see why your cluster filesystem
>>   is taking 50s to allow a lock to be taken.
>>
>> * For the 2nd of the 2 bans, it looks like the ban occurs during
>>   shutdown:
>>
>>     2015/12/28 12:12:20.514600 [ 2990]: Received SHUTDOWN command.
>>     2015/12/28 12:12:20.514681 [ 2990]: Monitoring has been stopped
>>     2015/12/28 12:12:26.636033 [ 2990]: Freeze priority 1
>>     2015/12/28 12:12:26.636377 [ 2990]: Freeze priority 2
>>     2015/12/28 12:12:26.636684 [ 2990]: Freeze priority 3
>>     2015/12/28 12:12:29.643525 [ 2990]: Freeze priority 1
>>     2015/12/28 12:12:29.643911 [ 2990]: Freeze priority 2
>>     2015/12/28 12:12:29.644270 [ 2990]: Freeze priority 3
>>     2015/12/28 12:12:32.651516 [ 2990]: Freeze priority 1
>>     2015/12/28 12:12:32.651957 [ 2990]: Freeze priority 2
>>     2015/12/28 12:12:32.652349 [ 2990]: Freeze priority 3
>>     2015/12/28 12:12:35.659636 [ 2990]: Freeze priority 1
>>     2015/12/28 12:12:35.660188 [ 2990]: Freeze priority 2
>>     2015/12/28 12:12:35.660635 [ 2990]: Freeze priority 3
>>     2015/12/28 12:12:38.665811 [ 2990]: Banning this node for 300 seconds
>>     2015/12/28 12:12:38.665910 [ 2990]: server/ctdb_takeover.c:3303 Released 0 public IPs
>>     2015/12/28 12:12:38.667699 [ 2990]: This node has been banned - forcing freeze and recovery
>>     2015/12/28 12:12:38.667775 [ 2990]: server/ctdb_takeover.c:3303 Released 0 public IPs
>>
>>   The most likely answer here is that the recovery daemon has been
>>   shut down (this is logged at NOTICE level, and I assume you're using
>>   default ERR level), so the node is no longer taking part in
>>   elections:
>>
>>     2015/12/28 12:12:17.370578 [recoverd:11292]: server/ctdb_recoverd.c:3760 Current recmaster node 0 does not have CAP_RECMASTER, but we (node 1) have - force an election
>>     2015/12/28 12:12:17.370978 [11107]: Freeze priority 3
>>     2015/12/28 12:12:17.371226 [11107]: This node (1) is now the recovery master
>>     2015/12/28 12:12:17.663482 [11107]: 172.16.150.8:4379: connected to 172.16.150.12:4379 - 1 connected
>>     2015/12/28 12:12:17.665621 [11107]: CTDB_WAIT_UNTIL_RECOVERED
>>     2015/12/28 12:12:18.665789 [11107]: CTDB_WAIT_UNTIL_RECOVERED
>>     2015/12/28 12:12:19.665965 [11107]: CTDB_WAIT_UNTIL_RECOVERED
>>     2015/12/28 12:12:20.374488 [recoverd:11292]: Election period ended
>>
>>   The main daemon is waiting for IP addresses to be released, but this
>>   is hanging when trying to talk to winbind.  This means the main
>>   daemon keeps returning the old master node, causing the sanity check
>>   on node 1 to fail and resulting in a ban.
>>
>>   The ban during shutdown is really of no consequence.  It just takes
>>   the node out of the cluster a little earlier than would otherwise
>>   have happened.
>>
>>   The hung script during shutdown needs to be understood.
>>   Unfortunately, the hung script debugging looks incomplete.  The most
>>   likely problem is that (for some reason) "smbcontrol winbindd
>>   ip-dropped" is taking a long time.  Alternatively, if the hung
>>   script debugging is very incomplete then the problem could be that
>>   stat(1)-ing the reclock file in 01.reclock is taking a long time,
>>   which would be consistent with the other case.
>>
>> So, both bans look to be reasonable - the nodes are not interacting as
>> expected.  The first could be avoided by taking the recovery lock in a
>> more asynchronous manner, which I'm already working on... but it is a
>> major change.  I don't think there is a simple patch that will fix what
>> you're seeing.
>>
>> I think you need to look at the locking performance of your cluster
>> filesystem.
>>
>> If you would like more analysis then please synchronise the clocks on
>> the 2 nodes and recreate with debug level 2/NOTICE.  That should give
>> us better and more reliable information.  :-)
>>
>> peace & happiness,
>> martin
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch
Type: application/octet-stream
Size: 1302 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20160105/4a0340ff/patch-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 11_13_INFO.zip
Type: application/zip
Size: 267361 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20160105/4a0340ff/11_13_INFO-0001.zip>

