[PATCH]: Inconsistent recmaster during election.

Mon Jan 4 02:49:21 UTC 2016

Hi Kenny,

On Wed, 30 Dec 2015 17:34:22 -0800, Kenny Dinh <kdinh at peaxy.net> wrote:

> The timing has to be right in order to reproduce the issue.  I'm using a
> script that restarts the RECMASTER, then immediately restarts the other
> node.  With the test script, I can consistently reproduce the issue.

* There looks to be a time difference of about 6 seconds between the 2
  nodes.  That makes comparing the logs slightly non-trivial.

* For the 1st of the 2 bans, there are no logs from the recmaster
  node.  However, I think I can make a reasonable guess about what
  happened.

    2015/12/28 11:58:04.298224 [recoverd: 3175]: Recovery lock taken
    successfully 2015/12/28 11:58:04.298452 [ 2990]: High RECLOCK
    latency 50.017622s for operation recd reclock

  So node 0 thinks it is master and the above means it took 50 seconds
  to be able to take an fcntl(2) lock on the cluster filesystem.
  During this time the recovery daemon will block and does not respond
  to requests. When node 1 comes up it causes an election, which it
  wins because it gets no responses.  Node 1 then probably
  notices the inconsistency of recmaster and bans node 0.

  I'm doing some work at the moment to move the handling of the
  recovery master into a helper process.  One side effect of this
  should be that the recovery daemon no longer blocks if it take a long
  time to take the lock.

  However, you need to take a look and see why your cluster filesystem
  is taken 50s to allow a lock to be taken.

* For the 2nd of the 2 bans, it looks like the ban occurs during
  shutdown:

    2015/12/28 12:12:20.514600 [ 2990]: Received SHUTDOWN command.
    2015/12/28 12:12:20.514681 [ 2990]: Monitoring has been stopped
    2015/12/28 12:12:26.636033 [ 2990]: Freeze priority 1
    2015/12/28 12:12:26.636377 [ 2990]: Freeze priority 2
    2015/12/28 12:12:26.636684 [ 2990]: Freeze priority 3
    2015/12/28 12:12:29.643525 [ 2990]: Freeze priority 1
    2015/12/28 12:12:29.643911 [ 2990]: Freeze priority 2
    2015/12/28 12:12:29.644270 [ 2990]: Freeze priority 3
    2015/12/28 12:12:32.651516 [ 2990]: Freeze priority 1
    2015/12/28 12:12:32.651957 [ 2990]: Freeze priority 2
    2015/12/28 12:12:32.652349 [ 2990]: Freeze priority 3
    2015/12/28 12:12:35.659636 [ 2990]: Freeze priority 1
    2015/12/28 12:12:35.660188 [ 2990]: Freeze priority 2
    2015/12/28 12:12:35.660635 [ 2990]: Freeze priority 3
    2015/12/28 12:12:38.665811 [ 2990]: Banning this node for 300
    seconds 2015/12/28 12:12:38.665910 [ 2990]:
    server/ctdb_takeover.c:3303 Released 0 public IPs 2015/12/28
    12:12:38.667699 [ 2990]: This node has been banned - forcing freeze
    and recovery 2015/12/28 12:12:38.667775 [ 2990]:
    server/ctdb_takeover.c:3303 Released 0 public IPs

  The most likely answer here is that the recovery daemon has been
  shut down (this is logged at NOTICE level, and I assume you're using
  default ERR level), so the node is no longer taking part in
  elections:

    2015/12/28 12:12:17.370578 [recoverd:11292]:
    server/ctdb_recoverd.c:3760 Current recmaster node 0 does n ot have
    CAP_RECMASTER, but we (node 1) have - force an election 2015/12/28
    12:12:17.370978 [11107]: Freeze priority 3 2015/12/28
    12:12:17.371226 [11107]: This node (1) is now the recovery master
    2015/12/28 12:12:17.663482 [11107]: 172.16.150.8:4379: connected to
    172.16.150.12:4379 - 1 connected 2015/12/28 12:12:17.665621
    [11107]: CTDB_WAIT_UNTIL_RECOVERED 2015/12/28 12:12:18.665789
    [11107]: CTDB_WAIT_UNTIL_RECOVERED 2015/12/28 12:12:19.665965
    [11107]: CTDB_WAIT_UNTIL_RECOVERED 2015/12/28 12:12:20.374488
    [recoverd:11292]: Election period ended

  The main daemon is waiting for IP addresses to be released but this
  is hanging when trying to talk to winbind.  This means the main
  daemon keeps returning the old master node, causing the sanity check
  on node 1 to fail, resulting in a ban:

  The ban during shutdown is really of no consequence.  It takes the
  node out of the cluster a little earlier than would otherwise have
  happened

  The hung script during shutdown needs to be understood.
  Unfortunately, the hung script debugging looks incomplete.  The most
  likely problem is that (for some reason) "smbcontrol winbindd
  ip-dropped" is taking a long time.  Alternatively, if the hung
  script debugging is very incomplete then the problem could be that
  stat(1)-ing the reclock file in 01.reclock is taking a long time,
  which would be consistent with the other case.

So, both bans look to be reasonable - the nodes are not interacting as
expected.  The first could be avoided by taking the recovery lock in a
more asynchronous manner, which I'm already working on... but it is a
major change.  I don't think there is a simple patch that will fix what
you're seeing.

I think you need to look at the locking performance of your cluster
filesystem.

If you would like more analysis then please synchronise the clocks on
the 2 nodes and recreate with debug level 2/NOTICE.  That should give
us better and more reliable information.  :-)

peace & happiness,
martin