[PATCH]: Inconsistent recmaster during election.
Martin Schwenke
martin at meltin.net
Mon Jan 4 02:49:21 UTC 2016
Hi Kenny,
On Wed, 30 Dec 2015 17:34:22 -0800, Kenny Dinh <kdinh at peaxy.net> wrote:
> The timing has to be right in order to reproduce the issue. I'm using a
> script that restarts the RECMASTER, then immediately restarts the other
> node. With the test script, I can consistently reproduce the issue.
* There looks to be a time difference of about 6 seconds between the 2
nodes. That makes comparing the logs slightly non-trivial.
* For the 1st of the 2 bans, there are no logs from the recmaster
node. However, I think I can make a reasonable guess about what
happened.
2015/12/28 11:58:04.298224 [recoverd: 3175]: Recovery lock taken successfully
2015/12/28 11:58:04.298452 [ 2990]: High RECLOCK latency 50.017622s for operation recd reclock
So node 0 thinks it is master and the above means it took 50 seconds
to be able to take an fcntl(2) lock on the cluster filesystem.
During this time the recovery daemon will block and does not respond
to requests. When node 1 comes up it causes an election, which it
wins because it gets no responses. Node 1 then probably
notices the inconsistency of recmaster and bans node 0.
I'm doing some work at the moment to move the handling of the
recovery master into a helper process. One side effect of this
should be that the recovery daemon no longer blocks if it takes a long
time to take the lock.
However, you need to take a look and see why your cluster filesystem
is taking 50s to allow a lock to be taken.
* For the 2nd of the 2 bans, it looks like the ban occurs during
shutdown:
2015/12/28 12:12:20.514600 [ 2990]: Received SHUTDOWN command.
2015/12/28 12:12:20.514681 [ 2990]: Monitoring has been stopped
2015/12/28 12:12:26.636033 [ 2990]: Freeze priority 1
2015/12/28 12:12:26.636377 [ 2990]: Freeze priority 2
2015/12/28 12:12:26.636684 [ 2990]: Freeze priority 3
2015/12/28 12:12:29.643525 [ 2990]: Freeze priority 1
2015/12/28 12:12:29.643911 [ 2990]: Freeze priority 2
2015/12/28 12:12:29.644270 [ 2990]: Freeze priority 3
2015/12/28 12:12:32.651516 [ 2990]: Freeze priority 1
2015/12/28 12:12:32.651957 [ 2990]: Freeze priority 2
2015/12/28 12:12:32.652349 [ 2990]: Freeze priority 3
2015/12/28 12:12:35.659636 [ 2990]: Freeze priority 1
2015/12/28 12:12:35.660188 [ 2990]: Freeze priority 2
2015/12/28 12:12:35.660635 [ 2990]: Freeze priority 3
2015/12/28 12:12:38.665811 [ 2990]: Banning this node for 300 seconds
2015/12/28 12:12:38.665910 [ 2990]: server/ctdb_takeover.c:3303 Released 0 public IPs
2015/12/28 12:12:38.667699 [ 2990]: This node has been banned - forcing freeze and recovery
2015/12/28 12:12:38.667775 [ 2990]: server/ctdb_takeover.c:3303 Released 0 public IPs
The most likely answer here is that the recovery daemon has been
shut down (this is logged at NOTICE level, and I assume you're using
default ERR level), so the node is no longer taking part in
elections:
2015/12/28 12:12:17.370578 [recoverd:11292]: server/ctdb_recoverd.c:3760 Current recmaster node 0 does not have CAP_RECMASTER, but we (node 1) have - force an election
2015/12/28 12:12:17.370978 [11107]: Freeze priority 3
2015/12/28 12:12:17.371226 [11107]: This node (1) is now the recovery master
2015/12/28 12:12:17.663482 [11107]: 172.16.150.8:4379: connected to 172.16.150.12:4379 - 1 connected
2015/12/28 12:12:17.665621 [11107]: CTDB_WAIT_UNTIL_RECOVERED
2015/12/28 12:12:18.665789 [11107]: CTDB_WAIT_UNTIL_RECOVERED
2015/12/28 12:12:19.665965 [11107]: CTDB_WAIT_UNTIL_RECOVERED
2015/12/28 12:12:20.374488 [recoverd:11292]: Election period ended
The main daemon is waiting for IP addresses to be released but this
is hanging when trying to talk to winbind. This means the main
daemon keeps returning the old master node, causing the sanity check
on node 1 to fail and resulting in the ban shown above.
The ban during shutdown is really of no consequence: it just takes
the node out of the cluster a little earlier than would otherwise
have happened.
The hung script during shutdown needs to be understood.
Unfortunately, the hung script debugging looks incomplete. The most
likely problem is that (for some reason) "smbcontrol winbindd
ip-dropped" is taking a long time. Alternatively, if the hung
script debugging is very incomplete then the problem could be that
stat(1)-ing the reclock file in 01.reclock is taking a long time,
which would be consistent with the other case.
So, both bans look to be reasonable - the nodes are not interacting as
expected. The first could be avoided by taking the recovery lock in a
more asynchronous manner, which I'm already working on... but it is a
major change. I don't think there is a simple patch that will fix what
you're seeing.
I think you need to look at the locking performance of your cluster
filesystem.
If you would like more analysis then please synchronise the clocks on
the 2 nodes and recreate with debug level 2/NOTICE. That should give
us better and more reliable information. :-)
peace & happiness,
martin