[Samba] CTDB: testing network oscillations on the leader node results in split-brain

Martin Schwenke martin at meltin.net
Sun Jun 18 04:47:23 UTC 2023


Hi tuqp,

On Thu, 15 Jun 2023 15:48:52 +0800, "tu.qiuping" <tu.qiuping at qq.com>
wrote:

> I'm sorry that I missed the log of node 2. The log of node 2 after
> 2023-06-12T19:39:40.086620 is as follows:

No problem.  Thanks for posting it.  However, it doesn't seem to tell us
anything new...

> As you can see from the log, node 2 did not receive the broadcast from node 0.

I think the main problem is here:

>   2023-06-12T19:39:42.617032+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=2

where node 0 believes node 2 is the leader even though node 0 itself is
the leader.  This can only happen if the cluster lock is not configured
in a reliable way.  More on this below.

> At the same time, my CTDB cluster is set up as follows:
> MonitorInterval=5
> RecoveryBanPeriod=5
> KeepaliveInterval=1
> KeepaliveLimit=2
> ElectionTimeout=2

These settings are much more ambitious than I have ever seen tested
with CTDB.  You appear to be trying to achieve very fast failover:
MonitorInterval, KeepaliveInterval and KeepaliveLimit seem to be set to
very low values.  What workload requires such fast failover?
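
For comparison, here is a minimal sketch of what I remember the
documented defaults in ctdb-tunables(7) to be -- please check the man
page for your version rather than trusting my memory:

    # /etc/ctdb/ctdb.tunables -- documented defaults, for comparison
    MonitorInterval=15
    KeepaliveInterval=5
    KeepaliveLimit=5
    RecoveryBanPeriod=300
    ElectionTimeout=3

With the defaults it takes on the order of tens of seconds of lost
keepalives before a peer is declared disconnected, whereas your
settings cut that to a couple of seconds.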

In "Designing Data-Intensive Applications", Martin Kleppmann says:

  If a timeout is the only sure way of detecting a fault, then how long
  should the timeout be? There is unfortunately no simple answer.

  A long timeout means a long wait until a node is declared dead (and
  during this time, users may have to wait or see error messages). A
  short timeout detects faults faster, but carries a higher risk of
  incorrectly declaring a node dead when in fact it has only suffered a
  temporary slowdown (e.g., due to a load spike on the node or the
  network).

  Prematurely declaring a node dead is problematic: if the node is
  actually alive and in the middle of performing some action (for
  example, sending an email), and another node takes over, the action
  may end up being performed twice. We will discuss this issue in more
  detail in “Knowledge, Truth, and Lies”, and in Chapters 9 and 11.

I would say the same advice applies to short network failures.

If you haven't read this book, I encourage you to do so, especially
Part II. Distributed Data.  If nothing else, then at least the section
on Unreliable Networks.  :-)

In CTDB, banning is meant to exclude a misbehaving node from the
cluster.  If you set RecoveryBanPeriod to only 5s, a node that is
misbehaving due to overload is unlikely to have time to return to a
useful state before the ban expires.
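
If you want to put the ban period back while you experiment, it can be
checked and adjusted at runtime.  The 300 below is just the documented
default, not something tuned for your workload:

    # on any node
    ctdb getvar RecoveryBanPeriod
    ctdb setvar RecoveryBanPeriod 300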

> And the renew time of the cluster lock is 4s, and the default values
> are used for other settings.

Do you see a problem on every failover, or just sometimes?

The default setting for "leader timeout" is 5.  With the above
Keepalive* settings, I am surprised to see a leader broadcast timeout
logged before nodes notice they are disconnected.  Is "leader timeout"
definitely set to the default value of 5?
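
In case it helps to double-check: if I recall the layout of
ctdb.conf(5) correctly, "leader timeout" lives in the [cluster]
section, alongside the cluster lock.  The sketch below only shows the
shape of that section; the helper path is taken from your logs and its
arguments are placeholders, not your actual configuration:

    # /etc/ctdb/ctdb.conf (sketch only)
    [cluster]
        # seconds without a leader broadcast before an election (default 5)
        leader timeout = 5
        # cluster lock via your RADOS mutex helper; the arguments here
        # are placeholders
        cluster lock = !/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper <cluster> <user> <pool> <object>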

I definitely think that the cluster lock needs to be checked more often
than every 5s.  Apparently you are renewing it every 4s.

A leader broadcast timeout occurred at 19:39:32.604, suggesting that
node 2 is disconnected.  Node 0 can take the cluster lock at
19:39:39.229, so it seems odd that node 2 still thinks it is leader at
19:39:42.617 when node 0 receives a leader broadcast from it.

Perhaps Ceph is failing over too slowly?

Note that the lock is only released by node 2 here:

    2023-06-12T19:39:42.866980+08:00 host-192-168-40-133 ctdbd[2520003]: Lock contention during renew: -16
    2023-06-12T19:39:42.888760+08:00 host-192-168-40-133 ctdbd[2520003]: /usr/libexec/ctdb/ctdb_mutex_clove_rados_helper: Failed to drop lock on RADOS object 'ctdb_reclock' - (No such file or directory)
    2023-06-12T19:39:42.894979+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Cluster lock helper terminated
    2023-06-12T19:39:42.895168+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Start election

For things to work properly, it needs to be released faster than this.

> And, I use the following command to test node 2 for network
> oscillation.

> ==> for i in {1..1000};do ifdown bond1.906&&sleep 5s&&ifup bond1.906&&sleep 5s;done

I think continuously taking the network down and up every 5s is
probably too ambitious.  If I had a network that went up and down that
frequently then I would treat this as the normal state and attempt to
configure the cluster to take longer to fail over!  Well, honestly, I
would not use that network.
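
If you do want to keep testing this way, a much less aggressive loop,
with the down and up phases comfortably longer than all of the timeouts
involved (keepalives, "leader timeout", cluster lock renewal, Ceph
failover), would be a fairer test.  The 60s/120s values below are only
illustrative:

    # same idea as your loop, with time for the cluster to settle
    for i in {1..1000}; do
        ifdown bond1.906 && sleep 60s && ifup bond1.906 && sleep 120s
    done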

> It should be noted that interface bond1.906 contains private
> addresses and public addresses of CTDB

While this isn't recommended, I can't see it causing a problem with
your test.  However, it might be worth trying to do this test by
controlling switch ports instead.

I'm not sure there is much more I can say...

peace & happiness,
martin


