[Samba] Re: CTDB: some problems about disconnecting the private network of ctdb leader nodes

Martin Schwenke martin at meltin.net
Wed Nov 29 01:57:20 UTC 2023


Hi,

On Tue, 28 Nov 2023 23:18:14 +0800, "tu.qiuping" <tu.qiuping at qq.com>
wrote:

> As you said, "If the leader becomes disconnected then the cluster
> lock will need to be released (e.g. by the underlying filesystem)
> before another node can be leader.  2 nodes should never be
> leader at the same time.", so what I want to ask is: how does
> the leader node know it has a problem when the network is abnormal?
> From the log, it thinks that the other nodes are experiencing
> network anomalies, and it has taken over all virtual IPs.  Or is
> cluster management really incomplete in this situation?

In this case the cluster mutex helper on the leader node would have to
notice that it is no longer capable of holding the cluster lock and
release it.
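
For context, the contract between ctdbd and a mutex helper is quite
small: the helper takes the lock, writes a single status byte to
stdout ('0' = holding, '1' = contention, '3' = error, if I remember
the codes correctly) and then stays running only for as long as it
really holds the lock.  A rough sketch of that shape, using an
fcntl() lock instead of RADOS (illustrative only, not the real
helper):

  /* Illustrative mutex helper skeleton - the in-tree helpers
   * (ctdb_mutex_fcntl_helper, ctdb_mutex_ceph_rados_helper) follow
   * the same protocol but are rather more careful. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
          int fd;

          if (argc != 2) {
                  fprintf(stderr, "Usage: %s <lock file>\n", argv[0]);
                  exit(1);
          }

          fd = open(argv[1], O_RDWR | O_CREAT, 0600);
          if (fd == -1) {
                  (void)write(STDOUT_FILENO, "3", 1); /* error */
                  exit(1);
          }

          if (fcntl(fd, F_SETLK, &fl) != 0) {
                  (void)write(STDOUT_FILENO, "1", 1); /* contention */
                  exit(1);
          }

          /* Lock taken - tell ctdbd that we are holding it */
          (void)write(STDOUT_FILENO, "0", 1);

          /* An fcntl lock cannot silently expire, so just wait for
           * ctdbd to terminate us.  A helper whose lock *can* lapse
           * (like an expiring RADOS lock) must instead wake up,
           * renew/check the lock and exit the moment it is gone -
           * that exit is what tells ctdbd the lock was lost. */
          for (;;) {
                  pause();
          }

          return 0;
  }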

I don't see any messages about this on node host-192-168-34-164.

On host-192-168-34-165, I notice the lock duration is 4s.  I don't know
what 4096 is - is it a custom change to ctdb_mutex_ceph_rados_helper.c?
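
For reference, the stock helper takes the lock duration as a trailing
argument, something like this if I remember the usage correctly:

  ctdb_mutex_ceph_rados_helper <cluster> <user> <pool> <object> [duration secs]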

I don't know enough about Ceph RADOS to know what should happen here.
You may need to modify the helper so it logs more debugging information.
If you are able to make it better then patches are always welcome!

Also, CTDB's cluster management can definitely be improved.  In the
future, we hope to get time to split cluster management into a separate
service so it is easier to maintain and to contribute to.

> I have another question: why is there no log of a reclock renewal
> failure for such a long time after disconnecting the network of
> the leader node, given that the lock duration I set is 2 seconds?

Good question.  More debug logging in ctdb_mutex_ceph_rados_helper.c, I
think...
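
Something along these lines around the renewal path would be a start
(a sketch only, assuming the helper refreshes the lock with
rados_lock_exclusive() and LIBRADOS_LOCK_FLAG_RENEW; the lock name
and cookie below are made up for illustration):

  /* Sketch: log every renewal attempt and exit on failure, so a
   * disconnected node leaves a trace and visibly drops the lock. */
  #include <inttypes.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/time.h>
  #include <rados/librados.h>

  static void renew_lock_logged(rados_ioctx_t ioctx, const char *oid,
                                uint64_t duration_s)
  {
          struct timeval tv = { .tv_sec = duration_s, .tv_usec = 0 };
          int ret;

          ret = rados_lock_exclusive(ioctx, oid,
                                     "ctdb_reclock_mutex",
                                     "ctdb_reclock_cookie",
                                     "CTDB cluster lock", &tv,
                                     LIBRADOS_LOCK_FLAG_RENEW);
          if (ret != 0) {
                  fprintf(stderr,
                          "ctdb_mutex_ceph_rados_helper: renewal of "
                          "%s failed: %s\n", oid, strerror(-ret));
                  /* The lock may already belong to another node, so
                   * exit - ctdbd then logs "Cluster lock helper
                   * terminated" and knows the lock is gone. */
                  exit(1);
          }
          fprintf(stderr,
                  "ctdb_mutex_ceph_rados_helper: renewed %s for "
                  "%" PRIu64 "s\n", oid, duration_s);
  }

One thing to keep in mind: with the network cut, librados calls may
simply block rather than fail, unless client timeouts such as
rados_osd_op_timeout are set in ceph.conf - which could be exactly
why you see no renewal failure logged.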

> What I think is that, after the leader node's network becomes
> abnormal, it should fail to renew the reclock and keep retrying,
> instead of taking over all virtual IPs first.

I think I agree with you.  The problem here is that host-192-168-34-165
has been able to take the lock, while host-192-168-34-164 has lost the
lock but its lock helper is unaware of this or has not exited.  Since a
RADOS lock taken with a duration expires on the server side, another
node can acquire it once the old lock lapses, while the disconnected
node has no way to observe the loss until it can talk to the cluster
again.

As I said in a previous thread, your timeout settings are very
ambitious!

To understand the situation, I would suggest testing with increased
timeouts.  You really want to see evidence that the disconnected node
knows that the lock is gone.  You should see this message:

  Cluster lock helper terminated

It will, hopefully, have some messages from
ctdb_mutex_ceph_rados_helper before it.
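
For the test, something like this in ctdb.conf, with a deliberately
generous duration (path, pool, user and option spelling depend on
your installation - treat it as a sketch):

  [cluster]
      recovery lock = !/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.ctdb rbd ctdb_reclock 10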

If you can make this happen then you are starting to win.  Then, as you
decrease the timeouts, you want to ensure that you see this occur
before another node is able to take the cluster lock.  This is the only
way you can be sure you avoid the current "split brain" scenario.
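
To make that concrete (illustrative numbers, and assuming the helper
refreshes at half the lock duration): with a 10 second duration, the
lock expires in RADOS roughly 10 seconds after the last successful
renewal, so the disconnected helper has to notice the failed refresh
at about the 5 second mark and exit within the remaining 5 seconds,
before another node is able to take the lock.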

Good luck!

peace & happiness,
martin


