[Samba] ctdb tcp kill: remaining connections
Ulrich Sibiller
ulrich.sibiller at eviden.com
Mon Jun 2 16:17:02 UTC 2025
Martin Schwenke wrote on 29.05.2025 02:40:
> In the "takeip" case the gratarp is sent by the daemon. The relevant
> code is in ctdb_announce_vnn_iface().
Ok, good to know. Need to check if I can simply log that.
> I was going to ask questions about firewalls and forwarding rules on
> routers between the server nodes and the clients. However, you can run
> "ctdb gratarp ..." and it fixes the problem, so it doesn't sound like
> the packets are being filtered somewhere. The "ctdb gratarp" command
> sends a control to the daemon to send the ARPs and, for the control, the
> daemon runs the same low-level code as in ctdb_announce_vnn_iface(),
> which is run from the callback when the "takeip" event succeeds.
>
> There are 2 possible differences:
>
> * The interface:
>
> The "takeip" callback automatically determines the interface from
> which to send the ARPs. This is simply based on the interface to
> which the IP is assigned. I doubt you would be running a manual
> command that specifies a different interface.
Correct. However, the interface is a teaming interface, consisting of two
ports.
> * Timing/routing:
>
> Not that it should matter, but are you using the 13.per_ip_routing
> event script to add source-based routing? If so, I'm wondering if
> perhaps something is going wrong there during "takeip" and is being
> fixed later in "ipreallocated".
No, we are not using that script. Should we?
We are only using these:
[root@smtcfc0248 legacy]# ctdb scriptstatus
00.ctdb OK 0.009 Mon Jun 2 18:06:50 2025
01.reclock OK 0.006 Mon Jun 2 18:06:50 2025
05.system OK 0.017 Mon Jun 2 18:06:50 2025
06.nfs OK 0.007 Mon Jun 2 18:06:50 2025
10.interface.debug OK 0.051 Mon Jun 2 18:06:50 2025
60.nfs_debug OK 0.167 Mon Jun 2 18:06:50 2025
> Is anything strange about your routing? In fact, are the clients on
> the same subnet as the server nodes? It shouldn't matter if
> everything is set up sanely.
Well, one network is a flat network (/16) with a few thousand nodes; the other (/24) involves routers on multiple levels. The clients are in different subnets than the servers.
ARP problems are probably more likely on the former network than on the latter. Both networks are completely independent and managed by different companies.
But we see the problems on both, at roughly the same frequency, which most likely rules out the network as the culprit.
(In the flat network we have increased the size of the ARP tables on the Linux machines. I must check whether we increased it on the switches, too.)
One thing both networks have in common is that the server-side interfaces are using LACP.
> One other thing I notice in the relevant (lockd client) kernel code is
> that it calls:
>
> rpc_force_rebind(clnt);
> after logging the "not responding, still trying" message. Without
> digging very deep, that looks like it should be forcing the client to
> reconnect. So, in that case too, we need to be sure the ARPs are
> making it through to the client.
Which should only be relevant for the flat network.
> It would be good if you could tcpdump on the server nodes and on a
> client to determine if the ARPs are being sent... and what is happening
> to the lockd connections before you intervene manually. You should be
> able to construct a filter that captures only relevant gratuitous ARPs
> and TCP SYN packets - if so, you could leave that running in the
> background.
Yeah, client-side this could be problematic as we do not know in advance which clients will fail...
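
A capture filter along the lines Martin describes could be sketched as follows. This is only a sketch under assumptions: "gratuitous" is taken to mean an Ethernet/IPv4 ARP packet whose sender and target protocol addresses are equal, the helper names build_capture_filter and is_gratuitous_arp are made up for illustration, and 192.0.2.10 is a placeholder for one of the public IPs.

```python
import struct


def build_capture_filter(public_ip: str) -> str:
    """Return a pcap filter for gratuitous ARPs and TCP SYNs to/from public_ip.

    Assumption: in an Ethernet/IPv4 ARP packet the sender protocol address
    is at byte offset 14 and the target protocol address at offset 24, and
    both are equal in a gratuitous ARP.
    """
    garp = "(arp and arp[14:4] = arp[24:4])"
    syn = f"(host {public_ip} and tcp[tcpflags] & tcp-syn != 0)"
    return f"{garp} or {syn}"


def is_gratuitous_arp(arp_payload: bytes) -> bool:
    """Apply the same sender == target check to a raw ARP payload (no Ethernet header)."""
    if len(arp_payload) < 28:
        return False
    htype, ptype, hlen, plen, _oper = struct.unpack("!HHBBH", arp_payload[:8])
    # Only handle Ethernet (1) / IPv4 (0x0800) with 6-byte MACs and 4-byte IPs.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return False
    sender_ip = arp_payload[14:18]
    target_ip = arp_payload[24:28]
    return sender_ip == target_ip


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder, not an address from this thread.
    print(build_capture_filter("192.0.2.10"))
```

The printed string could then be passed as the filter expression to something like "tcpdump -n -i <interface> -w <file> '<filter>'" on a server node, with the interface and output file chosen locally. The ARP part is deliberately not restricted to one IP, so on the flat /16 it may need narrowing.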
Thx,
Uli
--
Dipl.-Inf. Ulrich Sibiller science + computing ag
System Administration Hagellocher Weg 73
Hotline +49 7071 9457 681 72070 Tuebingen, Germany
https://atos.net/de/deutschland/sc