[Samba] ctdb tcp kill: remaining connections

Ulrich Sibiller ulrich.sibiller at atos.net
Mon Feb 13 15:06:26 UTC 2023


Hello,

we are using ctdb 4.15.5 on RHEL8 (kernel 4.18.0-372.32.1.el8_6.x86_64) to provide NFSv3 (via TCP) to RHEL7/8 clients. Whenever an IP takeover happens, most clients report something like this:

[Mon Feb 13 12:21:22 2023] nfs: server x.x.253.252 not responding, still trying
[Mon Feb 13 12:21:28 2023] nfs: server x.x.253.252 not responding, still trying
[Mon Feb 13 12:22:31 2023] nfs: server x.x.253.252 OK
[Mon Feb 13 12:22:31 2023] nfs: server x.x.253.252 OK

And/or

[Mon Feb 13 12:27:01 2023] lockd: server x.x.253.252 not responding, still trying
[Mon Feb 13 12:27:37 2023] lockd: server x.x.253.252 not responding, still trying
[Mon Feb 13 12:27:43 2023] lockd: server x.x.253.252 OK
[Mon Feb 13 12:28:46 2023] lockd: server x.x.253.252 not responding, still trying
[Mon Feb 13 12:28:50 2023] lockd: server x.x.253.252 OK
[Mon Feb 13 12:28:50 2023] lockd: server x.x.253.252 OK

(x.x.253.252 is one of 8 IPs that ctdb handles).

_Some_ of the clients fail to get their NFS mounts working again after those messages. We then have to reboot them or use a lazy umount. We are seeing this at almost every takeover.
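
For reference, the manual recovery on an affected client currently looks roughly like this (the mount point is just a placeholder, the real paths differ):

  # stuck NFSv3 mount from the ctdb-served IP; /mnt/data is a placeholder
  umount -l /mnt/data    # lazy umount detaches the hung mount
  mount /mnt/data        # remount using the existing /etc/fstab entry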

Because of that we currently cannot update or reboot nodes without affecting many users, which to some degree defeats the purpose of ctdb: it is supposed to provide seamless takeovers...

Today I (again) tried to debug these hanging clients and came across this output in the ctdb log:
...
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Sending a TCP RST to for connection x.x.253.85:917 x.x.253.252:599
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Sending a TCP RST to for connection x.x.253.72:809 x.x.253.252:599
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Sending a TCP RST to for connection x.x.253.252:2049 53.55.144.116:861
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Sending a TCP RST to for connection x.x.250.216:983 x.x.253.252:2049
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Killed 230/394 TCP connections to released IP x.x.253.252
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Remaining connections:
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface:   x.x.253.252:2049 x.x.247.218:727
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface:   x.x.253.252:599 x.x.253.156:686
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface:   x.x.253.252:2049 x.x.249.213:810
Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface:   x.x.253.252:2049 x.x.253.177:814
...

The log (which I unfortunately no longer have) showed 405 of those "Sending a TCP RST" lines in a row, which is more than the reported total of 394.

This output comes from the releaseip section of /etc/ctdb/events/legacy/10.interface, which calls kill_tcp_connections (in /etc/ctdb/functions), which in turn runs the ctdb_killtcp utility to actually kill the connections. All of this happens inside a block_ip/unblock_ip guard that temporarily sets up a firewall rule to drop all incoming packets for the IP (x.x.253.252 in this case).
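
For illustration, this is roughly how I read that code path (heavily simplified; the interface name, the exact ctdb_killtcp invocation and the helper that lists the connections are my assumptions, not copied from the scripts):

  IP=x.x.253.252
  IFACE=bond0                        # assumption: interface carrying the public IP

  # block_ip: temporary firewall rule so no new packets reach the IP
  iptables -I INPUT -d "$IP" -j DROP

  # kill_tcp_connections: feed the established connections for the IP to the
  # ctdb_killtcp helper, which captures on the interface and injects RSTs
  list_connections_for "$IP" | ctdb_killtcp "$IFACE"   # list_connections_for is a placeholder

  # unblock_ip: remove the temporary rule again
  iptables -D INPUT -d "$IP" -j DROP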

Obviously the tool is not 100% successful.

I am wondering about possible reasons for ctdb not killing all affected connections. Are there tunables regarding this behaviour, maybe some timeout that is set too low for our number of connections? Any debugging suggestions?
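
For what it's worth, this is what I plan to run on the releasing node during the next takeover to narrow it down (bond0 is just an example interface name):

  IP=x.x.253.252

  # does the kernel still see established connections involving the released IP?
  ss -tn state established src "$IP"
  ss -tn state established dst "$IP"

  # do the RSTs injected by ctdb_killtcp actually go out on the wire?
  tcpdump -ni bond0 "tcp[tcpflags] & tcp-rst != 0 and host $IP"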

Thanks,

Uli



-- 
Dipl.-Inf. Ulrich Sibiller           science + computing ag
System Administration                    Hagellocher Weg 73
Hotline +49 7071 9457 681          72070 Tuebingen, Germany
https://atos.net/de/deutschland/sc


