[CTDB] strange fork bomb bug

Fri Mar 7 05:44:19 MST 2014

On Fri, Mar 7, 2014 at 10:23 PM, Mathieu Parent <math.parent at gmail.com> wrote:

> Hello ctdb devs,
>
> We saw strange ctdb behavior recently on an 8-node cluster: each
> node had 192 ctdbd processes (instead of the usual 2), using 1024 or
> so file descriptors each (which is the default Linux limit)! Mostly
> :pipe and :socket. It was hard to connect via SSH then, and even the
> process table looked corrupted. The solution was to stop and then kill
> ctdbd.
>
> It seems that when the ctdbd child is blocked, the parent creates a new
> one without cleaning up the older one, until hitting resource limits.
>
> This was on an old version (Debian 1.12+git20120201-4); I wonder
> whether it has been fixed since.
>
> I'm trying to reproduce the issue (with a hard NFS mount).
>
> Regards
> --
> Mathieu Parent
>
>
> NB: log.ctdb had:
>
> 2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
> 2014/02/28 17:20:57.489984 [ 3486]: Freeze priority 2
> 2014/02/28 17:20:57.490491 [ 3486]: Freeze priority 3
> 2014/02/28 17:20:57.772767 [ 3486]: Thawing priority 1
> 2014/02/28 17:20:57.772808 [ 3486]: Release freeze handler for prio 1
> 2014/02/28 17:20:57.772831 [ 3486]: Thawing priority 2
> 2014/02/28 17:20:57.772844 [ 3486]: Release freeze handler for prio 2
> 2014/02/28 17:20:57.772863 [ 3486]: Thawing priority 3
> 2014/02/28 17:20:57.772874 [ 3486]: Release freeze handler for prio 3
> 2014/02/28 17:21:07.955309 [ 3486]: Freeze priority 1
> 2014/02/28 17:21:07.956174 [ 3486]: Freeze priority 2
> 2014/02/28 17:21:07.956808 [ 3486]: Freeze priority 3
> 2014/02/28 17:21:08.222463 [ 3486]: Thawing priority 1
> 2014/02/28 17:21:08.222525 [ 3486]: Release freeze handler for prio 1
> 2014/02/28 17:21:08.222561 [ 3486]: Thawing priority 2
> 2014/02/28 17:21:08.222573 [ 3486]: Release freeze handler for prio 2
> 2014/02/28 17:21:08.222591 [ 3486]: Thawing priority 3
> 2014/02/28 17:21:08.222602 [ 3486]: Release freeze handler for prio 3
> 2014/02/28 17:21:18.406226 [ 3486]: Freeze priority 1
> 2014/02/28 17:21:18.407113 [ 3486]: Freeze priority 2
> 2014/02/28 17:21:18.407711 [ 3486]: Freeze priority 3
> 2014/02/28 17:21:18.675376 [ 3486]: Thawing priority 1
> 2014/02/28 17:21:18.675427 [ 3486]: Release freeze handler for prio 1
> 2014/02/28 17:21:18.675450 [ 3486]: Thawing priority 2
> 2014/02/28 17:21:18.675462 [ 3486]: Release freeze handler for prio 2
> 2014/02/28 17:21:18.675480 [ 3486]: Thawing priority 3
> 2014/02/28 17:21:18.675490 [ 3486]: Release freeze handler for prio 3
> 2014/02/28 17:21:28.858118 [ 3486]: Freeze priority 1
> 2014/02/28 17:21:28.859022 [ 3486]: Freeze priority 2
> 2014/02/28 17:21:28.859641 [ 3486]: Freeze priority 3
> 2014/02/28 17:21:29.121186 [ 3486]: Thawing priority 1
> 2014/02/28 17:21:29.121239 [ 3486]: Release freeze handler for prio 1
> 2014/02/28 17:21:29.121262 [ 3486]: Thawing priority 2
> 2014/02/28 17:21:29.121274 [ 3486]: Release freeze handler for prio 2
> 2014/02/28 17:21:29.121292 [ 3486]: Thawing priority 3
> [etc.]
>
> and:
> 2014/02/28 17:33:51.254462 [ 3486]: Monitoring event was cancelled
> 2014/02/28 17:33:51.254515 [ 3486]: server/eventscript.c:584 Sending SIGTERM to child pid:29991
> 2014/02/28 17:46:24.428171 [ 3486]: Monitoring event was cancelled
> 2014/02/28 17:46:24.428239 [ 3486]: server/eventscript.c:584 Sending SIGTERM to child pid:27755
> 2014/02/28 18:05:13.859866 [ 3486]: Monitoring event was cancelled
> 2014/02/28 18:05:13.859921 [ 3486]: server/eventscript.c:584 Sending SIGTERM to child pid:8818
> 2014/02/28 18:23:43.043934 [ 3486]: Monitoring event was cancelled
> 2014/02/28 18:23:43.043994 [ 3486]: server/eventscript.c:584 Sending SIGTERM to child pid:20176
> [etc.]
>

Do you have more complete logs?  It appears that something is causing
continuous recoveries.  Without additional logs it will be difficult to
figure out what's going on.
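
When going through the full log, one quick way to see how often recoveries
are happening is to measure the gap between successive "Freeze priority 1"
lines. The following is only a rough sketch: it assumes the log line format
shown in the excerpt above and treats each "Freeze priority 1" line as the
start of one recovery. The function name freeze_intervals is made up for
illustration and is not part of ctdb.

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt above:
#   2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
# An optional leading "> " is tolerated so quoted mail text also parses.
FREEZE_RE = re.compile(
    r"^(?:>\s*)?(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[\s*\d+\]: Freeze priority 1\s*$"
)


def freeze_intervals(lines):
    """Return the seconds between consecutive 'Freeze priority 1' lines."""
    stamps = []
    for line in lines:
        match = FREEZE_RE.match(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f")
            )
    # Pair each timestamp with the next one and take the difference.
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Passing an open log.ctdb file object to freeze_intervals() returns one gap
per pair of successive freezes. For the excerpt above the gaps come out at
roughly 10.5 seconds each.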

Amitay.