[CTDB] strange fork bomb bug

Amitay Isaacs amitay at gmail.com
Fri Mar 7 05:44:19 MST 2014


On Fri, Mar 7, 2014 at 10:23 PM, Mathieu Parent <math.parent at gmail.com> wrote:

> Hello ctdb devs,
>
> We had a strange ctdb behavior recently on an 8-node cluster: each
> node had 192 ctdbd processes (instead of the usual 2), using 1024 or
> so file descriptors each (which is the default Linux limit)! Mostly
> :pipe and :socket. It was hard to connect via SSH then, and even the
> process table looked corrupted. The solution was to stop and then kill ctdbd.
>
> It seems that when the ctdbd child is blocked, the parent creates a new
> one without cleaning up the old one, until hitting resource limits.
>
> This was on an old version (Debian 1.12+git20120201-4), I wonder if
> that has been fixed since.
>
> I'm trying to reproduce the issue (with a hard NFS mount).
>
> Regards
> --
> Mathieu Parent
>
>
> NB : log.ctdb had:
>
> 2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
> 2014/02/28 17:20:57.489984 [ 3486]: Freeze priority 2
> 2014/02/28 17:20:57.490491 [ 3486]: Freeze priority 3
> 2014/02/28 17:20:57.772767 [ 3486]: Thawing priority 1
> 2014/02/28 17:20:57.772808 [ 3486]: Release freeze handler for prio 1
> 2014/02/28 17:20:57.772831 [ 3486]: Thawing priority 2
> 2014/02/28 17:20:57.772844 [ 3486]: Release freeze handler for prio 2
> 2014/02/28 17:20:57.772863 [ 3486]: Thawing priority 3
> 2014/02/28 17:20:57.772874 [ 3486]: Release freeze handler for prio 3
> 2014/02/28 17:21:07.955309 [ 3486]: Freeze priority 1
> 2014/02/28 17:21:07.956174 [ 3486]: Freeze priority 2
> 2014/02/28 17:21:07.956808 [ 3486]: Freeze priority 3
> 2014/02/28 17:21:08.222463 [ 3486]: Thawing priority 1
> 2014/02/28 17:21:08.222525 [ 3486]: Release freeze handler for prio 1
> 2014/02/28 17:21:08.222561 [ 3486]: Thawing priority 2
> 2014/02/28 17:21:08.222573 [ 3486]: Release freeze handler for prio 2
> 2014/02/28 17:21:08.222591 [ 3486]: Thawing priority 3
> 2014/02/28 17:21:08.222602 [ 3486]: Release freeze handler for prio 3
> 2014/02/28 17:21:18.406226 [ 3486]: Freeze priority 1
> 2014/02/28 17:21:18.407113 [ 3486]: Freeze priority 2
> 2014/02/28 17:21:18.407711 [ 3486]: Freeze priority 3
> 2014/02/28 17:21:18.675376 [ 3486]: Thawing priority 1
> 2014/02/28 17:21:18.675427 [ 3486]: Release freeze handler for prio 1
> 2014/02/28 17:21:18.675450 [ 3486]: Thawing priority 2
> 2014/02/28 17:21:18.675462 [ 3486]: Release freeze handler for prio 2
> 2014/02/28 17:21:18.675480 [ 3486]: Thawing priority 3
> 2014/02/28 17:21:18.675490 [ 3486]: Release freeze handler for prio 3
> 2014/02/28 17:21:28.858118 [ 3486]: Freeze priority 1
> 2014/02/28 17:21:28.859022 [ 3486]: Freeze priority 2
> 2014/02/28 17:21:28.859641 [ 3486]: Freeze priority 3
> 2014/02/28 17:21:29.121186 [ 3486]: Thawing priority 1
> 2014/02/28 17:21:29.121239 [ 3486]: Release freeze handler for prio 1
> 2014/02/28 17:21:29.121262 [ 3486]: Thawing priority 2
> 2014/02/28 17:21:29.121274 [ 3486]: Release freeze handler for prio 2
> 2014/02/28 17:21:29.121292 [ 3486]: Thawing priority 3
> [etc.]
>
> and:
> 2014/02/28 17:33:51.254462 [ 3486]: Monitoring event was cancelled
> 2014/02/28 17:33:51.254515 [ 3486]: server/eventscript.c:584 Sending
> SIGTERM to child pid:29991
> 2014/02/28 17:46:24.428171 [ 3486]: Monitoring event was cancelled
> 2014/02/28 17:46:24.428239 [ 3486]: server/eventscript.c:584 Sending
> SIGTERM to child pid:27755
> 2014/02/28 18:05:13.859866 [ 3486]: Monitoring event was cancelled
> 2014/02/28 18:05:13.859921 [ 3486]: server/eventscript.c:584 Sending
> SIGTERM to child pid:8818
> 2014/02/28 18:23:43.043934 [ 3486]: Monitoring event was cancelled
> 2014/02/28 18:23:43.043994 [ 3486]: server/eventscript.c:584 Sending
> SIGTERM to child pid:20176
> [etc.]
>

Do you have more complete logs?  It appears that something is causing
continuous recoveries.  Without additional logs it will be difficult to
figure out what's going on.
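
As a rough way to quantify how often the cycle repeats, here is a minimal
sketch that measures the gap between successive "Freeze priority 1" lines.
It assumes the log format shown in the quoted excerpt, and it treats each
such line as the start of one freeze/thaw cycle, which is only a proxy for
a recovery. The names recovery_intervals and SAMPLE are made up for this
illustration; SAMPLE is a subset of the quoted log lines.

```python
import re
from datetime import datetime

# Matches lines such as:
#   2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
# An optional leading "> " (mail quoting) is tolerated.
FREEZE_RE = re.compile(
    r"^(?:>\s*)?(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"\[\s*\d+\]: Freeze priority 1\s*$"
)


def recovery_intervals(lines):
    """Return the seconds between successive 'Freeze priority 1' lines."""
    stamps = []
    for line in lines:
        m = FREEZE_RE.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Subset of the log lines quoted above (assumption: same format in the full log).
SAMPLE = """\
2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
2014/02/28 17:20:57.489984 [ 3486]: Freeze priority 2
2014/02/28 17:20:57.772767 [ 3486]: Thawing priority 1
2014/02/28 17:21:07.955309 [ 3486]: Freeze priority 1
2014/02/28 17:21:18.406226 [ 3486]: Freeze priority 1
2014/02/28 17:21:28.858118 [ 3486]: Freeze priority 1
""".splitlines()

if __name__ == "__main__":
    for gap in recovery_intervals(SAMPLE):
        print(f"{gap:.3f}")
```

On the quoted excerpt the gaps come out at roughly 10.5 seconds each, which
is consistent with the freeze/thaw cycle repeating continuously. Running the
same function over the complete log.ctdb would show when the pattern started.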

Amitay.


More information about the samba-technical mailing list