[CTDB] strange fork bomb bug

Mon Mar 17 22:01:23 MDT 2014

Hi Mathieu,

I need the logs from all the nodes to figure out what's going on.  I can
see that this particular node is continuously going into recovery, but there
is nothing to explain why the node is going into recovery.

Amitay.

On Fri, Mar 7, 2014 at 11:59 PM, Mathieu Parent <math.parent at gmail.com> wrote:

> Hello Amitay,
>
> See the attached complete log.
>
> 10.15.70.1 is in /etc/ctdb/nodes and not in /etc/ctdb/public_addresses.
>
>
> 2014-03-07 13:44 GMT+01:00 Amitay Isaacs <amitay at gmail.com>:
> > On Fri, Mar 7, 2014 at 10:23 PM, Mathieu Parent <math.parent at gmail.com>
> > wrote:
> >>
> >> Hello ctdb devs,
> >>
> >> We had a strange ctdb behavior recently on an 8-node cluster: each
> >> node had 192 ctdbd processes (instead of the usual 2), using 1024 or
> >> so file descriptors each (which is the default Linux limit)! Mostly
> >> :pipe and :socket. It was hard to connect via SSH then, and even the
> >> process table looked corrupted. The solution was to stop, then kill, ctdbd.
> >>
> >> It seems that when the ctdbd child is blocked, the parent creates a new
> >> one without cleaning up the old one, until hitting resource limits.
> >>
> >> This was on an old version (Debian 1.12+git20120201-4); I wonder whether
> >> that has been fixed since.
> >>
> >> I'm trying to reproduce the issue (with a hard NFS mount).
> >>
> >> Regards
> >> --
> >> Mathieu Parent
> >>
> >>
> >> NB: log.ctdb had:
> >>
> >> 2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
> >> 2014/02/28 17:20:57.489984 [ 3486]: Freeze priority 2
> >> 2014/02/28 17:20:57.490491 [ 3486]: Freeze priority 3
> >> 2014/02/28 17:20:57.772767 [ 3486]: Thawing priority 1
> >> 2014/02/28 17:20:57.772808 [ 3486]: Release freeze handler for prio 1
> >> 2014/02/28 17:20:57.772831 [ 3486]: Thawing priority 2
> >> 2014/02/28 17:20:57.772844 [ 3486]: Release freeze handler for prio 2
> >> 2014/02/28 17:20:57.772863 [ 3486]: Thawing priority 3
> >> 2014/02/28 17:20:57.772874 [ 3486]: Release freeze handler for prio 3
> >> 2014/02/28 17:21:07.955309 [ 3486]: Freeze priority 1
> >> 2014/02/28 17:21:07.956174 [ 3486]: Freeze priority 2
> >> 2014/02/28 17:21:07.956808 [ 3486]: Freeze priority 3
> >> 2014/02/28 17:21:08.222463 [ 3486]: Thawing priority 1
> >> 2014/02/28 17:21:08.222525 [ 3486]: Release freeze handler for prio 1
> >> 2014/02/28 17:21:08.222561 [ 3486]: Thawing priority 2
> >> 2014/02/28 17:21:08.222573 [ 3486]: Release freeze handler for prio 2
> >> 2014/02/28 17:21:08.222591 [ 3486]: Thawing priority 3
> >> 2014/02/28 17:21:08.222602 [ 3486]: Release freeze handler for prio 3
> >> 2014/02/28 17:21:18.406226 [ 3486]: Freeze priority 1
> >> 2014/02/28 17:21:18.407113 [ 3486]: Freeze priority 2
> >> 2014/02/28 17:21:18.407711 [ 3486]: Freeze priority 3
> >> 2014/02/28 17:21:18.675376 [ 3486]: Thawing priority 1
> >> 2014/02/28 17:21:18.675427 [ 3486]: Release freeze handler for prio 1
> >> 2014/02/28 17:21:18.675450 [ 3486]: Thawing priority 2
> >> 2014/02/28 17:21:18.675462 [ 3486]: Release freeze handler for prio 2
> >> 2014/02/28 17:21:18.675480 [ 3486]: Thawing priority 3
> >> 2014/02/28 17:21:18.675490 [ 3486]: Release freeze handler for prio 3
> >> 2014/02/28 17:21:28.858118 [ 3486]: Freeze priority 1
> >> 2014/02/28 17:21:28.859022 [ 3486]: Freeze priority 2
> >> 2014/02/28 17:21:28.859641 [ 3486]: Freeze priority 3
> >> 2014/02/28 17:21:29.121186 [ 3486]: Thawing priority 1
> >> 2014/02/28 17:21:29.121239 [ 3486]: Release freeze handler for prio 1
> >> 2014/02/28 17:21:29.121262 [ 3486]: Thawing priority 2
> >> 2014/02/28 17:21:29.121274 [ 3486]: Release freeze handler for prio 2
> >> 2014/02/28 17:21:29.121292 [ 3486]: Thawing priority 3
> >> [etc.]
> >>
> >> and:
> >> 2014/02/28 17:33:51.254462 [ 3486]: Monitoring event was cancelled
> >> 2014/02/28 17:33:51.254515 [ 3486]: server/eventscript.c:584 Sending
> >> SIGTERM to child pid:29991
> >> 2014/02/28 17:46:24.428171 [ 3486]: Monitoring event was cancelled
> >> 2014/02/28 17:46:24.428239 [ 3486]: server/eventscript.c:584 Sending
> >> SIGTERM to child pid:27755
> >> 2014/02/28 18:05:13.859866 [ 3486]: Monitoring event was cancelled
> >> 2014/02/28 18:05:13.859921 [ 3486]: server/eventscript.c:584 Sending
> >> SIGTERM to child pid:8818
> >> 2014/02/28 18:23:43.043934 [ 3486]: Monitoring event was cancelled
> >> 2014/02/28 18:23:43.043994 [ 3486]: server/eventscript.c:584 Sending
> >> SIGTERM to child pid:20176
> >> [etc.]
> >
> >
> > Do you have more complete logs?  It appears that something is causing
> > continuous recoveries.  Without additional logs it will be difficult to
> > figure out what's going on.
> >
> > Amitay.
>
>
>
> --
> Mathieu
>
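
The quoted log.ctdb excerpt shows a freeze/thaw cycle repeating roughly every
10 seconds. A minimal sketch for measuring that interval when comparing logs
from several nodes follows. It assumes only the line format visible in the
excerpt (timestamp, PID in brackets, message); the names parse_line and
freeze_intervals are hypothetical helpers for illustration, not part of ctdb.

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt quoted in this thread:
#   2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
# search() is used so that mail quoting prefixes ("> >> ") are tolerated.
LINE_RE = re.compile(
    r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \[\s*(\d+)\]: (.*)"
)


def parse_line(line):
    """Return (timestamp, pid, message), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f")
    return ts, int(m.group(2)), m.group(3).strip()


def freeze_intervals(lines):
    """Seconds between successive 'Freeze priority 1' messages."""
    starts = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and parsed[2] == "Freeze priority 1":
            starts.append(parsed[0])
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


# Sample lines copied from the excerpt above.
sample = [
    "2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1",
    "2014/02/28 17:20:57.772767 [ 3486]: Thawing priority 1",
    "2014/02/28 17:21:07.955309 [ 3486]: Freeze priority 1",
    "2014/02/28 17:21:18.406226 [ 3486]: Freeze priority 1",
]

if __name__ == "__main__":
    for seconds in freeze_intervals(sample):
        print(f"{seconds:.3f} s between freezes")
```

Running the same function over log.ctdb from each node would show whether all
nodes see the same recovery cadence; it does not by itself explain what
triggers the recoveries.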