[CTDB] strange fork bomb bug

Mathieu Parent math.parent at gmail.com
Wed Mar 26 08:53:41 MDT 2014


2014-03-18 5:01 GMT+01:00 Amitay Isaacs <amitay at gmail.com>:
> Hi Mathieu,
>
> I need the logs from all the nodes to figure out what's going on.  I can see
> that this particular node is continuously going into recovery, but nothing
> explains why the node is going into recovery.
>
> Amitay.
>
> On Fri, Mar 7, 2014 at 11:59 PM, Mathieu Parent <math.parent at gmail.com>
> wrote:
>>
>> Hello Amitay,
>>
>> See the attached complete log.
>>
>> 10.15.70.1 is in /etc/ctdb/nodes and not in /etc/ctdb/public_addresses.
>>
>>
>> 2014-03-07 13:44 GMT+01:00 Amitay Isaacs <amitay at gmail.com>:
>> > On Fri, Mar 7, 2014 at 10:23 PM, Mathieu Parent <math.parent at gmail.com>
>> > wrote:
>> >>
>> >> Hello ctdb devs,
>> >>
>> >> We had a strange ctdb behavior recently on an 8-node cluster: each
>> >> node had 192 ctdbd processes (instead of the usual 2), each using
>> >> 1024 or so file descriptors (the default Linux limit)! These were
>> >> mostly :pipe and :socket. It was hard to connect via SSH, and even the
>> >> process table looked corrupted. The solution was to stop and then kill ctdbd.
>> >>
>> >> It seems that when the ctdbd child is blocked, the parent creates a new
>> >> one without cleaning up the old one, until hitting resource limits.
>> >>
>> >> This was on an old version (Debian 1.12+git20120201-4), I wonder if
>> >> that has been fixed since.
>> >>
>> >> I'm trying to reproduce the issue (with a hard NFS mount).
>> >>
>> >> Regards
>> >> --
>> >> Mathieu Parent
>> >>
>> >>
>> >> NB : log.ctdb had:
>> >>
>> >> 2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
>> >> 2014/02/28 17:20:57.489984 [ 3486]: Freeze priority 2
>> >> 2014/02/28 17:20:57.490491 [ 3486]: Freeze priority 3
>> >> 2014/02/28 17:20:57.772767 [ 3486]: Thawing priority 1
>> >> 2014/02/28 17:20:57.772808 [ 3486]: Release freeze handler for prio 1
>> >> 2014/02/28 17:20:57.772831 [ 3486]: Thawing priority 2
>> >> 2014/02/28 17:20:57.772844 [ 3486]: Release freeze handler for prio 2
>> >> 2014/02/28 17:20:57.772863 [ 3486]: Thawing priority 3
>> >> 2014/02/28 17:20:57.772874 [ 3486]: Release freeze handler for prio 3
>> >> 2014/02/28 17:21:07.955309 [ 3486]: Freeze priority 1
>> >> 2014/02/28 17:21:07.956174 [ 3486]: Freeze priority 2
>> >> 2014/02/28 17:21:07.956808 [ 3486]: Freeze priority 3
>> >> 2014/02/28 17:21:08.222463 [ 3486]: Thawing priority 1
>> >> 2014/02/28 17:21:08.222525 [ 3486]: Release freeze handler for prio 1
>> >> 2014/02/28 17:21:08.222561 [ 3486]: Thawing priority 2
>> >> 2014/02/28 17:21:08.222573 [ 3486]: Release freeze handler for prio 2
>> >> 2014/02/28 17:21:08.222591 [ 3486]: Thawing priority 3
>> >> 2014/02/28 17:21:08.222602 [ 3486]: Release freeze handler for prio 3
>> >> 2014/02/28 17:21:18.406226 [ 3486]: Freeze priority 1
>> >> 2014/02/28 17:21:18.407113 [ 3486]: Freeze priority 2
>> >> 2014/02/28 17:21:18.407711 [ 3486]: Freeze priority 3
>> >> 2014/02/28 17:21:18.675376 [ 3486]: Thawing priority 1
>> >> 2014/02/28 17:21:18.675427 [ 3486]: Release freeze handler for prio 1
>> >> 2014/02/28 17:21:18.675450 [ 3486]: Thawing priority 2
>> >> 2014/02/28 17:21:18.675462 [ 3486]: Release freeze handler for prio 2
>> >> 2014/02/28 17:21:18.675480 [ 3486]: Thawing priority 3
>> >> 2014/02/28 17:21:18.675490 [ 3486]: Release freeze handler for prio 3
>> >> 2014/02/28 17:21:28.858118 [ 3486]: Freeze priority 1
>> >> 2014/02/28 17:21:28.859022 [ 3486]: Freeze priority 2
>> >> 2014/02/28 17:21:28.859641 [ 3486]: Freeze priority 3
>> >> 2014/02/28 17:21:29.121186 [ 3486]: Thawing priority 1
>> >> 2014/02/28 17:21:29.121239 [ 3486]: Release freeze handler for prio 1
>> >> 2014/02/28 17:21:29.121262 [ 3486]: Thawing priority 2
>> >> 2014/02/28 17:21:29.121274 [ 3486]: Release freeze handler for prio 2
>> >> 2014/02/28 17:21:29.121292 [ 3486]: Thawing priority 3
>> >> [etc.]
>> >>
>> >> and:
>> >> 2014/02/28 17:33:51.254462 [ 3486]: Monitoring event was cancelled
>> >> 2014/02/28 17:33:51.254515 [ 3486]: server/eventscript.c:584 Sending
>> >> SIGTERM to child pid:29991
>> >> 2014/02/28 17:46:24.428171 [ 3486]: Monitoring event was cancelled
>> >> 2014/02/28 17:46:24.428239 [ 3486]: server/eventscript.c:584 Sending
>> >> SIGTERM to child pid:27755
>> >> 2014/02/28 18:05:13.859866 [ 3486]: Monitoring event was cancelled
>> >> 2014/02/28 18:05:13.859921 [ 3486]: server/eventscript.c:584 Sending
>> >> SIGTERM to child pid:8818
>> >> 2014/02/28 18:23:43.043934 [ 3486]: Monitoring event was cancelled
>> >> 2014/02/28 18:23:43.043994 [ 3486]: server/eventscript.c:584 Sending
>> >> SIGTERM to child pid:20176
>> >> [etc.]
>> >
>> >
>> > Do you have more complete logs?  It appears that something is causing
>> > continuous recoveries.  Without additional logs it will be difficult to
>> > figure out what's going on.

Here are the complete logs. I don't have the node8 ctdb.log.

-- 
Mathieu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ctdb-logs.tar.xz
Type: application/x-xz
Size: 2548376 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20140326/d456db71/attachment-0001.bin>
