Fw: Persistent crash of ctdb on AIX7

Christopher O Cowan Christopher.O.Cowan at ibm.com
Thu Aug 30 13:59:12 UTC 2018




   So, I've been trying to get things running properly on AIX (with GPFS),
   and I'm getting a persistent crash in a particular test case.  My code
   is presently compiled 32-bit, unfortunately.  (I will address the
   64-bit issues later, since they open several other cans of worms.)

   The easiest way to generate the crash is to run smbtorture nbench
   against a 3-machine cluster, through a load balancer.  I have failover
   disabled, and a rewritten 00.ctdb.script is my only event script.
   (I manually stop and start smbd for now.)

   After nbench has run for 30-60 seconds, I deliberately shut down one of
   the 3 nodes.  The 2 remaining nodes chug along for about 10 seconds,
   and then one of them panics.  Eventually, the last remaining node
   begins dropping connections.  Sometimes it is able to recover; other
   times it is left in a bad state.  The stack trace from the panicking
   node is very consistent: a SIGSEGV at line 575 of ctdb_recover.c:



      (dbx) where
      pth_signal.pthread_kill(??, ??) at 0xd0546a94
      pth_signal._p_raise(??) at 0xd0545ee4
      raise.raise(??) at 0xd0120e20
      abort.abort() at 0xd017c964
      fault.smb_panic_default(fault.smb_panic_default.why = "internal
      error"), line 160 in "fault.c"
      fault.smb_panic(fault.smb_panic.why = "internal error"), line 173 in
      "fault.c"
      fault.fault_report(fault.fault_report.sig = 11), line 84 in "fault.c"
      fault.sig_fault(fault.sig_fault.sig = 11), line 95 in "fault.c"
      unnamed block in ctdb_recover.db_push_msg_handler
      (ctdb_recover.db_push_msg_handler.srvid = 17294104044079415326,
      indata = (...), ctdb_recover.db_push_msg_handler.private_data =
      0x20026a58), line 575 in "ctdb_recover.c"
      ctdb_recover.db_push_msg_handler
      (ctdb_recover.db_push_msg_handler.srvid = 17294104044079415326,
      indata = (...), ctdb_recover.db_push_msg_handler.private_data =
      0x20026a58), line 575 in "ctdb_recover.c"
      srvid.srvid_dispatch(srvid.srvid_dispatch.srv = 0x20012598,
      srvid.srvid_dispatch.srvid = 17294104044079415326,
      srvid.srvid_dispatch.srvid_all = 18446744073709551615, data = (...)),
      line 264 in "srvid.c"
      ctdb_client.ctdb_request_message
      (ctdb_client.ctdb_request_message.ctdb = 0x20019548,
      ctdb_client.ctdb_request_message.hdr = 0x20f2cf28), line 203 in
      "ctdb_client.c"
      ctdb_daemon.daemon_request_message_from_client
      (ctdb_daemon.daemon_request_message_from_client.client = 0x20026878,
      ctdb_daemon.daemon_request_message_from_client.c = 0x20f2cf28), line
      291 in "ctdb_daemon.c"
      ctdb_daemon.daemon_incoming_packet
      (ctdb_daemon.daemon_incoming_packet.p = 0x20026878,
      ctdb_daemon.daemon_incoming_packet.hdr = 0x20f2cf28), line 855 in
      "ctdb_daemon.c"
      ctdb_daemon.ctdb_daemon_read_cb(ctdb_daemon.ctdb_daemon_read_cb.data
      = "", ctdb_daemon.ctdb_daemon_read_cb.cnt = 87376,
      ctdb_daemon.ctdb_daemon_read_cb.args = 0x20026878), line 914 in
      "ctdb_daemon.c"
      ctdb_io.queue_process(ctdb_io.queue_process.queue = 0x20084678), line
      143 in "ctdb_io.c"
      ctdb_io.queue_io_read(ctdb_io.queue_io_read.queue = 0x20084678), line
      220 in "ctdb_io.c"
      ctdb_io.queue_io_handler(ctdb_io.queue_io_handler.ev = 0x2004efd8,
      ctdb_io.queue_io_handler.fde = 0x2008f468,
      ctdb_io.queue_io_handler.flags = 1,
      ctdb_io.queue_io_handler.private_data = 0x20084678), line 290 in
      "ctdb_io.c"
      tevent_fd.tevent_common_invoke_fd_handler
      (tevent_fd.tevent_common_invoke_fd_handler.fde = 0x2008f468,
      tevent_fd.tevent_common_invoke_fd_handler.flags = 1,
      tevent_fd.tevent_common_invoke_fd_handler.removed = (nil)), line 137
      in "tevent_fd.c"
      unnamed block in tevent_poll.poll_event_loop_poll
      (tevent_poll.poll_event_loop_poll.ev = 0x2004efd8,
      tevent_poll.poll_event_loop_poll.tvalp = 0x2ff221c0), line 569 in
      "tevent_poll.c"
      tevent_poll.poll_event_loop_poll(tevent_poll.poll_event_loop_poll.ev
      = 0x2004efd8, tevent_poll.poll_event_loop_poll.tvalp = 0x2ff221c0),
      line 569 in "tevent_poll.c"
      tevent_poll.poll_event_loop_once(tevent_poll.poll_event_loop_once.ev
      = 0x2004efd8, tevent_poll.poll_event_loop_once.location =
      "../ctdb/server/ctdb_daemon.c:1394"), line 626 in "tevent_poll.c"
      tevent._tevent_loop_once(tevent._tevent_loop_once.ev = 0x2004efd8,
      tevent._tevent_loop_once.location =
      "../ctdb/server/ctdb_daemon.c:1394"), line 772 in "tevent.c"
      unnamed block in tevent.tevent_common_loop_wait
      (tevent.tevent_common_loop_wait.ev = 0x2004efd8,
      tevent.tevent_common_loop_wait.location =
      "../ctdb/server/ctdb_daemon.c:1394"), line 895 in "tevent.c"
      tevent.tevent_common_loop_wait(tevent.tevent_common_loop_wait.ev =
      0x2004efd8, tevent.tevent_common_loop_wait.location =
      "../ctdb/server/ctdb_daemon.c:1394"), line 895 in "tevent.c"
      tevent._tevent_loop_wait(tevent._tevent_loop_wait.ev = 0x2004efd8,
      tevent._tevent_loop_wait.location =
      "../ctdb/server/ctdb_daemon.c:1394"), line 914 in "tevent.c"
      ctdb_daemon.ctdb_start_daemon(ctdb_daemon.ctdb_start_daemon.ctdb =
      0x20019548, ctdb_daemon.ctdb_start_daemon.do_fork = @0x01000001),
      line 1394 in "ctdb_daemon.c"
      ctdbd.main(ctdbd.main.argc = 1, ctdbd.main.argv = 0x2ff224bc), line
      384 in "ctdbd.c"



Looking at the logs, I often see a number of complaints about records in
smbXsrv_open_global.tdb not being in the queue (from ctdb_vacuum.c), after
some successful push_db calls.

I will also sometimes see a message like this:

   queue_io_read: read error realloc failed for 385875968

That number seems far too large to be a pkt_size, to me (for what it's
worth, 385875968 is 0x17000000).
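
Locally, this is the kind of guard I'm thinking of dropping in near the
length-prefix handling, just to dump the raw bytes whenever an implausible
size shows up.  It's only a sketch with made-up names (check_pkt_size,
SANE_PKT_MAX), not the real ctdb_io.c code:

   #include <stdint.h>
   #include <stdio.h>

   /* arbitrary ceiling, only for debugging */
   #define SANE_PKT_MAX (64U * 1024 * 1024)

   /* hypothetical helper: log the raw length-prefix bytes when the
      decoded packet size looks implausible */
   static int check_pkt_size(const uint8_t *hdr, uint32_t pkt_size)
   {
           if (pkt_size > SANE_PKT_MAX) {
                   fprintf(stderr,
                           "suspicious pkt_size %u (0x%08x), "
                           "raw bytes %02x %02x %02x %02x\n",
                           pkt_size, pkt_size,
                           hdr[0], hdr[1], hdr[2], hdr[3]);
                   return -1;
           }
           return 0;
   }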

Since there was quite a bit of discussion between Swen and Amitay about
talloc pools (back in Dec 2017) and using those calls within tevent
handlers, it seems relevant.  I'm still getting my head around the tevent
library.
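
For my own notes, this is my (possibly wrong) mental model of the pool
pattern from that thread: allocations are served from the pool, and
anything that grows past it falls back to the normal heap while staying
parented to the pool context.  A minimal standalone sketch, nothing
ctdb-specific:

   #include <stdio.h>
   #include <talloc.h>

   int main(void)
   {
           TALLOC_CTX *pool = talloc_pool(NULL, 8192);  /* one 8k pool */
           char *buf = talloc_array(pool, char, 128);   /* served from the pool */

           /* growing beyond the pool spills to a regular heap allocation,
              but the object stays a child of the pool context */
           buf = talloc_realloc(pool, buf, char, 16384);

           printf("total under pool: %zu bytes\n", talloc_total_size(pool));
           talloc_free(pool);  /* releases everything hanging off the pool */
           return 0;
   }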

I'm running very current code; the tree that produced this stack trace is
at commit ffa1c949c62, which Jeremy signed off on recently.  On top of that
I have a small set of AIX patches that address what are mainly compilation
issues.

I would appreciate any advice or insight on how to isolate the culprit
here, since something upstream seems to be mangling the queue.  I suspect a
talloc steal or a destructor, but I am still crawling through the weeds.
I'm wondering whether there's anything clever I can do to dump or list the
tevent queue(s) at various checkpoints, or whether there are any talloc
tracing tricks that might help me flush this out.
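
For instance, would something as crude as this, called at a few
checkpoints, be worth anything (dump_ctx is a hypothetical helper, not
existing ctdb code)?  I realize talloc_report_full() on a busy daemon
context could be enormous; maybe just watching talloc_total_blocks() at
each checkpoint would already show whether something under the queue is
vanishing.

   #include <stdio.h>
   #include <talloc.h>

   /* hypothetical checkpoint helper: print the talloc tree under a
      context, plus totals, so changes between checkpoints stand out */
   static void dump_ctx(const char *tag, TALLOC_CTX *ctx)
   {
           fprintf(stderr, "=== talloc report at %s ===\n", tag);
           talloc_report_full(ctx, stderr);
           fprintf(stderr, "total: %zu bytes in %zu blocks\n",
                   talloc_total_size(ctx), talloc_total_blocks(ctx));
   }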

I saw that Martin dropped some patches this morning that may help; not sure
yet.  Will post here if they do.





