smbd crash in a CTDB cluster

Richard Sharpe realrichardsharpe at gmail.com
Fri Aug 11 22:47:52 UTC 2017


On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
<samba-technical at lists.samba.org> wrote:
> +samba-technical
>
> On 2017-08-11 22:44, anoopcs at autistici.org wrote:
>>
>> Hi all,
>>
>> In a 4-node Samba (v4.6.3) CTDB cluster (with 4 public IPs), smbd
>> crashes were seen with the following backtrace:
>>
>> Core was generated by `/usr/sbin/smbd'.
>> Program terminated with signal 6, Aborted.
>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>> #1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>> #2  0x00007f1d286d04de in dump_core () at ../source3/lib/dumpcore.c:338
>> #3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>> ../source3/lib/util.c:814
>> #4  0x00007f1d2a79c95f in smb_panic (why=why@entry=0x7f1d2a7e482a
>> "internal error") at ../lib/util/fault.c:166
>> #5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>> ../lib/util/fault.c:83
>> #6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>> #7  <signal handler called>
>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>     at ../source3/lib/messages_ctdbd.c:278
>> #9  0x00007f1d286ccd40 in messaging_reinit
>> (msg_ctx=msg_ctx@entry=0x56508d0e3800) at
>> ../source3/lib/messages.c:415
>> #10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800,
>> ev_ctx=<optimized out>,
>>     parent_longlived=parent_longlived@entry=true,
>> comment=comment@entry=0x0) at ../source3/lib/util.c:475
>> #11 0x00007f1d286dbafa in background_job_waited
>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>> out>) at ../tevent_epoll.c:930
>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>     at ../tevent_standard.c:114
>> #16 0x00007f1d270dd0cd in _tevent_loop_once (ev=ev@entry=0x56508d0e2d10,
>>     location=location@entry=0x56508bde85d9
>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait (ev=0x56508d0e2d10,
>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>     at ../tevent.c:844
>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>     at ../tevent_standard.c:145
>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>> ../source3/smbd/server.c:2038

This is quite normal if the node was banned when the smbd was forked.
What does the ctdb log show? What was happening at that time?

>> (gdb) f 8
>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>     at ../source3/lib/messages_ctdbd.c:278
>> 278             struct messaging_ctdbd_context *ctx =
>> talloc_get_type_abort(
>>
>> (gdb) l
>> 273
>> 274     int messaging_ctdbd_reinit(struct messaging_context *msg_ctx,
>> 275                                TALLOC_CTX *mem_ctx,
>> 276                                struct messaging_backend *backend)
>> 277     {
>> 278             struct messaging_ctdbd_context *ctx =
>> talloc_get_type_abort(
>> 279                     backend->private_data, struct
>> messaging_ctdbd_context);
>> 280             int ret;
>> 281
>> 282             ret = messaging_ctdbd_init_internal(msg_ctx, mem_ctx, ctx,
>> true);
>>
>> (gdb) p backend
>> $1 = (struct messaging_backend *) 0x0
>>
>> (gdb) p *msg_ctx
>> $1 = {id = {pid = 17264, task_id = 0, vnn = 4294967295, unique_id =
>> 4569628117635137227}, event_ctx = 0x56508d0e2d10,
>>   callbacks = 0x56508d0fa250, new_waiters = 0x0, num_new_waiters = 0,
>> waiters = 0x0, num_waiters = 0, msg_dgm_ref = 0x56508d0e6ac0,
>>   remote = 0x0, names_db = 0x56508d0e3cf0}
>>
>> Since the core files were only noticed later, it is hard to recall
>> the exact scenario that caused smbd to panic and dump core. Please
>> find the corresponding logs attached to this mail (the log level is
>> the default, so they are not very helpful). Is there any way
>> msg_ctx->remote can be NULL in this code path? The value of vnn
>> also looks strange...
>>
>> Anoop C S



-- 
Regards,
Richard Sharpe
(What can dispel my sorrow? Only Du Kang's wine. -- Cao Cao)


