smbd crash in a CTDB cluster
anoopcs at autistici.org
Sat Aug 12 02:28:22 UTC 2017
On 2017-08-11 22:47, Richard Sharpe wrote:
> On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
> <samba-technical at lists.samba.org> wrote:
>> +samba-technical
>>
>> On 2017-08-11 22:44, anoopcs at autistici.org wrote:
>>>
>>> Hi all,
>>>
>>> In a 4 node Samba(v4.6.3) CTDB cluster(with 4 public IPs), smbd
>>> crashes were seen with the following back trace:
>>>
>>> Core was generated by `/usr/sbin/smbd'.
>>> Program terminated with signal 6, Aborted.
>>> #0 0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0 0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>> #1 0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>>> #2 0x00007f1d286d04de in dump_core () at
>>> ../source3/lib/dumpcore.c:338
>>> #3 0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>>> ../source3/lib/util.c:814
>>> #4 0x00007f1d2a79c95f in smb_panic (why=why at entry=0x7f1d2a7e482a
>>> "internal error") at ../lib/util/fault.c:166
>>> #5 0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>>> ../lib/util/fault.c:83
>>> #6 sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>>> #7 <signal handler called>
>>> #8 messaging_ctdbd_reinit (msg_ctx=msg_ctx at entry=0x56508d0e3800,
>>> mem_ctx=mem_ctx at entry=0x56508d0e3800, backend=0x0)
>>> at ../source3/lib/messages_ctdbd.c:278
>>> #9 0x00007f1d286ccd40 in messaging_reinit
>>> (msg_ctx=msg_ctx at entry=0x56508d0e3800) at
>>> ../source3/lib/messages.c:415
>>> #10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800,
>>> ev_ctx=<optimized out>,
>>> parent_longlived=parent_longlived at entry=true,
>>> comment=comment at entry=0x0) at ../source3/lib/util.c:475
>>> #11 0x00007f1d286dbafa in background_job_waited
>>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>>> out>) at ../tevent_epoll.c:930
>>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>> at ../tevent_standard.c:114
>>> #16 0x00007f1d270dd0cd in _tevent_loop_once
>>> (ev=ev at entry=0x56508d0e2d10,
>>> location=location at entry=0x56508bde85d9
>>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait (ev=0x56508d0e2d10,
>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>> at ../tevent.c:844
>>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>> at ../tevent_standard.c:145
>>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>>> ../source3/smbd/server.c:2038
>
> This is quite normal if the node was banned when the smbd was forked.
> What does the ctdb log show? What was happening at that time?
I think the logs got rotated and were subsequently cleaned up over time.
I vaguely remember that the cluster was not in a healthy state initially
due to some network issue, but I am not sure whether any nodes were in
the BANNED state. I will try to dig that up and confirm your analysis.

Does that mean this is a deliberate panic from smbd? I am asking because
of the refactoring done in this area from 4.5 onwards, which introduced
talloc_get_type_abort().
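
For context, my reading (which may well be wrong) is that with backend ==
NULL the dereference of backend->private_data faults before
talloc_get_type_abort() ever runs its type check, so the fault handler turns
it into the generic smb_panic("internal error") rather than a deliberate
talloc abort. A minimal sketch of that distinction follows; the struct
definitions are hypothetical cut-down stand-ins, not the real Samba ones:

#include <stdio.h>
#include <talloc.h>

/* Hypothetical stand-ins for illustration only; the real definitions
 * live in Samba's source3/lib/messages*.c and its headers. */
struct messaging_ctdbd_context { int dummy; };
struct messaging_backend { void *private_data; };

static void reinit_sketch(struct messaging_backend *backend)
{
	/* backend->private_data is evaluated before talloc_get_type_abort()
	 * can check the talloc name: a NULL backend therefore raises SIGSEGV,
	 * which the fault handler converts into smb_panic("internal error")
	 * and the SIGABRT/core seen in the backtrace. */
	struct messaging_ctdbd_context *ctx = talloc_get_type_abort(
		backend->private_data, struct messaging_ctdbd_context);
	printf("ctx at %p\n", (void *)ctx);
}

int main(void)
{
	struct messaging_backend *backend =
		talloc_zero(NULL, struct messaging_backend);
	backend->private_data =
		talloc_zero(backend, struct messaging_ctdbd_context);

	reinit_sketch(backend); /* fine: correctly typed talloc pointer */
	/* reinit_sketch(NULL);    would crash the way the core dump shows */

	talloc_free(backend);
	return 0;
}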
>
>>> (gdb) f 8
>>> #8 messaging_ctdbd_reinit (msg_ctx=msg_ctx at entry=0x56508d0e3800,
>>> mem_ctx=mem_ctx at entry=0x56508d0e3800, backend=0x0)
>>> at ../source3/lib/messages_ctdbd.c:278
>>> 278             struct messaging_ctdbd_context *ctx = talloc_get_type_abort(
>>>
>>> (gdb) l
>>> 273
>>> 274 int messaging_ctdbd_reinit(struct messaging_context *msg_ctx,
>>> 275 TALLOC_CTX *mem_ctx,
>>> 276 struct messaging_backend *backend)
>>> 277 {
>>> 278             struct messaging_ctdbd_context *ctx = talloc_get_type_abort(
>>> 279                     backend->private_data, struct messaging_ctdbd_context);
>>> 280             int ret;
>>> 281
>>> 282             ret = messaging_ctdbd_init_internal(msg_ctx, mem_ctx, ctx, true);
>>>
>>> (gdb) p backend
>>> $1 = (struct messaging_backend *) 0x0
>>>
>>> (gdb) p *msg_ctx
>>> $1 = {id = {pid = 17264, task_id = 0, vnn = 4294967295, unique_id =
>>> 4569628117635137227}, event_ctx = 0x56508d0e2d10,
>>> callbacks = 0x56508d0fa250, new_waiters = 0x0, num_new_waiters = 0,
>>> waiters = 0x0, num_waiters = 0, msg_dgm_ref = 0x56508d0e6ac0,
>>> remote = 0x0, names_db = 0x56508d0e3cf0}
>>>
>>> Since the core files were only noticed later, it is hard to recollect
>>> the scenario that could have caused smbd to panic and dump core.
>>> Please find the corresponding logs attached to this mail (the log level
>>> is the default, so not very helpful). Is there any chance that
>>> msg_ctx->remote can be NULL in this code path? The value of vnn also
>>> looks strange.
>>>
>>> Anoop C S
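
Regarding the msg_ctx->remote question above, here is a rough sketch (the
function and struct names are hypothetical stand-ins, not the actual
messages.c code) of the kind of call site frame #9 suggests, where a ctdbd
backend that was never initialised (for example because ctdbd was
unreachable or the node was banned when this smbd forked) would reach the
reinit path as backend == NULL:

#include <errno.h>
#include <stddef.h>

/* Hypothetical trimmed-down types, for illustration only. */
struct messaging_backend { void *private_data; };
struct msg_ctx_sketch { struct messaging_backend *remote; };

/* Stand-in for messaging_ctdbd_reinit(); the real one dereferences
 * backend->private_data unconditionally. */
static int ctdbd_reinit_sketch(struct msg_ctx_sketch *msg_ctx, void *mem_ctx,
			       struct messaging_backend *backend)
{
	(void)msg_ctx;
	(void)mem_ctx;
	(void)backend;
	return 0;
}

static int reinit_sketch(struct msg_ctx_sketch *msg_ctx)
{
	if (msg_ctx->remote == NULL) {
		/* Purely illustrative guard: fail the re-init instead of
		 * passing a NULL backend down and crashing on
		 * backend->private_data. */
		return EINVAL;
	}
	return ctdbd_reinit_sketch(msg_ctx, msg_ctx, msg_ctx->remote);
}

int main(void)
{
	struct msg_ctx_sketch ctx = { .remote = NULL };
	return reinit_sketch(&ctx) == EINVAL ? 0 : 1;
}

If msg_ctx->remote really can be NULL here (say, clustering is enabled but
the ctdbd connection never came up), something along these lines, or failing
earlier in reinit_after_fork(), is roughly where I would expect a check to
have to live, but I may be misreading the code path.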