smbd crash in a CTDB cluster
anoopcs at autistici.org
Sat Aug 12 02:28:22 UTC 2017
On 2017-08-11 22:47, Richard Sharpe wrote:
> On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
> <samba-technical at lists.samba.org> wrote:
>> +samba-technical
>>
>> On 2017-08-11 22:44, anoopcs at autistici.org wrote:
>>>
>>> Hi all,
>>>
>>> In a 4 node Samba(v4.6.3) CTDB cluster(with 4 public IPs), smbd
>>> crashes were seen with the following back trace:
>>>
>>> Core was generated by `/usr/sbin/smbd'.
>>> Program terminated with signal 6, Aborted.
>>> #0 0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0 0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>> #1 0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>>> #2 0x00007f1d286d04de in dump_core () at
>>> ../source3/lib/dumpcore.c:338
>>> #3 0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>>> ../source3/lib/util.c:814
>>> #4 0x00007f1d2a79c95f in smb_panic (why=why at entry=0x7f1d2a7e482a
>>> "internal error") at ../lib/util/fault.c:166
>>> #5 0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>>> ../lib/util/fault.c:83
>>> #6 sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>>> #7 <signal handler called>
>>> #8 messaging_ctdbd_reinit (msg_ctx=msg_ctx at entry=0x56508d0e3800,
>>> mem_ctx=mem_ctx at entry=0x56508d0e3800, backend=0x0)
>>> at ../source3/lib/messages_ctdbd.c:278
>>> #9 0x00007f1d286ccd40 in messaging_reinit
>>> (msg_ctx=msg_ctx at entry=0x56508d0e3800) at
>>> ../source3/lib/messages.c:415
>>> #10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800,
>>> ev_ctx=<optimized out>,
>>> parent_longlived=parent_longlived at entry=true,
>>> comment=comment at entry=0x0) at ../source3/lib/util.c:475
>>> #11 0x00007f1d286dbafa in background_job_waited
>>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>>> out>) at ../tevent_epoll.c:930
>>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>> at ../tevent_standard.c:114
>>> #16 0x00007f1d270dd0cd in _tevent_loop_once
>>> (ev=ev at entry=0x56508d0e2d10,
>>> location=location at entry=0x56508bde85d9
>>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait (ev=0x56508d0e2d10,
>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>> at ../tevent.c:844
>>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>> at ../tevent_standard.c:145
>>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>>> ../source3/smbd/server.c:2038
>
> This is quite normal if the node was banned when the smbd was forked.
> What does the ctdb log show? What was happening at that time?
I think the logs got rotated and were subsequently cleaned up over time.
I vaguely remember that the cluster was not in a healthy state initially
due to some network issue, but I am not sure whether any nodes were in
the BANNED state. I will try to dig that up and confirm your analysis.

Does that mean this is a deliberate panic from smbd? I am asking because
of the refactoring done in this area from 4.5 onwards, which introduced
talloc_get_type_abort().
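
For context, my reading (which may well be wrong) is that with backend ==
NULL the dereference of backend->private_data faults before
talloc_get_type_abort() ever runs its type check, so the fault handler turns
it into the generic smb_panic("internal error") rather than a deliberate
talloc abort. A minimal sketch of that distinction follows; the struct
definitions are hypothetical cut-down stand-ins, not the real Samba ones:

#include <stdio.h>
#include <talloc.h>

/* Hypothetical stand-ins for illustration only; the real definitions
 * live in Samba's source3/lib/messages*.c and its headers. */
struct messaging_ctdbd_context { int dummy; };
struct messaging_backend { void *private_data; };

static void reinit_sketch(struct messaging_backend *backend)
{
	/* backend->private_data is evaluated before talloc_get_type_abort()
	 * can check the talloc name: a NULL backend therefore raises SIGSEGV,
	 * which the fault handler converts into smb_panic("internal error")
	 * and the SIGABRT/core seen in the backtrace. */
	struct messaging_ctdbd_context *ctx = talloc_get_type_abort(
		backend->private_data, struct messaging_ctdbd_context);
	printf("ctx at %p\n", (void *)ctx);
}

int main(void)
{
	struct messaging_backend *backend =
		talloc_zero(NULL, struct messaging_backend);
	backend->private_data =
		talloc_zero(backend, struct messaging_ctdbd_context);

	reinit_sketch(backend); /* fine: correctly typed talloc pointer */
	/* reinit_sketch(NULL);    would crash the way the core dump shows */

	talloc_free(backend);
	return 0;
}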
>
>>> (gdb) f 8
>>> #8 messaging_ctdbd_reinit (msg_ctx=msg_ctx at entry=0x56508d0e3800,
>>> mem_ctx=mem_ctx at entry=0x56508d0e3800, backend=0x0)
>>> at ../source3/lib/messages_ctdbd.c:278
>>> 278             struct messaging_ctdbd_context *ctx = talloc_get_type_abort(
>>>
>>> (gdb) l
>>> 273
>>> 274 int messaging_ctdbd_reinit(struct messaging_context *msg_ctx,
>>> 275 TALLOC_CTX *mem_ctx,
>>> 276 struct messaging_backend *backend)
>>> 277 {
>>> 278             struct messaging_ctdbd_context *ctx = talloc_get_type_abort(
>>> 279                     backend->private_data, struct messaging_ctdbd_context);
>>> 280             int ret;
>>> 281
>>> 282             ret = messaging_ctdbd_init_internal(msg_ctx, mem_ctx, ctx, true);
>>>
>>> (gdb) p backend
>>> $1 = (struct messaging_backend *) 0x0
>>>
>>> (gdb) p *msg_ctx
>>> $1 = {id = {pid = 17264, task_id = 0, vnn = 4294967295, unique_id =
>>> 4569628117635137227}, event_ctx = 0x56508d0e2d10,
>>> callbacks = 0x56508d0fa250, new_waiters = 0x0, num_new_waiters = 0,
>>> waiters = 0x0, num_waiters = 0, msg_dgm_ref = 0x56508d0e6ac0,
>>> remote = 0x0, names_db = 0x56508d0e3cf0}
>>>
>>> Since the core files were only noticed later, it is hard to recollect
>>> the scenario that could have caused smbd to panic and dump core.
>>> Please find the corresponding logs attached to this mail (the log level
>>> is the default, so not very helpful). Is there any chance that
>>> msg_ctx->remote can be NULL in this code path? The value of vnn also
>>> looks strange.
>>>
>>> Anoop C S
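
Regarding the msg_ctx->remote question above, here is a rough sketch (the
function and struct names are hypothetical stand-ins, not the actual
messages.c code) of the kind of call site frame #9 suggests, where a ctdbd
backend that was never initialised (for example because ctdbd was
unreachable or the node was banned when this smbd forked) would reach the
reinit path as backend == NULL:

#include <errno.h>
#include <stddef.h>

/* Hypothetical trimmed-down types, for illustration only. */
struct messaging_backend { void *private_data; };
struct msg_ctx_sketch { struct messaging_backend *remote; };

/* Stand-in for messaging_ctdbd_reinit(); the real one dereferences
 * backend->private_data unconditionally. */
static int ctdbd_reinit_sketch(struct msg_ctx_sketch *msg_ctx, void *mem_ctx,
			       struct messaging_backend *backend)
{
	(void)msg_ctx;
	(void)mem_ctx;
	(void)backend;
	return 0;
}

static int reinit_sketch(struct msg_ctx_sketch *msg_ctx)
{
	if (msg_ctx->remote == NULL) {
		/* Purely illustrative guard: fail the re-init instead of
		 * passing a NULL backend down and crashing on
		 * backend->private_data. */
		return EINVAL;
	}
	return ctdbd_reinit_sketch(msg_ctx, msg_ctx, msg_ctx->remote);
}

int main(void)
{
	struct msg_ctx_sketch ctx = { .remote = NULL };
	return reinit_sketch(&ctx) == EINVAL ? 0 : 1;
}

If msg_ctx->remote really can be NULL here (say, clustering is enabled but
the ctdbd connection never came up), something along these lines, or failing
earlier in reinit_after_fork(), is roughly where I would expect a check to
have to live, but I may be misreading the code path.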