smbd crash in a CTDB cluster

Richard Sharpe realrichardsharpe at gmail.com
Sat Aug 12 17:45:49 UTC 2017


On Sat, Aug 12, 2017 at 10:24 AM,  <anoopcs at autistici.org> wrote:
> On 2017-08-12 03:35, Richard Sharpe wrote:
>>
>> On Fri, Aug 11, 2017 at 7:28 PM,  <anoopcs at autistici.org> wrote:
>>>
>>> On 2017-08-11 22:47, Richard Sharpe wrote:
>>>>
>>>>
>>>> On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
>>>> <samba-technical at lists.samba.org> wrote:
>>>>>
>>>>>
>>>>> +samba-technical
>>>>>
>>>>> On 2017-08-11 22:44, anoopcs at autistici.org wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> In a 4-node Samba (v4.6.3) CTDB cluster (with 4 public IPs), smbd
>>>>>> crashes were seen with the following backtrace:
>>>>>>
>>>>>> Core was generated by `/usr/sbin/smbd'.
>>>>>> Program terminated with signal 6, Aborted.
>>>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>>>> (gdb) bt
>>>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>>>> #1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>>>>>> #2  0x00007f1d286d04de in dump_core () at
>>>>>> ../source3/lib/dumpcore.c:338
>>>>>> #3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>>>>>> ../source3/lib/util.c:814
>>>>>> #4  0x00007f1d2a79c95f in smb_panic (why=why@entry=0x7f1d2a7e482a
>>>>>> "internal error") at ../lib/util/fault.c:166
>>>>>> #5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>>>>>> ../lib/util/fault.c:83
>>>>>> #6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>>>>>> #7  <signal handler called>
>>>>>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>>>>>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>>>>>     at ../source3/lib/messages_ctdbd.c:278
>>>>>> #9  0x00007f1d286ccd40 in messaging_reinit
>>>>>> (msg_ctx=msg_ctx@entry=0x56508d0e3800) at
>>>>>> ../source3/lib/messages.c:415
>>>>>> #10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800,
>>>>>> ev_ctx=<optimized out>,
>>>>>>     parent_longlived=parent_longlived@entry=true,
>>>>>> comment=comment@entry=0x0) at ../source3/lib/util.c:475
>>>>>> #11 0x00007f1d286dbafa in background_job_waited
>>>>>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>>>>>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>>>>>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>>>>>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>>>>>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>>>>>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>>>>>> out>) at ../tevent_epoll.c:930
>>>>>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>>     at ../tevent_standard.c:114
>>>>>> #16 0x00007f1d270dd0cd in _tevent_loop_once
>>>>>> (ev=ev@entry=0x56508d0e2d10,
>>>>>>     location=location@entry=0x56508bde85d9
>>>>>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>>>>>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait (ev=0x56508d0e2d10,
>>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>>     at ../tevent.c:844
>>>>>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>>     at ../tevent_standard.c:145
>>>>>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>>>>>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>>>>>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>>>>>> ../source3/smbd/server.c:2038
>>>>
>>>>
>>>>
>>>> This is quite normal if the node was banned when the smbd was forked.
>>>> What does the ctdb log show? What was happening at that time?
>>>
>>>
>>>
>>> I think the logs got rotated and were subsequently cleaned up. I vaguely
>>> remember that the cluster was initially not in a healthy state due to
>>> some network issue, but I am not sure whether any nodes were in the
>>> BANNED state. I will try to dig that up and confirm your analysis.
>>>
>>> Does that mean it is a deliberate panic from smbd? I am asking because
>>> of the refactoring done in this area, which introduced
>>> talloc_get_type_abort() from 4.5 onwards.
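
For reference: talloc_get_type_abort() calls the talloc abort function
(plain abort() by default) when the pointer's talloc type name does not
match the requested type. Note, though, that frame #8 of the backtrace
above shows messaging_ctdbd_reinit() entered with backend=0x0 directly
below "<signal handler called>", so that crash looks more like a NULL
pointer dereference (SIGSEGV) caught by sig_fault() and reported as
"internal error" than a deliberate abort. A minimal, standalone
illustration of the abort behaviour (not Samba source; the struct names
here are made up):

#include <stdio.h>
#include <talloc.h>

struct foo { int x; };
struct bar { int y; };

int main(void)
{
	TALLOC_CTX *mem_ctx = talloc_new(NULL);
	struct foo *f = talloc_zero(mem_ctx, struct foo);

	/* Fine: the talloc type name matches "struct foo". */
	struct foo *ok = talloc_get_type_abort(f, struct foo);
	printf("ok = %p\n", (void *)ok);

	/* Aborts via the talloc abort function: the talloc type
	 * name of 'f' is not "struct bar". */
	struct bar *boom = talloc_get_type_abort(f, struct bar);
	printf("never reached: %p\n", (void *)boom);

	talloc_free(mem_ctx);
	return 0;
}

Build with e.g. "gcc demo.c -ltalloc".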
>>
>>
>> Samba will panic in reinit_after_fork if anything fails. There are
>> very important things going on there, including connecting to ctdb.
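
To illustrate the pattern (a sketch only, assuming the NTSTATUS-returning
reinit_after_fork() of this vintage; msg_ctx and ev_ctx stand for the
caller's contexts, and this is not a verbatim copy of
source3/smbd/server.c):

	NTSTATUS status;

	status = reinit_after_fork(msg_ctx, ev_ctx,
				   true,  /* parent_longlived */
				   NULL); /* comment */
	if (!NT_STATUS_IS_OK(status)) {
		smb_panic("reinit_after_fork() failed");
	}

The arguments match frame #10 of the backtrace above
(parent_longlived=true, comment=0x0), and frame #5 of the backtrace
below shows smbd_accept_connection() hitting exactly this panic.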
>
>
> I have seen such panics in situations where smbd is unable to connect to
> ctdb: when non-public IPs are used for client connections and the
> corresponding node is in DISCONNECTED/BANNED/STOPPED state. The following
> backtrace is one of that kind:
>
> (gdb) bt
> #0  0x00007fdbcf38d1f7 in raise () from /lib64/libc.so.6
> #1  0x00007fdbcf38e8e8 in abort () from /lib64/libc.so.6
> #2  0x00007fdbd0d134de in dump_core () at ../source3/lib/dumpcore.c:338
> #3  0x00007fdbd0d046e7 in smb_panic_s3 (why=<optimized out>) at
> ../source3/lib/util.c:814
> #4  0x00007fdbd2ddf95f in smb_panic (why=why@entry=0x55f401ef8a50
> "reinit_after_fork() failed") at ../lib/util/fault.c:166
> #5  0x000055f401ef7af8 in smbd_accept_connection (ev=0x55f4023836c0,
> fde=<optimized out>, flags=<optimized out>,
>     private_data=<optimized out>) at ../source3/smbd/server.c:1014
> #6  0x00007fdbcf725edb in epoll_event_loop (tvalp=0x7fff5b14c720,
> epoll_ev=0x55f402383940) at ../tevent_epoll.c:728
> #7  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at
> ../tevent_epoll.c:930
> #8  0x00007fdbcf7242a7 in std_event_loop_once (ev=0x55f4023836c0,
> location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
>     at ../tevent_standard.c:114
> #9  0x00007fdbcf7200cd in _tevent_loop_once (ev=ev@entry=0x55f4023836c0,
>     location=location@entry=0x55f401efb5d9 "../source3/smbd/server.c:1384")
> at ../tevent.c:721
> #10 0x00007fdbcf7202fb in tevent_common_loop_wait (ev=0x55f4023836c0,
> location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
>     at ../tevent.c:844
> #11 0x00007fdbcf724247 in std_event_loop_wait (ev=0x55f4023836c0,
> location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
>     at ../tevent_standard.c:145
> #12 0x000055f401ef2a95 in smbd_parent_loop (parent=<optimized out>,
> ev_ctx=0x55f4023836c0) at ../source3/smbd/server.c:1384
> #13 main (argc=<optimized out>, argv=<optimized out>) at
> ../source3/smbd/server.c:2038
>
> The above backtrace is different from what I have observed right now, as
> the smb_panic message says it is an "internal error" rather than
> "reinit_after_fork() failed". Moreover, the presence of
> tevent_common_loop_timer_delay() in the original backtrace, which is
> invoked when epoll_wait() times out inside epoll_event_loop() (see the
> minimal tevent sketch after the gdb output below), makes this situation
> look fishy to me. Adding some details from the original backtrace:
>
> (gdb) f 11
> #11 0x00007f1d286dbafa in background_job_waited (subreq=0x56508d0ec8e0) at
> ../source3/lib/background.c:179
>
> (gdb) p *state
> $1 = {ev = 0x56508d0e2d10, msg = 0x56508d0e3800, trigger_msgs = 0x0,
> num_trigger_msgs = 0, parent_longlived = false,
>   fn = 0x7f1d286cc420 <mess_parent_dgm_cleanup>, private_data =
> 0x56508d0e3800, wakeup_req = 0x0, pipe_fd = -1, pipe_req = 0x0}
>
> (gdb) p *req
> $2 = {async = {fn = 0x7f1d286cc2d0 <mess_parent_dgm_cleanup_done>,
> private_data = 0x56508d0e3800}, data = 0x56508d0ec710,
>   private_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0,
> state = TEVENT_REQ_INIT}, internal = {
>     private_type = 0x7f1d286f6457 "struct background_job_state",
> create_location = 0x7f1d286f63b8 "../source3/lib/background.c:61",
>     finish_location = 0x0, cancel_location = 0x0, state =
> TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x56508d0ec660,
>     defer_callback_ev = 0x0, timer = 0x0}}
>
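
To make the timer path concrete, here is a minimal, standalone tevent
sketch (nothing Samba-specific; the handler name and message are made
up). With the standard tevent backend on Linux, as in the backtrace,
the loop sits in epoll_wait(); when it times out,
tevent_common_loop_timer_delay() dispatches the armed timer, which is
where smbd runs background_job_waited():

#include <stdio.h>
#include <tevent.h>

static void timer_handler(struct tevent_context *ev,
			  struct tevent_timer *te,
			  struct timeval now,
			  void *private_data)
{
	/* In smbd, the equivalent handler forks a background job;
	 * the child then calls reinit_after_fork() and, in turn,
	 * messaging_reinit(). */
	printf("timer fired from the event loop\n");
}

int main(void)
{
	struct tevent_context *ev = tevent_context_init(NULL);
	if (ev == NULL) {
		return 1;
	}

	/* Arm a one-second timer on the event loop. */
	if (tevent_add_timer(ev, ev, tevent_timeval_current_ofs(1, 0),
			     timer_handler, NULL) == NULL) {
		return 1;
	}

	/* Blocks in epoll_wait() until the timeout, then runs the
	 * timer handler via tevent_common_loop_timer_delay(). */
	tevent_loop_once(ev);

	talloc_free(ev);
	return 0;
}

Build with e.g. "gcc demo.c -ltevent -ltalloc".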
> I suspect a possible race condition, but without more context around epoll
> and messaging initialization I cannot prove it. Of course I am not an
> expert... just sharing my findings. I don't know whether the issue is
> worth a second look. Please correct me if I have gone off track in my
> debugging.

OK, you are correct. This is different.

-- 
Regards,
Richard Sharpe
("What can dispel sorrow? Only Du Kang." --Cao Cao)


