smbd crash in a CTDB cluster

anoopcs at autistici.org anoopcs at autistici.org
Sat Aug 12 17:24:13 UTC 2017


On 2017-08-12 03:35, Richard Sharpe wrote:
> On Fri, Aug 11, 2017 at 7:28 PM,  <anoopcs at autistici.org> wrote:
>> On 2017-08-11 22:47, Richard Sharpe wrote:
>>> 
>>> On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
>>> <samba-technical at lists.samba.org> wrote:
>>>> 
>>>> +samba-technical
>>>> 
>>>> On 2017-08-11 22:44, anoopcs at autistici.org wrote:
>>>>> 
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> In a 4 node Samba(v4.6.3) CTDB cluster(with 4 public IPs), smbd
>>>>> crashes were seen with the following back trace:
>>>>> 
>>>>> Core was generated by `/usr/sbin/smbd'.
>>>>> Program terminated with signal 6, Aborted.
>>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>>> (gdb) bt
>>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>>> #1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>>>>> #2  0x00007f1d286d04de in dump_core () at 
>>>>> ../source3/lib/dumpcore.c:338
>>>>> #3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>>>>> ../source3/lib/util.c:814
>>>>> #4  0x00007f1d2a79c95f in smb_panic (why=why at entry=0x7f1d2a7e482a
>>>>> "internal error") at ../lib/util/fault.c:166
>>>>> #5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>>>>> ../lib/util/fault.c:83
>>>>> #6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>>>>> #7  <signal handler called>
>>>>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx at entry=0x56508d0e3800,
>>>>> mem_ctx=mem_ctx at entry=0x56508d0e3800, backend=0x0)
>>>>>     at ../source3/lib/messages_ctdbd.c:278
>>>>> #9  0x00007f1d286ccd40 in messaging_reinit
>>>>> (msg_ctx=msg_ctx at entry=0x56508d0e3800) at
>>>>> ../source3/lib/messages.c:415
>>>>> #10 0x00007f1d286c0ec9 in reinit_after_fork 
>>>>> (msg_ctx=0x56508d0e3800,
>>>>> ev_ctx=<optimized out>,
>>>>>     parent_longlived=parent_longlived at entry=true,
>>>>> comment=comment at entry=0x0) at ../source3/lib/util.c:475
>>>>> #11 0x00007f1d286dbafa in background_job_waited
>>>>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>>>>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>>>>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>>>>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>>>>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>>>>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>>>>> out>) at ../tevent_epoll.c:930
>>>>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>     at ../tevent_standard.c:114
>>>>> #16 0x00007f1d270dd0cd in _tevent_loop_once 
>>>>> (ev=ev at entry=0x56508d0e2d10,
>>>>>     location=location at entry=0x56508bde85d9
>>>>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>>>>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait 
>>>>> (ev=0x56508d0e2d10,
>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>     at ../tevent.c:844
>>>>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>     at ../tevent_standard.c:145
>>>>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>>>>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>>>>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>>>>> ../source3/smbd/server.c:2038
>>> 
>>> 
>>> This is quite normal if the node was banned when the smbd was forked.
>>> What does the ctdb log show? What was happening at that time?
>> 
>> 
>> I think the logs got rotated and got cleaned up subsequently over 
>> time. I
>> could barely remember that the cluster was not in healthy state at 
>> initial
>> stage due to some network issue. In fact I am not sure whether 
>> node/nodes
>> were in BANNED state or not. I will try to dig that up and confirm 
>> your
>> analysis.
>> 
>> Does that mean it is a deliberate panic from smbd? I am asking this 
>> because
>> of the code re-factoring done in this area which introduces
>> talloc_get_type_abort() from 4.5 onwards.
> 
> Samba will panic in reinit_after_fork if anything fails. There are
> very important things going on there, including connecting to ctdb.

I have seen such panics in situations where smbd is unable to connect to 
ctdb when non-public IPs were used for client connection and 
corresponding node is in DISCONNECTED/BANNED/STOPPED state. Following is 
one of that kind:

(gdb) bt
#0  0x00007fdbcf38d1f7 in raise () from /lib64/libc.so.6
#1  0x00007fdbcf38e8e8 in abort () from /lib64/libc.so.6
#2  0x00007fdbd0d134de in dump_core () at ../source3/lib/dumpcore.c:338
#3  0x00007fdbd0d046e7 in smb_panic_s3 (why=<optimized out>) at 
../source3/lib/util.c:814
#4  0x00007fdbd2ddf95f in smb_panic (why=why at entry=0x55f401ef8a50 
"reinit_after_fork() failed") at ../lib/util/fault.c:166
#5  0x000055f401ef7af8 in smbd_accept_connection (ev=0x55f4023836c0, 
fde=<optimized out>, flags=<optimized out>,
     private_data=<optimized out>) at ../source3/smbd/server.c:1014
#6  0x00007fdbcf725edb in epoll_event_loop (tvalp=0x7fff5b14c720, 
epoll_ev=0x55f402383940) at ../tevent_epoll.c:728
#7  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) 
at ../tevent_epoll.c:930
#8  0x00007fdbcf7242a7 in std_event_loop_once (ev=0x55f4023836c0, 
location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
     at ../tevent_standard.c:114
#9  0x00007fdbcf7200cd in _tevent_loop_once (ev=ev at entry=0x55f4023836c0,
     location=location at entry=0x55f401efb5d9 
"../source3/smbd/server.c:1384") at ../tevent.c:721
#10 0x00007fdbcf7202fb in tevent_common_loop_wait (ev=0x55f4023836c0, 
location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
     at ../tevent.c:844
#11 0x00007fdbcf724247 in std_event_loop_wait (ev=0x55f4023836c0, 
location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
     at ../tevent_standard.c:145
#12 0x000055f401ef2a95 in smbd_parent_loop (parent=<optimized out>, 
ev_ctx=0x55f4023836c0) at ../source3/smbd/server.c:1384
#13 main (argc=<optimized out>, argv=<optimized out>) at 
../source3/smbd/server.c:2038

The above backtrace is different from what I have observed right now as 
the smb_panic message says its an "internal error" rather than 
"reinit_after_fork() failed". Moreover the presence of 
tevent_common_loop_timer_delay() in the original backtrace which is 
invoked when epoll_wait() gets timed out inside epoll_event_loop() makes 
me think this situation to be fishy. Adding some details from original 
backtrace:

(gdb) f 11
#11 0x00007f1d286dbafa in background_job_waited (subreq=0x56508d0ec8e0) 
at ../source3/lib/background.c:179

(gdb) p *state
$1 = {ev = 0x56508d0e2d10, msg = 0x56508d0e3800, trigger_msgs = 0x0, 
num_trigger_msgs = 0, parent_longlived = false,
   fn = 0x7f1d286cc420 <mess_parent_dgm_cleanup>, private_data = 
0x56508d0e3800, wakeup_req = 0x0, pipe_fd = -1, pipe_req = 0x0}

(gdb) p *req
$2 = {async = {fn = 0x7f1d286cc2d0 <mess_parent_dgm_cleanup_done>, 
private_data = 0x56508d0e3800}, data = 0x56508d0ec710,
   private_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 
0x0, state = TEVENT_REQ_INIT}, internal = {
     private_type = 0x7f1d286f6457 "struct background_job_state", 
create_location = 0x7f1d286f63b8 "../source3/lib/background.c:61",
     finish_location = 0x0, cancel_location = 0x0, state = 
TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x56508d0ec660,
     defer_callback_ev = 0x0, timer = 0x0}}

Somehow I thought of a possible race condition but without much context 
around epoll and messaging initialization I cannot prove. Of course I am 
not an expert... just sharing my findings. I don't know whether it is 
worth for a second thought on the issue. Please correct me if I 
completely diverted in my debugging process.

Anoop C S



More information about the samba-technical mailing list