Winbindd consuming 100% CPU utilization when the DC is offline.

Hemanth Thummala hemanth.thummala at nutanix.com
Fri Jun 2 06:19:59 UTC 2017


Hello All,

We are using the Samba 4.3.11 stack. Recently we have come across an issue where winbindd processes consume 100% CPU when the DC is offline or there is a network problem reaching the DC.

We sampled a few backtraces while winbindd was in this state. Most of the time the process was in async_connect_send(), trying to connect to the DC asynchronously. Since the DC is not reachable, this connection attempt goes into a loop, causing the CPU spike.


(gdb) bt

#0  0x00007fe7ddaeea45 in fcntl () from /lib64/libpthread.so.0

#1  0x00007fe7d626f5aa in async_connect_send (mem_ctx=0x55ed74c950e0, ev=0x55ed74c972b0, fd=31, address=0x55ed74c950f0, address_len=16, before_connect=0, after_connect=0,
    private_data=0x0) at ../lib/async_req/async_sock.c:94

#2  0x00007fe7d85f2b08 in open_socket_out_connected (subreq=0x0) at ../source3/lib/util_sock.c:571

#3  0x00007fe7db531318 in _tevent_req_notify_callback (req=0x55ed74c91f50, location=0x7fe7d6270dc8 "../lib/async_req/async_sock.c:136") at ../lib/tevent/tevent_req.c:112

#4  0x00007fe7db5313ed in tevent_req_finish (req=0x55ed74c91f50, state=TEVENT_REQ_USER_ERROR, location=0x7fe7d6270dc8 "../lib/async_req/async_sock.c:136")
    at ../lib/tevent/tevent_req.c:149

#5  0x00007fe7db531512 in tevent_req_trigger (ev=0x55ed74c972b0, im=0x55ed74c92030, private_data=0x55ed74c91f50) at ../lib/tevent/tevent_req.c:206

#6  0x00007fe7db5306a4 in tevent_common_loop_immediate (ev=0x55ed74c972b0) at ../lib/tevent/tevent_immediate.c:135

#7  0x00007fe7db539146 in epoll_event_loop_once (ev=0x55ed74c972b0, location=0x7fe7db539840 "../lib/tevent/tevent_req.c:256") at ../lib/tevent/tevent_epoll.c:906

#8  0x00007fe7db535f1f in std_event_loop_once (ev=0x55ed74c972b0, location=0x7fe7db539840 "../lib/tevent/tevent_req.c:256") at ../lib/tevent/tevent_standard.c:114

#9  0x00007fe7db52f769 in _tevent_loop_once (ev=0x55ed74c972b0, location=0x7fe7db539840 "../lib/tevent/tevent_req.c:256") at ../lib/tevent/tevent.c:533

#10 0x00007fe7db5316ac in tevent_req_poll (req=0x55ed74c94f50, ev=0x55ed74c972b0) at ../lib/tevent/tevent_req.c:256

#11 0x00007fe7d85f2d22 in open_socket_out (pss=0x7ffea0c66a40, port=135, timeout=60000, pfd=0x7ffea0c66a2c) at ../source3/lib/util_sock.c:642

#12 0x00007fe7db751bdc in rpc_pipe_open_tcp_port (mem_ctx=0x55ed74c6efb0, host=0x55ed74c8be10 "DRTNASDCPROD01.testdomain.com", ss_addr=0x55ed74c7b168, port=135,
    table=0x7fe7dab69500, presult=0x7ffea0c66b88) at ../source3/rpc_client/cli_pipe.c:2669

#13 0x00007fe7db751e5d in rpc_pipe_get_tcp_port (host=0x55ed74c8be10 "DRTNASDCPROD01.testdomain.com", addr=0x55ed74c7b168, table=0x7fe7da007ae0,
    pport=0x7ffea0c66c0e) at ../source3/rpc_client/cli_pipe.c:2737

#14 0x00007fe7db7521bc in rpc_pipe_open_tcp (mem_ctx=0x0, host=0x55ed74c8be10 "DRTNASDCPROD01.testdomain.com", addr=0x55ed74c7b168, table=0x7fe7da007ae0,
    presult=0x7ffea0c66cc8) at ../source3/rpc_client/cli_pipe.c:2867

#15 0x00007fe7db752a99 in cli_rpc_pipe_open (cli=0x55ed74c91600, transport=NCACN_IP_TCP, table=0x7fe7da007ae0, presult=0x7ffea0c66cc8) at ../source3/rpc_client/cli_pipe.c:3046

#16 0x00007fe7db753340 in cli_rpc_pipe_open_schannel_with_creds (cli=0x55ed74c91600, table=0x7fe7da007ae0, transport=NCACN_IP_TCP, cli_creds=0x55ed74c8f0b0,
    netlogon_creds=0x55ed74cb3480, _rpccli=0x55ed74c548d0) at ../source3/rpc_client/cli_pipe.c:3276

#17 0x000055ed7481972d in cm_connect_netlogon_transport (domain=0x55ed74c54730, transport=NCACN_IP_TCP, cli=0x7ffea0c66e48) at ../source3/winbindd/winbindd_cm.c:3376

#18 0x000055ed74819844 in cm_connect_netlogon (domain=0x55ed74c54730, cli=0x7ffea0c66e48) at ../source3/winbindd/winbindd_cm.c:3408

#19 0x000055ed74826155 in trusted_domains (domain=0x55ed74c54730, mem_ctx=0x55ed74c5bb10, trusts=0x7ffea0c671a0) at ../source3/winbindd/winbindd_ads.c:1570

#20 0x000055ed74820de3 in trusted_domains (domain=0x55ed74c54730, mem_ctx=0x55ed74c5bb10, trusts=0x7ffea0c671a0) at ../source3/winbindd/winbindd_reconnect_ads.c:299

#21 0x000055ed748002e4 in trusted_domains (domain=0x55ed74c54730, mem_ctx=0x55ed74c5bb10, trusts=0x7ffea0c671a0) at ../source3/winbindd/winbindd_cache.c:2912

#22 0x000055ed7480ee85 in winbindd_dual_list_trusted_domains (domain=0x55ed74c54730, state=0x7ffea0c67418) at ../source3/winbindd/winbindd_misc.c:163

#23 0x000055ed74829de4 in child_process_request (child=0x55ed74c54980, state=0x7ffea0c67418) at ../source3/winbindd/winbindd_dual.c:518

#24 0x000055ed7482c998 in child_handler (ev=0x55ed74c4e8f0, fde=0x55ed74c550b0, flags=1, private_data=0x7ffea0c67410) at ../source3/winbindd/winbindd_dual.c:1411

#25 0x00007fe7db538bac in epoll_event_loop (epoll_ev=0x55ed74c5adf0, tvalp=0x7ffea0c67320) at ../lib/tevent/tevent_epoll.c:728

#26 0x00007fe7db5391f6 in epoll_event_loop_once (ev=0x55ed74c4e8f0, location=0x55ed748af400 "../source3/winbindd/winbindd_dual.c:1637") at ../lib/tevent/tevent_epoll.c:926

#27 0x00007fe7db535f1f in std_event_loop_once (ev=0x55ed74c4e8f0, location=0x55ed748af400 "../source3/winbindd/winbindd_dual.c:1637") at ../lib/tevent/tevent_standard.c:114

#28 0x00007fe7db52f769 in _tevent_loop_once (ev=0x55ed74c4e8f0, location=0x55ed748af400 "../source3/winbindd/winbindd_dual.c:1637") at ../lib/tevent/tevent.c:533

#29 0x000055ed7482d4ee in fork_domain_child (child=0x55ed74c54980) at ../source3/winbindd/winbindd_dual.c:1637

#30 0x000055ed74828d68 in wb_child_request_trigger (req=0x55ed74c70640, private_data=0x0) at ../source3/winbindd/winbindd_dual.c:178

#31 0x00007fe7db530a26 in tevent_queue_immediate_trigger (ev=0x55ed74c4e8f0, im=0x55ed74c5af20, private_data=0x55ed74c5ae80) at ../lib/tevent/tevent_queue.c:149

#32 0x00007fe7db5306a4 in tevent_common_loop_immediate (ev=0x55ed74c4e8f0) at ../lib/tevent/tevent_immediate.c:135

#33 0x00007fe7db539146 in epoll_event_loop_once (ev=0x55ed74c4e8f0, location=0x55ed74899140 "../source3/winbindd/winbindd.c:1807") at ../lib/tevent/tevent_epoll.c:906

#34 0x00007fe7db535f1f in std_event_loop_once (ev=0x55ed74c4e8f0, location=0x55ed74899140 "../source3/winbindd/winbindd.c:1807") at ../lib/tevent/tevent_standard.c:114

#35 0x00007fe7db52f769 in _tevent_loop_once (ev=0x55ed74c4e8f0, location=0x55ed74899140 "../source3/winbindd/winbindd.c:1807") at ../lib/tevent/tevent.c:533

#36 0x000055ed747f2d5f in main (argc=2, argv=0x7ffea0c68fd8) at ../source3/winbindd/winbindd.c:1807

As you can see, in this case the forked domain child process is trying to connect to the RPC pipe (port 135) on the DC with a timeout of 60 seconds.


status = open_socket_out(&addr, port, 60*1000, &fd);

..
which in turn invokes this part of the code (the call into open_socket_out_send() and the first connect attempt inside it):


req = open_socket_out_send(frame, ev, pss, port, timeout);

  state->wait_usec = 10000;               /* 10 ms, used as the retry timeout */

  state->connect_subreq = async_connect_send(
    state, state->ev, state->fd, (struct sockaddr *)&state->ss,
    state->salen, NULL, NULL, NULL);
  if ((state->connect_subreq == NULL)
      || !tevent_req_set_endtime(
        state->connect_subreq, state->ev,
        timeval_current_ofs(0, state->wait_usec))) {
    goto fail;
  }
  tevent_req_set_callback(state->connect_subreq,
        open_socket_out_connected,        /* <== callback */
        result);
...

In async_connect_send():


state->result = connect(fd, address, address_len);

  /*
   * The only errno indicating that the connect is still in
   * flight is EINPROGRESS, everything else is an error
   */
  if (errno != EINPROGRESS) {
    tevent_req_error(req, errno);
    return tevent_req_post(req, ev);
  }

  state->fde = tevent_add_fd(ev, state, fd, TEVENT_FD_WRITE,
           async_connect_connected, req);
  if (state->fde == NULL) {
    tevent_req_error(req, ENOMEM);
    return tevent_req_post(req, ev);
  }

Here connect() returns EINPROGRESS. Later we see async_connect_connected() being called, though we are not really sure why the socket was reported writable when the async connect had not yet completed.
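
For reference, the textbook non-blocking connect pattern that these tevent helpers wrap works like this: connect() returns EINPROGRESS, the kernel marks the socket writable once the TCP handshake either completes or fails, and getsockopt(SO_ERROR) then tells you which of the two happened -- nothing actually has to write to the socket. A standalone sketch of that pattern (not Samba code; the address and port are placeholders for the DC):

  #include <arpa/inet.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <netinet/in.h>
  #include <poll.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static int connect_nonblock(const struct sockaddr_in *sa, int timeout_ms)
  {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
      return -1;
    }
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(fd, (const struct sockaddr *)sa, sizeof(*sa)) == 0) {
      return fd;                    /* connected immediately */
    }
    if (errno != EINPROGRESS) {
      close(fd);                    /* hard failure */
      return -1;
    }

    /* Wait for the kernel to report the socket writable; with an
     * unreachable DC this is where we would normally sit until the
     * timeout expires. */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    if (poll(&pfd, 1, timeout_ms) <= 0) {
      close(fd);                    /* timeout or poll error */
      return -1;
    }

    /* Writable does not mean success: fetch the pending result. */
    int err = 0;
    socklen_t slen = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &slen) != 0 || err != 0) {
      fprintf(stderr, "connect failed: %s\n", strerror(err ? err : errno));
      close(fd);
      return -1;
    }
    return fd;
  }

  int main(void)
  {
    struct sockaddr_in sa = {
      .sin_family = AF_INET,
      .sin_port = htons(135),       /* RPC port, as above */
    };
    inet_pton(AF_INET, "192.0.2.1", &sa.sin_addr);  /* placeholder for the DC */

    int fd = connect_nonblock(&sa, 60 * 1000);
    if (fd >= 0) {
      close(fd);
    }
    return 0;
  }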


In async_connect_connected():



  ret = getsockopt(state->fd, SOL_SOCKET, SO_ERROR,
       &socket_error, &slen);

  if (ret != 0) {
    /*
     * According to Stevens this is the Solaris behaviour
     * in case the connection encountered an error:
     * getsockopt() fails, error is in errno
     */
    tevent_req_error(req, errno);   /* <== we end up here because the connect
                                       has not (yet) succeeded; this triggers
                                       the callback open_socket_out_connected() */
    return;
  }


..

In open_socket_out_connected():


  if (
#ifdef ETIMEDOUT
    (sys_errno == ETIMEDOUT) ||
#endif
    (sys_errno == EINPROGRESS) ||   /* <== */
    (sys_errno == EALREADY) ||      /* <== */
    (sys_errno == EAGAIN)) {
    /*
     * retry
     */

    if (state->wait_usec < 250000) {
      state->wait_usec *= 1.5;
    }

    subreq = async_connect_send(state, state->ev, state->fd,
              (struct sockaddr *)&state->ss,
              state->salen, NULL, NULL, NULL);
    if (tevent_req_nomem(subreq, req)) {
      return;
    }
    if (!tevent_req_set_endtime(
          subreq, state->ev,
          timeval_current_ofs_usec(state->wait_usec))) {
      tevent_req_nterror(req, NT_STATUS_NO_MEMORY);
      return;
    }
    state->connect_subreq = subreq;
    tevent_req_set_callback(subreq, open_socket_out_connected, req);
    return;
  }

Here we enter this loop and stay in it for a long time; we guess it continues for the duration of the request timeout (60 seconds). If there are multiple requests already queued up, we see this wait time getting extended further. We have also observed new domain child processes being forked that eventually end up in the same state.
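
To get a feel for the timing, here is a quick standalone simulation of the intended back-off schedule above (just an illustration, not Samba code): wait_usec starts at 10 ms and grows by 1.5x until it passes 250 ms, so even as intended a single 60-second open_socket_out() makes on the order of a couple of hundred connect attempts. Note also that the backtrace above reaches open_socket_out_connected() via tevent_common_loop_immediate(), i.e. the connect subrequest appears to fail immediately rather than waiting out its endtime, so the retries end up back to back -- which would explain the 100% CPU.

  #include <stdio.h>

  int main(void)
  {
    int wait_usec = 10000;          /* initial value from open_socket_out_send() */
    long long elapsed_usec = 0;
    int attempts = 0;

    while (elapsed_usec < 60LL * 1000 * 1000) {   /* 60 second request timeout */
      attempts++;
      elapsed_usec += wait_usec;    /* assume each retry really waits wait_usec */
      if (wait_usec < 250000) {
        wait_usec *= 1.5;           /* same integer arithmetic as the source */
      }
    }
    printf("intended schedule: %d attempts, interval capped at %d usec\n",
           attempts, wait_usec);
    return 0;
  }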

Looking at the connect(2) man page, connect() fails with EALREADY when the socket is non-blocking and a previous connection attempt has not yet completed.

We are not really sure why the async connect is retried in a loop when the errno is EINPROGRESS or EALREADY. Avoiding the retry when the error is EALREADY helped the situation for us (a rough sketch of what we tried is below), but we would like to check whether this is the right way of fixing the issue.
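
What we tried amounts, roughly, to dropping EALREADY from the retried errnos in open_socket_out_connected(), so that a socket which already has a connect in flight is not handed to async_connect_send() again and the request instead continues into whatever error handling follows the retry block. A sketch of that change against the excerpt above (illustrative only, not a reviewed patch):

  /* Sketch only: EALREADY removed from the retry condition so a socket
   * that already has a connect in flight is not re-issued. */
  if (
#ifdef ETIMEDOUT
    (sys_errno == ETIMEDOUT) ||
#endif
    (sys_errno == EINPROGRESS) ||
    (sys_errno == EAGAIN)) {
    /* back off and retry exactly as in the current code */
  }
  /* EALREADY (like any other errno) now falls through to the error
   * handling after the retry block instead of looping. */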

Thanks,
Hemanth.

