How to detect a client-closed connection during a write from our LDAP server?

Fri Oct 14 13:20:26 UTC 2022

On 10/13/2022 11:23 AM, Stefan Metzmacher via samba-technical wrote:
> Hi Andrew,
> 
>>> Ah, we register tstream_bsd_fde_handler() with
>>> TEVENT_FD_READ|TEVENT_FD_WRITE
>>> and have this logic:
>>>
>>> static void tstream_bsd_fde_handler(struct tevent_context *ev,
>>>                                       struct tevent_fd *fde,
>>>                                       uint16_t flags,
>>>                                       void *private_data)
>>> {
>>>           struct tstream_bsd *bsds =
>>> talloc_get_type_abort(private_data,
>>>                                      struct tstream_bsd);
>>>
>>>           if (flags & TEVENT_FD_WRITE) {
>>>                   bsds->writeable_handler(bsds->writeable_private);
>>>                   return;
>>>           }
>>>           if (flags & TEVENT_FD_READ) {
>>>                   if (!bsds->readable_handler) {
>>>                           if (bsds->writeable_handler) {
>>>
>>> =============> here we have the loop
>>>
>>
>> Awesome debugging!
> 
> 
> I created https://bugzilla.samba.org/show_bug.cgi?id=15202 for the problem.
> 
> And I understand what happens in the kernel...
> 
> tcp_fin() has this:
>          struct tcp_sock *tp = tcp_sk(sk);
> 
>          inet_csk_schedule_ack(sk);
> 
>          sk->sk_shutdown |= RCV_SHUTDOWN;
>          sock_set_flag(sk, SOCK_DONE);
> 
>          switch (sk->sk_state) {
>          case TCP_SYN_RECV:
>          case TCP_ESTABLISHED:
>                  /* Move to CLOSE_WAIT */
>                  tcp_set_state(sk, TCP_CLOSE_WAIT);
>                  inet_csk_enter_pingpong_mode(sk);
>                  break;
> 
> 
> It means RCV_SHUTDOWN gets set as well as TCP_CLOSE_WAIT, but
> sk->sk_err is not changed to indicate an error.

This is correct, because the TCP connection is in "half-closed" state.
The peer has closed, but the outgoing stream is still open. The TCP
protocol has supported this since forever.

This is not a transitory state. The connection can remain in it forever.
The peer is now in FIN_WAIT_2 and will send no further data. It's
waiting for our FIN, and in turn the local socket is waiting for a
close() call to do so. But pretty much any other socket operation
can still be performed.

> tcp_sendmsg_locked has this:
> ...
>          err = -EPIPE;
>      if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
>          goto do_error;
> 
>      while (msg_data_left(msg)) {
>          int copy = 0;
> 
>          skb = tcp_write_queue_tail(sk);
>          if (skb)
>              copy = size_goal - skb->len;
> 
>          if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
>              bool first_skb;
> 
> new_segment:
>              if (!sk_stream_memory_free(sk))
>                  goto wait_for_space;
> 
> ...
> 
> wait_for_space:
>          set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>          if (copied)
>              tcp_push(sk, flags & ~MSG_MORE, mss_now,
>                   TCP_NAGLE_PUSH, size_goal);
> 
>          err = sk_stream_wait_memory(sk, &timeo);
>          if (err != 0)
>              goto do_error;
> 
> It means if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) doesn't
> hit as we only have RCV_SHUTDOWN and sk_stream_wait_memory returns -EAGAIN.

Probably because the peer has stopped reading the socket. FIN_WAIT_2 is
a super-problematic state, because the only way to exit it is to receive
a FIN or RST, which we're evidently not sending. Most implementations
run a timer as failsafe, but it's always rather long (minutes).

> And tcp_poll has this:
> 
>          if (sk->sk_shutdown & RCV_SHUTDOWN)
>                  mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;
> 
> So we'll get EPOLLIN | EPOLLRDNORM | EPOLLRDHUP triggering TEVENT_FD_READ
> and writev/sendmsg keep getting EAGAIN.
> 
> metze

I think the code needs to detect the half-close and give up. It's not
going to happen promptly any other way.

I may have missed some other message - is a fix proposed?

Tom.

> 
> 
>>>                                   bsds->writeable_handler(bsds-
>>>> writeable_private);
>>>                                   return;
>>>                           }
>>>                           TEVENT_FD_NOT_READABLE(bsds->fde);
>>>                           return;
>>>                   }
>>>                   bsds->readable_handler(bsds->readable_private);
>>>                   return;
>>>           }
>>> }
>>>
>>> We call the writeable handler (which is non-blocking) when we get
>>> TEVENT_FD_READ
>>> because we assume it will get an error if TEVENT_FD_READ was
>>> generated by
>>> EPOLLERR. I think moving TEVENT_FD_NOT_READABLE(bsds->fde); before
>>> if (bsds->writeable_handler) would make sure we only try that path
>>> once
>>> and not many times in a busy loop.
>>
>> How would we then, in this case, detect that the socket is closed?
>>
>> While we would start processing other connections, wouldn't we change
>> to a not-busy wait never closing the socket as there is still data to
>> send but never getting to read the EOF?
>>
>> How can we proceed to closing the socket?
>>
>>> And in ldapsrv_call_writev_start() we may want to use
>>> tevent_req_set_endtime() on
>>> the subreq of tstream_writev_queue_send(), so that
>>> ldapsrv_call_writev_done()
>>> will get ETIMEDOUT from tstream_writev_queue_recv() and call
>>> ldapsrv_terminate_connection().
>>
>> Do we have a good precedent for what the send timeout should be?
>>
>> I can't see any good examples elsewhere in the code sadly.
>>
>> One option is to reuse the search timeout, so that a query could at
>> most take 240sec, eg 2x 120sec, once for the search itself and once to
>> send over the network.  This would give only one knob to change (which
>> is good and bad).
>>
>> In the meantime I'm going to try and mock this up in a self-contained
>> cmocka test.
>>
>> Andrew Bartlett
> 
> 
>