How to detect a client-closed connection during a write from our LDAP server?

Fri Oct 14 18:15:37 UTC 2022

On 10/14/2022 10:03 AM, Stefan Metzmacher wrote:
> Am 14.10.22 um 15:52 schrieb Tom Talpey:
>> On 10/14/2022 9:45 AM, Stefan Metzmacher wrote:
>>> Hi Tom,
>>>
>>>>> It means RCV_SHUTDOWN gets set as well as TCP_CLOSE_WAIT, but
>>>>> sk->sk_err is not changed to indicate an error.
>>>>
>>>> This is correct, because the TCP connection is in "half-closed" state.
>>>> The peer has closed, but the outgoing stream is still open. The TCP
>>>> protocol has supported this since forever.
>>>>
>>>> This is not a transitory state. The connection can remain in it
>>>> forever.
>>>> The peer is now in FIN_WAIT_2 and will send no further data. It's
>>>> waiting for our FIN, and in turn the local socket is waiting for a
>>>> close() call to do so. But pretty much any other socket operation
>>>> can still be performed.
>>>
>>> Thanks for the explanation!
>>>
>>>>> It means if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) doesn't
>>>>> hit, as we only have RCV_SHUTDOWN, and sk_stream_wait_memory returns
>>>>> -EAGAIN.
>>>>
>>>> Probably because the peer has stopped reading the socket. FIN_WAIT_2 is
>>>> a super-problematic state, because the only way to exit it is to receive
>>>> a FIN or RST, which we're evidently not sending. Most implementations
>>>> run a timer as failsafe, but it's always rather long (minutes).
>>>
>>> Yes, we need 'socket options' with TCP_KEEPCNT, TCP_KEEPIDLE,
>>> TCP_KEEPINTVL and/or TCP_USER_TIMEOUT and/or a user space timer
>>> in order to have lower timeouts.
>>
>> That won't help. The peer is there, and the connection is up.
>> The keepalive will succeed! Even if it failed, it's not prompt,
>> and reducing the KEEPINTVL is a very bad idea. Servers should not
>> be pinging their clients in any event.
>>
>> What peer is doing this? Most Windows clients will perform an
>> abortive close, but this one is doing it gracefully. The
>> server should deal with either, of course, so I'm mostly just
>> curious.
> 
> I guess the client is gone, or it waits for our FIN,ACK
> but no longer acks the data from our send queue, which we most likely
> try to send out before sending out FIN,ACK.

Technically speaking, it's waiting for the server's FIN. The TCP
layer has acked the incoming FIN, but has left the sending side
open until the server app calls close(sock).

> But I only have the information from the public mails and I haven't
> tried to reproduce it.

I think the challenge is to determine what combination of pollfd bits
comes back when the socket is in this state. If the server can detect
this, it can close the socket.
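
One candidate, as a rough sketch only: on Linux, poll() can be asked for
POLLRDHUP, which is reported once the peer has shut down its sending side,
that is, once its FIN has arrived. The helper name below is hypothetical and
this is not Samba code; the fallback #define assumes the generic Linux value
and only matters if the headers hide the macro.

```c
#define _GNU_SOURCE          /* glibc exposes POLLRDHUP only with this */
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>

#ifndef POLLRDHUP
#define POLLRDHUP 0x2000     /* assumption: generic Linux value */
#endif

/*
 * Hypothetical helper: returns 1 if the peer has shut down its sending
 * side (or the socket is hung up or in error), 0 if not, and -1 if
 * poll() itself failed.
 */
static int peer_has_closed(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLRDHUP, .revents = 0 };

    assert(fd >= 0);
    if (poll(&pfd, 1, 0) < 0)    /* timeout 0: only sample the state */
        return -1;
    return (pfd.revents & (POLLRDHUP | POLLHUP | POLLERR)) ? 1 : 0;
}
```

A server could call something like this when a send returns EAGAIN, and
close the socket if it returns 1. Whether that is the right policy for a
peer that half-closes on purpose and still wants the reply is a separate
question.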

> From https://lists.samba.org/archive/samba/2022-September/241873.html:
> > As clients we have some NetApp FAS systems running, which do the auth.
> > via LDAP. On NetApp, timeouts for LDAP are set to 3 sec by default.
> >
> > Some queries seem to need more time to answer, so the client tries to
> > close the connection, but the (samba-)server part leaves the socket
> > open in CLOSE_WAIT.
> >
> > In some such cases the corresponding process (ldap-worker) runs
> > forever(?) with 100% CPU. An strace shows the ldap-worker pushing some
> > info (the answer?) to the socket. If one lets it go, the server slows
> > down gradually while more and more connections stay in CLOSE_WAIT.

Right, that's because the server is looping: seeing the pollfd bits,
attempting to send, getting EAGAIN, and repeating, right?

We can't just bail out on EAGAIN, so it's down to figuring out the
pollfd, or calling some socket state API when looping, to detect
the half-close.

Tom.