[Samba] Odd Samba 4 ("4.2.0pre1-GIT-b505111"; actually only using client) behaviour #2 - "accept: Software caused connection abort".

Tris Mabbs TM-Samba201302 at Firstgrade.Co.UK
Thu Aug 29 03:10:38 MDT 2013


Hiya Andrew,

Many thanks for the typically helpful and comprehensive reply :-)

> I think that's probably the right track :-)
>
> The code here is triggered when poll() indicates that the socket is readable.
> This socket should only be readable when a new connection is being made, and accept() should succeed.
> ...
> So, my only conclusion is that your box momentarily does not have the resources to accept the connection,
> and because there isn't the sleep() in the source3 code, it prints this in a loop until the resources become available.

Absolutely, and on any normal Unix implementation I'd agree entirely.  That
sort of "poll()"/"accept()"/... code is perfectly normal and exactly what
you'd expect - I've written plenty of very similar code myself over the
years ...
However, this is "Solaris" :-(

Caught in the act:

...
16327:     pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x00000000)  = 1
16327:          fd=39 ev=POLLIN|POLLHUP rev=0
16327:          fd=38 ev=POLLIN|POLLHUP rev=0
16327:          fd=34 ev=POLLIN|POLLHUP rev=0
16327:          fd=36 ev=POLLIN|POLLHUP rev=0
16327:          fd=37 ev=POLLIN|POLLHUP rev=POLLIN
16327:          fd=35 ev=POLLIN|POLLHUP rev=0
16327:          fd=33 ev=POLLIN|POLLHUP rev=0
16327:          fd=6  ev=POLLIN|POLLHUP rev=0
16327:          timeout: 59.999000000 sec
16327:     accept(37, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) = 41
16327:          AF_INET  name = X.X.X.X  port = 28986
16327:     forkx(0)                        = 26942
16327:     lwp_sigmask(SIG_SETMASK, 0x00011080, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
16327:     close(41)                       = 0
16327:     pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x00000000)  = 1
16327:          fd=39 ev=POLLIN|POLLHUP rev=0
16327:          fd=38 ev=POLLIN|POLLHUP rev=0
16327:          fd=34 ev=POLLIN|POLLHUP rev=0
16327:          fd=36 ev=POLLIN|POLLHUP rev=0
16327:          fd=35 ev=POLLIN|POLLHUP rev=POLLIN
16327:          fd=33 ev=POLLIN|POLLHUP rev=0
16327:          fd=6  ev=POLLIN|POLLHUP rev=0
16327:          fd=37 ev=POLLIN|POLLHUP rev=0
16327:          timeout: 44.696000000 sec
16327:     accept(35, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) Err#130 ECONNABORTED
...

So there's nothing odd about the "poll()" result.  Typically Solaris will flag
POLLERR in "revents" if it's out of resources, and POLLHUP if the remote end
closed the connection before it was fully established (remote NAKed, or
ignored, the connection SYN; terminally low on resources at t'other end of
the socket; ...).  Neither is happening here, which suggests that connection
establishment is proceeding as normal.
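To make that reading concrete, here is a minimal C sketch of how a listening socket's poll() result could be decoded under the interpretation above. The names `classify_listener` and `enum listener_state` are hypothetical, and treating POLLERR as local resource exhaustion and POLLHUP as an abandoned handshake is the Solaris behaviour described in this message, assumed here rather than taken from any specification.

```c
#include <poll.h>

/* Hypothetical states for a listening socket after poll() returns. */
enum listener_state {
    LISTENER_IDLE,      /* no events reported */
    LISTENER_READY,     /* POLLIN only: accept() is expected to succeed */
    LISTENER_ERROR,     /* POLLERR: assumed to mean a local resource problem */
    LISTENER_PEER_GONE  /* POLLHUP: assumed to mean the peer gave up early */
};

/*
 * Decode revents for a listener, following the (assumed) Solaris
 * semantics described above. Error conditions take priority over
 * POLLIN, so a socket flagged POLLIN|POLLERR is reported as an error.
 */
static enum listener_state classify_listener(const struct pollfd *pfd)
{
    if (pfd->revents & POLLERR)
        return LISTENER_ERROR;
    if (pfd->revents & POLLHUP)
        return LISTENER_PEER_GONE;
    if (pfd->revents & POLLIN)
        return LISTENER_READY;
    return LISTENER_IDLE;
}
```

Fed the second pollsys() result from the trace (fd=35, rev=POLLIN), this returns LISTENER_READY, which is exactly why the ECONNABORTED from the following accept() looks so surprising.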

The server darn' well shouldn't be out of any resources either.  In terms of
physical resources, at the point that occurred the CPUs were 99.9% idle,
there was 15GB of free RAM (so not out of kernel memory then ...) and only
about 400 sockets in total (TCP, Unix, ...) were in use across the entire
system, as reported by "netstat -na | wc -l" - well below peak levels seen
on this system.

So it's going to be that hypothetical Solaris specific
SO_DONT_RANDOMLY_ABORT_CONNECTIONS socket() option, isn't it :-)

So could I request please, that in the source3 code, either:
	a. The same "sleep()" is added as in the source4 code; -and/or-
	b. If errno == ECONNABORTED, then only log the error if the debug level is (substantially?) higher than zero.

I think it's probably safe to assume that ECONNABORTED is generally
ignorable; for whatever reason, Solaris seems to return this at the drop of
a metaphorical hat (and ignoring it on other OSes isn't going to be a problem
either).  Maybe the same goes for EAGAIN (and possibly EWOULDBLOCK), as other
"Ignore this unless the user REALLY wants a lot of debug output" type
"errors"?
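The two options requested above could be sketched as a small policy function in C. Everything here is hypothetical: `accept_failure_policy`, `struct accept_failure_action` and `IGNORABLE_LOG_LEVEL` are illustrative names, and the value 10 merely stands in for "substantially higher than zero". None of it is existing source3 code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for "substantially higher than zero". */
#define IGNORABLE_LOG_LEVEL 10

struct accept_failure_action {
    int  min_debug_level; /* log only if the configured level is >= this */
    bool throttle;        /* sleep(1) before returning, as source4 does */
};

/*
 * Decide how a failed accept() should be reported. By default the error
 * is logged at level 0 and the loop is throttled; errors considered
 * ignorable are logged only at a high debug level and not throttled.
 */
static struct accept_failure_action accept_failure_policy(int err)
{
    struct accept_failure_action act = { 0, true };

    switch (err) {
    case ECONNABORTED:   /* connection torn down before we accepted it */
    case EAGAIN:         /* nothing left to accept on a non-blocking listener */
#if defined(EWOULDBLOCK) && EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:    /* distinct from EAGAIN only on some platforms */
#endif
        act.min_debug_level = IGNORABLE_LOG_LEVEL;
        act.throttle = false;
        break;
    default:
        break;
    }
    return act;
}
```

A caller could pass `min_debug_level` as the level argument of a DEBUG() call like the one in Andrew's source4 excerpt, and call sleep(1) only when `throttle` is set. That keeps bursts of ECONNABORTED silent at the default log level while still throttling on genuine resource exhaustion such as EMFILE.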

This would also seem to be common practice - a quick Google for "accept()
ignore ECONNABORTED" comes back with a lot of results, mainly showing other
open source code having been modified specifically to ignore ECONNABORTED.

Cheers!

Tris.

-----Original Message-----
From: Andrew Bartlett [mailto:abartlet at samba.org] 
Sent: 29 August 2013 00:41
To: Tris Mabbs
Cc: samba at lists.samba.org; samba-technical at samba.org
Subject: Re: [Samba] Odd Samba 4 ("4.2.0pre1-GIT-b505111"; actually only using client) behaviour #2 - "accept: Software caused connection abort".

On Sun, 2013-08-25 at 18:50 +0100, Tris Mabbs wrote:
>                 Probably should have posted this to "samba-technical" 
> in the first place, so re-posting in case anyone has any useful ideas ...
> 
>  
> 
> From: Tris Mabbs
> 
> Sent: 12 August 2013 23:08
> To: 'samba at lists.samba.org'
> Subject: Odd Samba 4 ("4.2.0pre1-GIT-b505111"; actually only using 
> client) behaviour #2 - "accept: Software caused connection abort".
> 
>  
> 
>                 Good day, oh technical ones ...
> 
>  
> 
>                 I was running Samba 4 (client only, not using it as a 
> DC so effectively running Samba 3 code from the Samba 4 tree) and, 
> other than a little "Gotcha!" regarding decoding Kerberos PACs, it was 
> all working perfectly.
> 
>                 Then recently I had to upgrade, to "4.2.0pre1-GIT-b505111"
> (I had to upgrade the OS on the server running Samba - 'twas "OpenSolaris"
> and is now "Solaris 11.1") so I recompiled it all up and installed 
> afresh (so no ".tdb"s from the previous installation or anything).
> 
>  
> 
>                 But here's a funny thing (#2).  The log file gets 
> absolutely ridiculous numbers of messages thus:
> 
>  
> 
> Aug 12 22:45:01 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12 22:45:01.731562,  0] ../source3/smbd/server.c:556(smbd_accept_connection)
> Aug 12 22:45:01 Gateway smbd[16327]: [ID 702911 daemon.error]   accept: Software caused connection abort
> Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12 22:45:03.556423,  0] ../source3/smbd/server.c:556(smbd_accept_connection)
> Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error]   accept: Software caused connection abort
> Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12 22:45:03.556688,  0] ../source3/smbd/server.c:556(smbd_accept_connection)
> Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error]   accept: Software caused connection abort
> 
>  
> 
>                 And so on.  These will come in spurts; there won't be 
> any such messages for several minutes then a whole load will come 
> along all at once.  Rather like buses ...

> 
>                 I will catch "smbd" in the act at some point though, 
> and when I do I'll follow-up with a system call trace to show exactly 
> what is happening when this message gets triggered.  It will, of 
> course, be something bizarrely Solaris specific (you didn't set the 
> "SO_DONT_RANDOMLY_ABORT_CONNECTIONS" socket() option, did you?  Tsk 
> tsk tsk ...).

I think that's probably the right track :-)

The code here is triggered when poll() indicates that the socket is readable.
This socket should only be readable when a new connection is being made, and
accept() should succeed.
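A minimal, self-contained C sketch of the normal case described here: a loopback listener becomes readable once a client has connected, and accept() on it then succeeds. `loopback_poll_accept` is a hypothetical helper written for illustration, not Samba code, and it assumes an IPv4 loopback interface is available.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open a loopback listener, connect a client to it, wait for poll() to
 * report POLLIN on the listening socket, then accept(). Returns the
 * accepted fd, or -1 on failure. The caller closes the returned fd as
 * well as *listen_fd and *client_fd.
 */
static int loopback_poll_accept(int *listen_fd, int *client_fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    struct pollfd pfd;

    *client_fd = -1;
    *listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (*listen_fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a free port */
    if (bind(*listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(*listen_fd, 8) < 0 ||
        getsockname(*listen_fd, (struct sockaddr *)&addr, &len) < 0)
        return -1;

    /* A blocking connect() to a loopback listener completes once the
     * kernel has queued the connection in the listen backlog. */
    *client_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (*client_fd < 0 ||
        connect(*client_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;

    pfd.fd = *listen_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 5000) != 1 || !(pfd.revents & POLLIN))
        return -1;

    /* Readable listener: a completed connection is waiting, so accept()
     * is expected to succeed. */
    return accept(*listen_fd, NULL, NULL);
}
```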

In the source4/smbd/process_single.c code equivalent to this, there is this
helpful hint:
	/* accept an incoming connection. */
	status = socket_accept(listen_socket, &connected_socket);
	if (!NT_STATUS_IS_OK(status)) {
		DEBUG(0,("single_accept_connection: accept: %s\n", nt_errstr(status)));
		/* this looks strange, but is correct. 

		   We can only be here if woken up from select, due to
		   an incoming connection.

		   We need to throttle things until the system clears
		   enough resources to handle this new socket. 

		   If we don't then we will spin filling the log and
		   causing more problems. We don't panic as this is
		   probably a temporary resource constraint */
		sleep(1);
		return;
	}

So, my only conclusion is that your box momentarily does not have the
resources to accept the connection, and because there isn't the sleep() in
the source3 code, it prints this in a loop until the resources become
available. 

Andrew Bartlett
--
Andrew Bartlett
http://samba.org/~abartlet/
Authentication Developer, Samba Team           http://samba.org
Samba Developer, Catalyst IT                   http://catalyst.net.nz




