[Samba] Odd Samba 4 ("4.2.0pre1-GIT-b505111"; actually only using client) behaviour #2 - "accept: Software caused connection abort".
Tris Mabbs
TM-Samba201302 at Firstgrade.Co.UK
Thu Aug 29 16:45:01 MDT 2013
Hiya Jeremy,
> So your problem is the debug statement being triggered repeatedly ?
Yup.
> Adding a sleep is (IMHO) the wrong thing to do.
It has the advantage of pretty much guaranteeing the problem will "go away";
it has the disadvantage of blocking the thread/process.
However it is what the Samba4 client code does (so a similar change to the
Samba3 would be consistent; of course, so would a different change to both
codebases ...).
> Once the accept() has failed the 'POLLIN' event should not be triggered
repeatedly on the polled socket.
> Your truss trace doesn't show enough. Does a subsequent pollsys() keep
returning fd=35 ev=POLLIN|POLLHUP rev=POLLIN after the:
>
> accept(35, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) Err#130 ECONNABORTED
Now that's a very interesting question ...
OK, a quick dig around later and we get (abridged):
pollsys(0x080849F0, 8, 0xFEFFDF58, 0x00000000) = 1
fd=36 ev=POLLIN|POLLHUP rev=0
fd=35 ev=POLLIN|POLLHUP rev=0
fd=34 ev=POLLIN|POLLHUP rev=0
fd=31 ev=POLLIN|POLLHUP rev=0
fd=33 ev=POLLIN|POLLHUP rev=0
fd=32 ev=POLLIN|POLLHUP rev=POLLIN
fd=6 ev=POLLIN|POLLHUP rev=0
fd=30 ev=POLLIN|POLLHUP rev=0
timeout: 32.547000000 sec
accept(32, 0xFEFFDE0C, 0xFEFFDDF8, SOV_DEFAULT) Err#130 ECONNABORTED
...
write(8, " a c c e p t : S o".., 43) = 43
pollsys(0x080849F0, 8, 0xFEFFDF58, 0x00000000) = 1
fd=36 ev=POLLIN|POLLHUP rev=0
fd=35 ev=POLLIN|POLLHUP rev=0
fd=34 ev=POLLIN|POLLHUP rev=0
fd=31 ev=POLLIN|POLLHUP rev=POLLIN
fd=33 ev=POLLIN|POLLHUP rev=0
fd=6 ev=POLLIN|POLLHUP rev=0
fd=30 ev=POLLIN|POLLHUP rev=0
fd=32 ev=POLLIN|POLLHUP rev=0
timeout: 32.546000000 sec
accept(31, 0xFEFFDE0C, 0xFEFFDDF8, SOV_DEFAULT) = 38
AF_INET name = X.X.X.X port = 55935
forkx(0) = 10502
...
pollsys(0x080849F0, 8, 0xFEFFDF58, 0x00000000) = 1
fd=36 ev=POLLIN|POLLHUP rev=0
fd=35 ev=POLLIN|POLLHUP rev=0
fd=34 ev=POLLIN|POLLHUP rev=0
fd=33 ev=POLLIN|POLLHUP rev=0
fd=32 ev=POLLIN|POLLHUP rev=POLLIN
fd=31 ev=POLLIN|POLLHUP rev=0
fd=6 ev=POLLIN|POLLHUP rev=0
fd=30 ev=POLLIN|POLLHUP rev=0
timeout: 31.034000000 sec
accept(32, 0xFEFFDE0C, 0xFEFFDDF8, SOV_DEFAULT) Err#130 ECONNABORTED
...
write(8, " a c c e p t : S o".., 43) = 43
Received signal #18, SIGCLD, in pollsys() [caught]
siginfo: SIGCLD CLD_EXITED pid=10504 status=0x0000
pollsys(0x080849F0, 8, 0xFEFFDF58, 0x00000000) Err#4 EINTR
fd=36 ev=POLLIN|POLLHUP rev=0
fd=35 ev=POLLIN|POLLHUP rev=0
fd=34 ev=POLLIN|POLLHUP rev=0
fd=33 ev=POLLIN|POLLHUP rev=0
fd=31 ev=POLLIN|POLLHUP rev=0
fd=6 ev=POLLIN|POLLHUP rev=0
fd=30 ev=POLLIN|POLLHUP rev=0
fd=32 ev=POLLIN|POLLHUP rev=0
timeout: 31.032000000 sec
So that would be a no - next "poll()" and there's no revent flagged on that
same socket.
Which would confirm your thought that "sleep()" is perhaps not the way to
go. However I don't know the Samba code (at all!) nearly well enough to
comment - that "sleep()" may be serving some other vital purpose under
different circumstances?
Either way, it would appear that my second suggestion would still be valid -
only log this (and possibly a couple of other error conditions) when more
debugging is enabled?
Another passing thought ...
That "truss" only captured 2 ECONNABORTED incidents - typical that nothing
much happens when you're specifically looking at it.
However, is it likely to be a coincidence that both were on the same socket?
FD#32 happens to be bound to port 445 on one specific interface of the
machine; tomorrow I might try a more extended test and poke lots of traffic
at that interface (and/or might stick the socket descriptor number into the
debug message) - if anything interesting presents itself (E.g., it's always
the same port, or interface, ... where the problem occurs) I'll post an
update saying so.
Probably doesn't affect the solution, but possibly technically interesting
anyway ...
Many thanks, and regards,
Tris.
-----Original Message-----
From: Jeremy Allison [mailto:jra at samba.org]
Sent: 29 August 2013 20:52
To: Tris Mabbs
Cc: 'Andrew Bartlett'; samba at lists.samba.org; samba-technical at samba.org
Subject: Re: [Samba] Odd Samba 4 ("4.2.0pre1-GIT-b505111"; actually only
using client) behaviour #2 - "accept: Software caused connection abort".
On Thu, Aug 29, 2013 at 10:10:38AM +0100, Tris Mabbs wrote:
> Hiya Andrew,
>
> Many thanks for the typically helpful and comprehensive reply :-)
>
> > I think that's probably the right track :-)
> >
> > The code here is triggered when poll() indicates that the socket is
> readable.
> > This socket should only be readable when a new connection is being
> > made,
> and accept() should succeed.
> > ...
> > So, my only conclusion is that your box momentarily does not have
> > the
> resources to accept the connection,
> > and because there isn't the sleep() in the source3 code, it prints
> > this in
> a loop until the resources become available.
>
> Absolutely, and on any normal Unix implementation I'd agree entirely.
> That sort of "poll()"/"accept()"/... code is perfectly normal and
> exactly what you'd expect - I've written plenty of very similar code
> myself over the years ...
> However this is "Solaris" :-(
>
> Caught in the act:
>
> ...
> 16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x00000000) = 1
> 16327: fd=39 ev=POLLIN|POLLHUP rev=0
> 16327: fd=38 ev=POLLIN|POLLHUP rev=0
> 16327: fd=34 ev=POLLIN|POLLHUP rev=0
> 16327: fd=36 ev=POLLIN|POLLHUP rev=0
> 16327: fd=37 ev=POLLIN|POLLHUP rev=POLLIN
> 16327: fd=35 ev=POLLIN|POLLHUP rev=0
> 16327: fd=33 ev=POLLIN|POLLHUP rev=0
> 16327: fd=6 ev=POLLIN|POLLHUP rev=0
> 16327: timeout: 59.999000000 sec
> 16327: accept(37, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) = 41
> 16327: AF_INET name = X.X.X.X port = 28986
> 16327: forkx(0) = 26942
> 16327: lwp_sigmask(SIG_SETMASK, 0x00011080, 0x00000000, 0x00000000,
> 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
> 16327: close(41) = 0
> 16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x00000000) = 1
> 16327: fd=39 ev=POLLIN|POLLHUP rev=0
> 16327: fd=38 ev=POLLIN|POLLHUP rev=0
> 16327: fd=34 ev=POLLIN|POLLHUP rev=0
> 16327: fd=36 ev=POLLIN|POLLHUP rev=0
> 16327: fd=35 ev=POLLIN|POLLHUP rev=POLLIN
> 16327: fd=33 ev=POLLIN|POLLHUP rev=0
> 16327: fd=6 ev=POLLIN|POLLHUP rev=0
> 16327: fd=37 ev=POLLIN|POLLHUP rev=0
> 16327: timeout: 44.696000000 sec
> 16327: accept(35, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) Err#130
> ECONNABORTED
> ...
>
> So there's nothing odd about the "poll()". Typically Solaris will
> flag POLLERR in "revents" if it's out of resources, and POLLHUP if the
> remote end closed the connection before it was fully established
> (remote NAKed, or ignored, the connection SYN; terminally low on
> resources at t'other end of the socket; ...). Neither is happening
> here which would suggest things are proceeding as normal for the
connection establishment.
>
> The server darn' well shouldn't be out of any resources either. In
> terms of physical resources, at the point that occurred the CPUs were
> at 99.9% idle, there was 15Gb of free RAM (so not out of kernel memory
> then ...) and only a total of about 400 sockets (TCP, Unix, ...) in
> use across the entire system, as reported by "netstat -na | wc -l" -
> well below peak levels seen on this system.
>
> So it's going to be that hypothetical Solaris specific
> SO_DONT_RANDOMLY_ABORT_CONNECTIONS socket() option, isn't it :-)
>
> So could I request please, that in the source3 code, either:
> a. The same "sleep()" is added as in the source4 code; -and/or-
> b. If errno == ECONNABORTED then only log the error if the debug
> level is (substantially?) higher than zero.
So your problem is the debug statement being triggered repeatedly ?
Adding a sleep is (IMHO) the wrong thing to do. Once the accept() has failed
the 'POLLIN' event should not be triggered repeatedly on the polled socket.
Your truss trace doesn't show enough. Does a subsequent pollsys() keep
returning fd=35 ev=POLLIN|POLLHUP rev=POLLIN after the:
accept(35, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) Err#130 > ECONNABORTED
?
Jeremy.
More information about the samba
mailing list