Solaris generating ECONNRESET causing Samba failures

Thu Aug 30 13:16:29 GMT 2001

Scott Moomaw wrote:
>  	 In the log files, I'm seeing lines that
> reference "read_socket_data: recv failure for 4.  Error = Connection reset
> by peer" on an ongoing basis.  When this happens, things go haywire.  I've
> included a log snippet for viewing.
> 
> [2001/08/29 17:13:31, 1, pid=17083] smbd/service.c:make_connection(610)
>   trainer (147.138.20.16) connect to service scollins as user scollins
> (uid=1038, gid=14) (pid 17083)
> [2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:read_socket_data(478)
>   read_socket_data: recv failure for 4. Error = Connection reset by peer
> [2001/08/29 17:26:45, 1, pid=17083] smbd/service.c:close_cnum(650)
>   trainer (147.138.20.16) closed connection to service scollins
> [2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:get_socket_addr(1031)
>   getpeername failed. Error was Transport endpoint is not connected
> [2001/08/29 17:26:45, 1, pid=17083] lib/util_sock.c:get_socket_name(996)
>   Gethostbyaddr failed for 0.0.0.0
> 
> The strange thing is that I'm not finding a cause for the connection
> resets.  A packet trace, using Solaris's snoop command, doesn't reveal any
> normal RST condition which in turn would cause the connection to reset.

	The word "reset" is being used in two senses: sending a
	packet with RST set, meaning the other end wants
	to tear down the connection right now, or finding
	the connection torn down unexpectedly.

	I expect we're seeing the latter...

> Looking at a truss of one of these processes, I find data like what
> follows:
> 
> 7776:   173.2877        poll(0x08047884, 3, 60000)                      = 1
> 7776:   173.2880        read(5, "\0\0\0 3", 4)                          = 4
> 7776:   173.2882        read(5, "FF S M B1A\0\0\0\0\0\0\0".., 51)       = 51
> 7776:   173.2883        gettimeofday(0x081DF1C4)                        = 0
> 7776:   173.2884        fstat64(25, 0x08047908)                         = 0
> 7776:   173.2885        llseek(25, 2809801, SEEK_SET)                   = 2809801
> 7776:   173.2886        read(25, 0x082222D5, -3070)                     Err#22 EINVAL

	First, -3070 sounds odd: the third arg is a size_t,
	which is unsigned.

	Second, EINVAL is supposed to mean "An attempt was made
	to read from a stream linked to  a multiplexor"
	(back from Unix v6/v7); this is a Sun bug, I suspect.

	Oops, never mind, neither is relevant, as fd 25
	isn't the connection to the client: that's fd 5 (below).
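
	For what it's worth, the -3070 is most likely just truss
	printing an unsigned count as if it were signed. A minimal
	sketch of the arithmetic, assuming a 32-bit smbd (the
	addresses in the truss output are 32 bits wide); the helper
	names are mine, purely for illustration:

```python
# Assumption: the traced smbd is a 32-bit process, so size_t is 32 bits wide.

def as_unsigned32(n):
    """Return n reinterpreted as an unsigned 32-bit integer."""
    return n & 0xFFFFFFFF

def as_signed32(n):
    """Return the unsigned 32-bit value n reinterpreted as signed."""
    return n - (1 << 32) if n & 0x80000000 else n

count = as_unsigned32(-3070)
print(count)                 # 4294964226: the count read() was really handed
print(count > 0x7FFFFFFF)    # True: more than a signed 32-bit return can hold
print(as_signed32(count))    # -3070: what truss printed
```

	A count that size is almost certainly a length calculation
	gone negative somewhere before the read(), which may be
	worth chasing separately from the resets.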


> 7776:   173.2887        write(5, "\001FF", 3)                           = 3
> 7776:   poll(0x08047884, 3, 60000)      (sleeping...)
> 7776:   219.6291        poll(0x08047884, 3, 60000)                      = 1
> 7776:   219.6298        read(5, 0x08211E89, 4)                          Err#131 ECONNRESET

	This is the second sense of reset: the other end
	has disappeared!
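
	If you want to see what the reading side experiences, here
	is a rough sketch that provokes the same error on a loopback
	connection. Note the assumptions: it forces the error with an
	abortive close (SO_LINGER with a zero timeout, which sends an
	RST), simply because that is the easiest way to reproduce it
	on one machine. It shows the symptom smbd sees, not the reason
	your clients go away. The function name is made up, and the
	struct layout for SO_LINGER assumes a Unix-like system.

```python
import errno
import socket
import struct

def provoke_reset():
    """Abort a loopback TCP connection; return the errno the reader gets."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5.0)       # don't hang forever if nothing arrives

    # l_onoff=1, l_linger=0: close() aborts the connection instead of
    # going through the normal FIN handshake.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    client.close()

    try:
        server_side.recv(4)           # stands in for smbd's read(5, buf, 4)
        return 0                      # orderly EOF or data: no error
    except OSError as e:
        return e.errno                # None if the recv merely timed out
    finally:
        server_side.close()
        listener.close()

err = provoke_reset()
print(errno.errorcode.get(err, err))
```

	The numeric value differs between systems (it is 131 in the
	truss output above), so compare against errno.ECONNRESET
	rather than a literal number.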


> 7776:   219.6299        time()                                          = 999108532
> 
> The ECONNRESET from the read call directly corresponds to the connection
> reset in the logs.  Does anyone have a suggestion as to what could be
> causing the ECONNRESET?  I can't find any evidence from snoop, 

	It should show a sudden cessation of packets from the
	client after the write(5, "\001FF", 3), or just
	possibly a packet with FIN set (a half-close).
	
	If not, can you mail me the raw snoop?

> Does Solaris have a bug that can
> generate spurious ECONNRESET messages?  Can anyone think of a possible
> workaround if this is the case?

	Not in the sense we mean: the man page is bogus,
	though, as it doesn't admit you can get an ECONNRESET
	(reported to Julia, in hopes it's something she
	can confirm/fix).

	The common case is a user switching off their machine
	or getting a blue screen of death, then rebooting.
	Both cause a Samba process to see weird stuff and
	report resets, and the reboot causes a flurry of
	releasing the dead machine's oplocks and giving them
	to the resurrected machine's Samba process.

	We've also seen routers, hubs and ethernet cards fail
	and produce this symptom.
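
	One way to tell these cases apart is to see whether the
	resets cluster on a few clients (suspect that machine or its
	network path) or are spread evenly. A minimal sketch, assuming
	your log keeps the format of the snippet you posted (header
	line with pid=, message on the following line); the function
	name and regular expressions are mine, and the sample is taken
	from your own log:

```python
import re
from collections import Counter

# Header line, e.g. "[2001/08/29 17:13:31, 1, pid=17083] smbd/service.c:..."
HEADER = re.compile(r"^\[(?P<when>[^,]+), *\d+, pid=(?P<pid>\d+)\]")
# Message line, e.g. "  trainer (147.138.20.16) connect to service ..."
CONNECT = re.compile(r"^\s+(?P<host>\S+) \((?P<ip>[\d.]+)\) connect to service")

def resets_by_client(lines):
    """Count 'Connection reset by peer' messages per connecting client."""
    client_of = {}        # pid -> "host (ip)" from the last connect message
    resets = Counter()
    pid = None            # pid from the most recent header line
    for line in lines:
        m = HEADER.match(line)
        if m:
            pid = m.group("pid")
            continue
        m = CONNECT.match(line)
        if m and pid is not None:
            client_of[pid] = f"{m.group('host')} ({m.group('ip')})"
        elif "Connection reset by peer" in line:
            resets[client_of.get(pid, "unknown")] += 1
    return resets

SAMPLE = """\
[2001/08/29 17:13:31, 1, pid=17083] smbd/service.c:make_connection(610)
  trainer (147.138.20.16) connect to service scollins as user scollins
(uid=1038, gid=14) (pid 17083)
[2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:read_socket_data(478)
  read_socket_data: recv failure for 4. Error = Connection reset by peer
""".splitlines()

print(resets_by_client(SAMPLE))   # Counter({'trainer (147.138.20.16)': 1})
```

	To run it over a real log, pass an open file instead of
	SAMPLE; any reset whose pid has no earlier connect line in
	the file is counted under "unknown".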

--dave
-- 
David Collier-Brown,           | Always do right. This will gratify 
Performance & Engineering Team | some people and astonish the rest.
Americas Customer Engineering  |                      -- Mark Twain
(905) 415-2849                 | davecb at canada.sun.com