Solaris generating ECONNRESET causing Samba failures
davecb at canada.sun.com
Thu Aug 30 13:16:29 GMT 2001
Scott Moomaw wrote:
> In the log files, I'm seeing lines that
> reference "read_sock_data: recv failure for 4. Error = Connection reset
> by peer" on an ongoing basis. When this happens, things go haywire. I've
> included a log snippet for viewing.
> [2001/08/29 17:13:31, 1, pid=17083] smbd/service.c:make_connection(610)
> trainer (220.127.116.11) connect to service scollins as user scollins
> (uid=1038, gid=14) (pid 17083)
> [2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:read_socket_data(478)
> read_socket_data: recv failure for 4. Error = Connection reset by peer
> [2001/08/29 17:26:45, 1, pid=17083] smbd/service.c:close_cnum(650)
> trainer (18.104.22.168) closed connection to service scollins
> [2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:get_socket_addr(1031)
> getpeername failed. Error was Transport endpoint is not connected
> [2001/08/29 17:26:45, 1, pid=17083] lib/util_sock.c:get_socket_name(996)
> Gethostbyaddr failed for 0.0.0.0
> The strange thing is that I'm not finding a cause for the connection
> resets. A packet trace, using Solaris's snoop command, doesn't reveal any
> normal RST condition which in turn would cause the connection to reset.
The word "reset" is being used in two senses: sending a
packet with RST set, meaning the other end wants
to tear down the connection right now, or finding
the connection torn down unexpectedly.
I expect we're seeing the latter...
> Looking at a truss of one of these processes, I find data like what
> 7776: 173.2877 poll(0x08047884, 3, 60000) = 1
> 7776: 173.2880 read(5, "\0\0\0 3", 4) = 4
> 7776: 173.2882 read(5, "FF S M B1A\0\0\0\0\0\0\0".., 51) = 51
> 7776: 173.2883 gettimeofday(0x081DF1C4) = 0
> 7776: 173.2884 fstat64(25, 0x08047908) = 0
> 7776: 173.2885 llseek(25, 2809801, SEEK_SET) = 2809801
> 7776: 173.2886 read(25, 0x082222D5, -3070) Err#22 EINVAL
First, -3070 sounds odd: the third arg is a size_t,
which is an unsigned.
Second, EINVAL is supposed to mean "An attempt was made
to read from a stream linked to a multiplexor"
(back from Unix v6/v7); this is a Sun bug, I suspect.
Oops, never mind, neither is relevant, as fd 25
isn't to the client: it's fd 5 (below)
> 7776: 173.2887 write(5, "\001FF", 3) = 3
> 7776: poll(0x08047884, 3, 60000) (sleeping...)
> 7776: 219.6291 poll(0x08047884, 3, 60000) = 1
> 7776: 219.6298 read(5, 0x08211E89, 4) Err#131 ECONNRESET
This is the second sense of reset: the other end
> 7776: 219.6299 time() = 999108532
> The ECONNRESET from the read call directly corresponds to the connection
> reset in the logs. Does anyone have a suggestion as to what could be
> causing the ECONNRESET? I can't find any evidence from snoop,
It should show a sudden cessation of packets from the
client after the write(5, "\001FF", 3), or just
possibly a packet with FIN set (a half-close).
If not, can you mail me the raw snoop?
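For what it's worth, the cases being distinguished here (data, an
orderly close after a FIN, and a reset) look like this from the server's
side of read(). This is only a sketch in C with hypothetical names:
classify_read and the enum are not Samba code. It assumes the caller
asked for at least one byte, so a return of 0 means end-of-file.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical outcome codes for one read() on a client socket. */
enum read_outcome {
    READ_DATA,         /* n > 0: bytes arrived */
    READ_PEER_CLOSED,  /* n == 0: peer sent FIN, orderly close */
    READ_RESET,        /* n == -1, ECONNRESET: connection torn down */
    READ_RETRY,        /* n == -1, EINTR or EAGAIN: try again */
    READ_OTHER_ERROR   /* n == -1, anything else */
};

/* Classify the result of read(): n is its return value and err is
 * the errno value saved immediately after the call. */
enum read_outcome classify_read(ssize_t n, int err)
{
    assert(n >= -1);   /* read() never returns less than -1 */

    if (n > 0)
        return READ_DATA;
    if (n == 0)
        return READ_PEER_CLOSED;
    if (err == ECONNRESET)
        return READ_RESET;
    if (err == EINTR || err == EAGAIN)
        return READ_RETRY;
    return READ_OTHER_ERROR;
}
```

The read(5, ..., 4) in the truss above would come out as READ_RESET,
whereas a client that shut down cleanly would show up as
READ_PEER_CLOSED instead.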
> Does Solaris have a bug that can
> generate spurious ECONNRESET messages? Can anyone think of a possible
> workaround if this is the case?
Not in the sense we mean: the man page is bogus,
though, as it doesn't admit you can get an ECONNRESET
(reported to Julia, in hopes it's something she
The common case is a user switching off their machine
or getting a blue screen of death, then rebooting.
Both cause a Samba process to see weird stuff, and
report resets, and the reboot causes a flurry of
releasing the dead machine's oplocks and giving them
to the resurrected machine's Samba process.
We've also seen routers, hubs and ethernet cards fail
and produce this symptom.
David Collier-Brown, | Always do right. This will gratify
Performance & Engineering Team | some people and astonish the rest.
Americas Customer Engineering | -- Mark Twain
(905) 415-2849 | davecb at canada.sun.com