Solaris generating ECONNRESET causing Samba failures
David Collier-Brown
davecb at canada.sun.com
Thu Aug 30 13:16:29 GMT 2001
Scott Moomaw wrote:
> In the log files, I'm seeing lines that
> reference "read_sock_data: recv failure for 4. Error = Connection reset
> by peer" on an ongoing basis. When this happens, things go haywire. I've
> included a log snippet for viewing.
>
> [2001/08/29 17:13:31, 1, pid=17083] smbd/service.c:make_connection(610)
> trainer (147.138.20.16) connect to service scollins as user scollins
> (uid=1038, gid=14) (pid 17083)
> [2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:read_socket_data(478)
> read_socket_data: recv failure for 4. Error = Connection reset by peer
> [2001/08/29 17:26:45, 1, pid=17083] smbd/service.c:close_cnum(650)
> trainer (147.138.20.16) closed connection to service scollins
> [2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:get_socket_addr(1031)
> getpeername failed. Error was Transport endpoint is not connected
> [2001/08/29 17:26:45, 1, pid=17083] lib/util_sock.c:get_socket_name(996)
> Gethostbyaddr failed for 0.0.0.0
>
> The strange thing is that I'm not finding a cause for the connection
> resets. A packet trace, using Solaris's snoop command, doesn't reveal any
> normal RST condition which in turn would cause the connection to reset.
The word "reset" is being used in two senses: sending a
packet with RST set, meaning the other end wants
to tear down the connection right now, or finding
the connection torn down unexpectedly.
I expect we're seeing the latter...
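For anyone who wants to see what a reset looks like from the reading side, here is a minimal loopback sketch in Python. It is not Samba code: the function name provoke_reset is made up for illustration, and it assumes a Unix-like system where struct linger is two ints. The peer aborts the connection (SO_LINGER with a zero timeout, so close() sends RST instead of FIN), and the blocked reader then gets ECONNRESET, much as smbd does in the log above. It only reproduces the error code locally and says nothing about what actually happened on the wire in this case.

```python
import errno
import socket
import struct


def provoke_reset():
    """Return the errno the reader sees after its peer aborts the connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # any free loopback port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server_side, _ = listener.accept()

    # l_onoff=1, l_linger=0: close() aborts the connection instead of
    # doing an orderly shutdown. Assumption: struct linger is two ints,
    # as on Solaris, Linux and the BSDs.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    client.close()

    server_side.settimeout(5)                # do not hang if nothing arrives
    try:
        server_side.recv(4)                  # same shape as read(5, buf, 4)
        return None                          # no error seen
    except OSError as exc:
        return exc.errno
    finally:
        server_side.close()
        listener.close()
```

On the systems I would expect this to run on, provoke_reset() returns errno.ECONNRESET; the numeric value differs between platforms (truss shows 131 on Solaris).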
> Looking at a truss of one of these processes, I find data like what
> follows:
>
> 7776: 173.2877 poll(0x08047884, 3, 60000) = 1
> 7776: 173.2880 read(5, "\0\0\0 3", 4) = 4
> 7776: 173.2882 read(5, "FF S M B1A\0\0\0\0\0\0\0".., 51) = 51
> 7776: 173.2883 gettimeofday(0x081DF1C4) = 0
> 7776: 173.2884 fstat64(25, 0x08047908) = 0
> 7776: 173.2885 llseek(25, 2809801, SEEK_SET) = 2809801
> 7776: 173.2886 read(25, 0x082222D5, -3070) Err#22 EINVAL
First, -3070 sounds odd: the third arg is a size_t,
which is unsigned.
Second, EINVAL is supposed to mean "An attempt was made
to read from a stream linked to a multiplexor"
(back from Unix v6/v7); this is a Sun bug, I suspect.
Oops, never mind, neither is relevant, as fd 25
isn't the connection to the client: that's fd 5 (below).
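As an aside on the -3070: a negative count printed by truss is just a very large unsigned value shown as a signed number. A small sketch, assuming a 32-bit size_t (an assumption on my part, based on the 32-bit addresses in the trace); the helper names are made up for illustration:

```python
import ctypes


def as_unsigned_32(n):
    """Reinterpret a signed 32-bit value as unsigned, as a 32-bit size_t would hold it."""
    return ctypes.c_uint32(n).value


def as_signed_32(n):
    """Reinterpret an unsigned 32-bit value as signed, the way the trace printed it."""
    return ctypes.c_int32(n).value
```

With these, as_unsigned_32(-3070) is 4294964226, i.e. 2**32 - 3070, which is the byte count the read() call was actually handed.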
> 7776: 173.2887 write(5, "\001FF", 3) = 3
> 7776: poll(0x08047884, 3, 60000) (sleeping...)
> 7776: 219.6291 poll(0x08047884, 3, 60000) = 1
> 7776: 219.6298 read(5, 0x08211E89, 4) Err#131 ECONNRESET
This is the second sense of reset: the other end
has disappeared!
> 7776: 219.6299 time() = 999108532
>
> The ECONNRESET from the read call directly corresponds to the connection
> reset in the logs. Does anyone have a suggestion as to what could be
> causing the ECONNRESET? I can't find any evidence from snoop,
It should show a sudden cessation of packets from the
client after the write(5, "\001FF", 3), or just
possibly a packet with FIN set (a half-close).
If not, can you mail me the raw snoop?
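To make the FIN case concrete: an orderly half-close normally reaches the reader as a zero-length read, not as ECONNRESET. A minimal loopback sketch in Python, with the hypothetical function name read_after_half_close; it illustrates the general socket behaviour only and is not taken from Samba:

```python
import socket


def read_after_half_close():
    """Return (what the reader gets after the peer's FIN, what the peer can still receive)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # any free loopback port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5)
    client.settimeout(5)

    try:
        # Half-close: the client sends FIN but keeps its receive side open.
        client.shutdown(socket.SHUT_WR)

        data = server_side.recv(4)           # b"" means orderly end-of-stream
        server_side.sendall(b"ok")           # the other direction still works
        reply = client.recv(2)
        return data, reply
    finally:
        client.close()
        server_side.close()
        listener.close()
```

So if the trace shows a FIN from the client, I would expect the server's read to return 0 bytes; an ECONNRESET points at an abort or at the peer vanishing.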
> Does Solaris have a bug that can
> generate spurious ECONNRESET messages? Can anyone think of a possible
> workaround if this is the case?
Not in the sense we mean: the man page is bogus,
though, as it doesn't admit you can get an ECONNRESET
(reported to Julia, in hopes it's something she
can confirm/fix).
The common case is a user switching off their machine
or getting a blue screen of death, then rebooting.
Both cause a Samba process to see weird stuff and
report resets, and the reboot causes a flurry of
releasing the dead machine's oplocks and giving them
to the resurrected machine's Samba process.
We've also seen routers, hubs and ethernet cards fail
and produce this symptom.
--dave
--
David Collier-Brown, | Always do right. This will gratify
Performance & Engineering Team | some people and astonish the rest.
Americas Customer Engineering | -- Mark Twain
(905) 415-2849 | davecb at canada.sun.com