Solaris generating ECONNRESET causing Samba failures

Wed Aug 29 22:07:16 GMT 2001

We have a group of servers all running Solaris (2.6 and 2.8) with Samba.
To circumvent locking problems, I've turned off oplocks on some of the
servers that are running Samba 2.2.x.  Now I'm seeing client problems that
I think I've traced back to the source.  The log files repeatedly show
lines such as "read_socket_data: recv failure for 4.  Error = Connection
reset by peer".  When this happens, the connection to the share is closed
and things go haywire on the client.  I've included a log snippet for
viewing.

[2001/08/29 17:13:31, 1, pid=17083] smbd/service.c:make_connection(610)
  trainer (147.138.20.16) connect to service scollins as user scollins (uid=1038, gid=14) (pid 17083)
[2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:read_socket_data(478)
  read_socket_data: recv failure for 4. Error = Connection reset by peer
[2001/08/29 17:26:45, 1, pid=17083] smbd/service.c:close_cnum(650)
  trainer (147.138.20.16) closed connection to service scollins
[2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:get_socket_addr(1031)
  getpeername failed. Error was Transport endpoint is not connected
[2001/08/29 17:26:45, 1, pid=17083] lib/util_sock.c:get_socket_name(996)
  Gethostbyaddr failed for 0.0.0.0
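
To see how often this happens and which clients are affected, the log can
be tallied with a short script.  The following is only a minimal sketch,
assuming the header format shown in the snippet above (timestamp, debug
level, pid, then source location).  `parse_entries` and `resets_by_client`
are hypothetical helper names, not part of Samba.

```python
import re
from collections import Counter

# Header line, e.g.
# [2001/08/29 17:13:31, 1, pid=17083] smbd/service.c:make_connection(610)
HEADER = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}), (?P<level>\d+), "
    r"pid=(?P<pid>\d+)\] (?P<where>\S+)$"
)
# Message written when a client connects to a share.
CONNECT = re.compile(r"^(?P<host>\S+) \((?P<ip>[\d.]+)\) connect to service")


def parse_entries(text):
    """Return a list of (timestamp, pid, source_location, message) tuples."""
    entries = []
    current = None
    for line in text.splitlines():
        m = HEADER.match(line.rstrip())
        if m:
            current = {"ts": m["ts"], "pid": int(m["pid"]),
                       "where": m["where"], "msg": []}
            entries.append(current)
        elif current is not None and line.strip():
            # Continuation lines (indented or wrapped) belong to the
            # most recent header.
            current["msg"].append(line.strip())
    return [(e["ts"], e["pid"], e["where"], " ".join(e["msg"]))
            for e in entries]


def resets_by_client(text):
    """Count 'Connection reset by peer' recv failures per client."""
    client_of_pid = {}
    counts = Counter()
    for ts, pid, where, msg in parse_entries(text):
        m = CONNECT.match(msg)
        if m:
            client_of_pid[pid] = f"{m['host']} ({m['ip']})"
        elif "recv failure" in msg and "Connection reset by peer" in msg:
            counts[client_of_pid.get(pid, "unknown")] += 1
    return counts


SAMPLE_LOG = """\
[2001/08/29 17:13:31, 1, pid=17083] smbd/service.c:make_connection(610)
  trainer (147.138.20.16) connect to service scollins as user scollins
(uid=1038, gid=14) (pid 17083)
[2001/08/29 17:26:45, 0, pid=17083] lib/util_sock.c:read_socket_data(478)
  read_socket_data: recv failure for 4. Error = Connection reset by peer
[2001/08/29 17:26:45, 1, pid=17083] smbd/service.c:close_cnum(650)
  trainer (147.138.20.16) closed connection to service scollins
"""

for client, count in resets_by_client(SAMPLE_LOG).items():
    print(client, count)
```

A reset is attributed to a client only if the connect message for the same
pid appears earlier in the text being scanned; otherwise it is counted
under "unknown".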

The strange thing is that I can't find a cause for the connection resets.
A packet trace taken with Solaris's snoop command doesn't reveal any of
the normal RST conditions that would cause the connection to reset.  A
truss of one of these processes shows the following:

7776:   173.2877        poll(0x08047884, 3, 60000)                      = 1
7776:   173.2880        read(5, "\0\0\0 3", 4)                          = 4
7776:   173.2882        read(5, "FF S M B1A\0\0\0\0\0\0\0".., 51)       = 51
7776:   173.2883        gettimeofday(0x081DF1C4)                        = 0
7776:   173.2884        fstat64(25, 0x08047908)                         = 0
7776:   173.2885        llseek(25, 2809801, SEEK_SET)                   = 2809801
7776:   173.2886        read(25, 0x082222D5, -3070)                     Err#22 EINVAL
7776:   173.2887        write(5, "\001FF", 3)                           = 3
7776:   poll(0x08047884, 3, 60000)      (sleeping...)
7776:   219.6291        poll(0x08047884, 3, 60000)                      = 1
7776:   219.6298        read(5, 0x08211E89, 4)                          Err#131 ECONNRESET
7776:   219.6299        time()                                          = 999108532
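
To pick the failing calls out of a longer truss capture, something like
the following could help.  It is only a sketch: `FailedCall` and
`failed_calls` are hypothetical names, and the regular expression assumes
the timestamped layout shown above (pid, relative time, call, then
`Err#<number> <NAME>`).

```python
import re
from typing import List, NamedTuple


class FailedCall(NamedTuple):
    pid: int
    timestamp: float
    call: str
    args: str
    errno: int
    errno_name: str


# Matches only lines whose result column is an error.  Whitespace between
# the errno number and its symbolic name is treated as optional.
FAILED = re.compile(
    r"^(?P<pid>\d+):\s+(?P<ts>\d+\.\d+)\s+"
    r"(?P<call>\w+)\((?P<args>.*)\)\s+"
    r"Err#(?P<errno>\d+)\s*(?P<name>[A-Z][A-Z0-9]*)\s*$"
)


def failed_calls(trace: str) -> List[FailedCall]:
    """Return every system call in the trace that ended with Err#."""
    found = []
    for line in trace.splitlines():
        m = FAILED.match(line)
        if m:
            found.append(FailedCall(
                pid=int(m["pid"]),
                timestamp=float(m["ts"]),
                call=m["call"],
                args=m["args"],
                errno=int(m["errno"]),
                errno_name=m["name"],
            ))
    return found


# A few lines taken from the trace above (raw string keeps the backslash).
SAMPLE_TRACE = r"""
7776:   173.2885        llseek(25, 2809801, SEEK_SET)                   = 2809801
7776:   173.2886        read(25, 0x082222D5, -3070)                     Err#22 EINVAL
7776:   173.2887        write(5, "\001FF", 3)                           = 3
7776:   poll(0x08047884, 3, 60000)      (sleeping...)
7776:   219.6291        poll(0x08047884, 3, 60000)                      = 1
7776:   219.6298        read(5, 0x08211E89, 4)                          Err#131 ECONNRESET
"""

for c in failed_calls(SAMPLE_TRACE):
    print(f"{c.timestamp:9.4f}  {c.call}({c.args}) -> {c.errno_name}")
```

Run against the lines above, this reports two failing calls: the read on
descriptor 25 that returned EINVAL and the read on descriptor 5 that
returned ECONNRESET.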

The ECONNRESET from the read call corresponds directly to the connection
reset in the logs.  Does anyone have a suggestion as to what could be
causing the ECONNRESET?  I can't find any evidence from snoop, interface
statistics on the switch and host, or a network sniffer that accounts for
the resets.  They appear across the whole group of servers, which vary in
hardware, so hardware isn't the common factor.  Does Solaris have a bug
that can generate spurious ECONNRESET errors?  Can anyone think of a
possible workaround if this is the case?
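
For comparison, here is how an ordinary reset reaches a reader.  This is
a minimal sketch, not a reproduction of the Solaris behavior: it assumes a
POSIX-like system and the loopback interface, and `provoke_reset` is a
hypothetical name.  One side closes with SO_LINGER set to a zero timeout,
which aborts the connection with an RST, and the other side's next read
fails with ECONNRESET.

```python
import errno
import socket
import struct


def provoke_reset():
    """Return the errno seen by a reader whose peer aborts the connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free port on loopback
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    # struct linger { l_onoff = 1, l_linger = 0 }: close() now aborts the
    # connection with an RST instead of the normal FIN handshake.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    client.close()

    try:
        server_side.settimeout(5)
        server_side.recv(4)  # a 4-byte read, as in the truss output
        return None          # no error was reported
    except OSError as exc:
        return exc.errno
    finally:
        server_side.close()
        listener.close()


if __name__ == "__main__":
    code = provoke_reset()
    print(errno.errorcode.get(code, code))
```

In this ordinary case the RST is a real segment on the wire, so a packet
trace would normally be expected to show it.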

Scott

------------------------------------------------------------------------
 Scott Moomaw, Network Administrator              Scott at Bridgewater.edu
 Bridgewater College, IT Center
 Bridgewater, VA  22812
 Phone (540) 828 - 8000  x5437              FAX:  (540) 828 - 5493