SIGUSR1 or SIGINT error

Dave Dykstra dwd at bell-labs.com
Wed Feb 13 02:57:34 EST 2002


On Mon, Feb 11, 2002 at 08:15:46PM -0500, David Birnbaum wrote:
> Dave, et. al,
> 
> Well, I decided to burn some disk space and did a truss of the whole
> thing...and caught it in the act.  It looks as though the first process
> gets an EPIPE.  If memory serves, EPIPE should only occur when the child
> has died on the other side!  Which also gets an EPIPE, amazingly enough.

Not exactly -- an EPIPE/SIGPIPE comes from trying to write to a pipe which
has no reader.  It's not necessarily a child.


> I'm attaching the relevant truss's, along with the appropriate FD mapping
> to hopefully shed some light on this.  It's obvious the client hasn't
> died, since it's still around to receive the error message.
> 
> The processes are as follows:
> 
>            CLIENT                         SERVER
>   corona PID 26802:TCP/40000  ->  hite PID 13173:TCP/rsync
>                                              |
>                                        PID 13023
> 
> If I understand things, I see 13173 do a write to FD5, which is the TCP
> connection to corona. 

I actually don't see a write to FD5 until after first the SIGPIPE is received,
but since it afterwards tries to write to FD5 and gets an EPIPE it's a pretty
good guess that it was a write to FD5 earlier in the truss which was the
culprit.

It seems odd to me that an EPIPE/SIGPIPE is coming from writing to
a TCP connection.  I thought those only came from pipes.  Ah, from the 
Solaris write(2) man page under EPIPE it says 
    (or to a file descriptor created by socket(3N), using type SOCK_STREAM
    that is no longer connected to a peer endpoint)

> It then sends the USR1 kill to 13203, which
> politely dies.
> 
> Meanwhile, on corona, we see a write on FD4, which also gets an EPIPE.
> It then dies.  It's clear that before the write failed, it was waiting for
> a while (you see the sleeping system calls).
> 
> This is reproducable at will (although it takes a bit of disk space to do
> so), so let me know if there's any specific thing that needs to be focused
> in on.  I'm especially curious as to which side gets the EPIPE first (and
> how), since we should really see one of the processes dying in some other
> fashion first, unless I misunderstand the man description of EPIPE.
> Solaris bug?  Unlikely, but certainly possible (wouldn't be the first
> time, but it's a helloflot harder to track down).

Yes, it would be helpful to know which side gets the EPIPE first.  Use the
truss -d option to add a timestamp.  That makes a timestamp relative to the
start of the trace on each side so they'll be a little off from each other
but if you start them both about the same time that should tell us which
one was first.

It appears that the TCP connection is getting torn down first.  Next time
you post watch the netstat output on both sides to see what's happening
with the TCP queues for that connection just before it dies, especially if
it freezes for a while before dying.  Also tell us the complete command
line of the rsync command you're using.  Although it may seem unlikely, a
Solaris bug is a possibility because rsync greatly stresses TCP
implementations and has exposed many TCP bugs in the past.  What version of
Solaris is running on each side, and are they up to date on their patches?
If netstat shows data in a send queue that is not being sent, that points
to a TCP problem.

- Dave Dykstra



...
> 13203:	write(9, " ( c o m p a t i b l e ;".., 1024)	= 1024
> 13203:	write(9, "                        ".., 1024)	= 1024
> 13203:	write(9, " n d o w s   9 8 ;   D i".., 448)	= 448
> 13203:	poll(0xFFBECF20, 1, 60000)			= 1
> 13203:	read(5, "05F5FFFF", 4)				= 4
> 13173:	    Received signal #13, SIGPIPE [caught]
> 13203:	write(9, " z i l l a / 4 . 6 1   (".., 1024)	= 1024
> 13173:	sigaction(SIGUSR1, 0xFFBEC2B0, 0xFFBEC330)	= 0
> 13173:	sigaction(SIGUSR2, 0xFFBEC2B0, 0xFFBEC330)	= 0
> 13173:	getpid()					= 13173 [12263]
> 13173:	kill(13203, SIGUSR1)				= 0
> 13173:	getpid()					= 13173 [12263]
> 13173:	poll(0xFFBEABB0, 2, 60000)			= 1
> 13173:	write(5, " B\0\0\b r s y n c   e r".., 70)	Err#32 EPIPE
> 13173:	    Received signal #13, SIGPIPE [default]
> 13173:		*** process killed ***
> 13203:	    Received signal #16, SIGUSR1 [caught]
> 13203:	      siginfo: SIGUSR1 pid=13173 uid=0
> 13203:	sigaction(SIGUSR1, 0xFFBED118, 0xFFBED198)	= 0
> 13203:	sigaction(SIGUSR2, 0xFFBED118, 0xFFBED198)	= 0
> 13203:	unlink("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/.visitorid.idx.51ayYz") = 0
> 13203:	getpid()					= 13203 [1]
> 13203:	write(10, " A\0\0\b r s y n c   e r".., 69)	Err#32 EPIPE
> 13203:	    Received signal #13, SIGPIPE [caught]
> 13203:	sigaction(SIGUSR1, 0xFFBEC840, 0xFFBEC8C0)	= 0
> 13203:	sigaction(SIGUSR2, 0xFFBEC840, 0xFFBEC8C0)	= 0
> 13203:	write(10, " A\0\0\b r s y n c   e r".., 69)	Err#32 EPIPE
> 13203:	    Received signal #13, SIGPIPE [default]
> 13203:		*** process killed ***

...
> 26802:	poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 26802:	poll(0xFFBE5CA0, 1, 60000)			= 1
> 26802:	write(4, "8DF3FFFF", 4)				Err#32 EPIPE
> 26802:	    Received signal #13, SIGPIPE [caught]
> 26802:	sigaction(SIGUSR1, 0xFFBE5838, 0xFFBE58B8)	= 0
> 26802:	sigaction(SIGUSR2, 0xFFBE5838, 0xFFBE58B8)	= 0
> 26802:	write(2, " r s y n c   e r r o r :".., 66)	= 66
> 26802:	mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF310000
> 26802:	munmap(0xFF310000, 8192)			= 0
> 26802:	llseek(0, 0, SEEK_CUR)				= 4322
> 26802:	_exit(20)

> COMMAND   PID USER   FD   TYPE        DEVICE  SIZE/OFF    NODE NAME
> rsync   26802 root  cwd   VDIR          85,0      1024       2 /
> rsync   26802 root  txt   VREG        32,359   2358212 5901093 /export/opt/rsync/2.5.1/bin/rsync
> rsync   26802 root  txt   VREG          85,0   1136748   25111 /usr/lib/libc.so.1
> rsync   26802 root  txt   VREG          85,0    888532   25069 /usr/lib/libnsl.so.1
> rsync   26802 root  txt   VREG          85,0     17096  148997 /usr/platform/sun4u/lib/libc_psr.so.1
> rsync   26802 root  txt   VREG          85,0     24968   25028 /usr/lib/libmp.so.2
> rsync   26802 root  txt   VREG          85,0     70864   25230 /usr/lib/libsocket.so.1
> rsync   26802 root  txt   VREG          85,0    382600   24999 / (/dev/md/dsk/d0)
> rsync   26802 root  txt   VREG          85,0      4628   25116 /usr/lib/libdl.so.1
> rsync   26802 root  txt   VREG          85,0    196996   25279 /usr/lib/ld.so.1
> rsync   26802 root    0u  VCHR         24,52    0t4256  318599 /devices/pseudo/pts at 0:52->ttcompat->ldterm->ptem->pts
> rsync   26802 root    1u  FIFO 0x30006508b08       0t1   45559 (fifofs) PIPE->0x30006508a20
> rsync   26802 root    2u  FIFO 0x30006508b08       0t1   45559 (fifofs) PIPE->0x30006508a20
> rsync   26802 root    3r  DOOR         230,0       0t0   50304 /etc/.name_service_door (door to nscd[322])
> rsync   26802 root    4u  IPv4 0x300086f8188 0t4438856     TCP corona:40000->hite:rsync (ESTABLISHED)
> rsync   26802 root    5r  VDIR          85,0      1024       2 /
> rsync   26802 root    6r  VDIR        32,359       512       2 /export
> rsync   26802 root    7r  VDIR        32,359      1536 7180800 /export/home2
> rsync   26802 root    8r  VDIR        32,359      2048 1139432 /export/home2/ckee
> rsync   26802 root    9r  VDIR        32,359       512 1651495 /export/home2/ckee/plautus
> rsync   26802 root   10r  VDIR        32,359       512 1657831 /export/home2/ckee/plautus/csfb
> rsync   26802 root   11r  VDIR        32,359       512 1664232 /export/home2/ckee/plautus/csfb/Reporting
> rsync   26802 root   12r  VDIR        32,359      3584 1683413 /export/home2/ckee/plautus/csfb/Reporting/csfb
> rsync   26802 root   13r  VDIR        32,359      1536 1709119 /export/home2/ckee/plautus/csfb/Reporting/csfb/Swiss

> COMMAND   PID USER   FD   TYPE        DEVICE  SIZE/OFF   NODE NAME
> rsync   13173 root  cwd   VDIR        32,144       512      2 /export/backup
> rsync   13173 root  rtd   VDIR        32,144       512      2 /export/backup
> rsync   13173 root  txt   VREG          85,0   2358212 135739 / (/dev/md/dsk/d0)
> rsync   13173 root  txt   VREG          85,0   1136748  18499 /usr/lib/libc.so.1
> rsync   13173 root  txt   VREG          85,0    888532  17879 /usr/lib/libnsl.so.1
> rsync   13173 root  txt   VREG          85,0     17096 153092 /usr/platform/sun4u/lib/libc_psr.so.1
> rsync   13173 root  txt   VREG          85,0     24968  17910 /usr/lib/libmp.so.2
> rsync   13173 root  txt   VREG          85,0     70864  17889 /usr/lib/libsocket.so.1
> rsync   13173 root  txt   VREG          85,0    382600  17981 / (/dev/md/dsk/d0)
> rsync   13173 root  txt   VREG          85,0      4628  17924 /usr/lib/libdl.so.1
> rsync   13173 root  txt   VREG          85,0    196996  17902 /usr/lib/ld.so.1
> rsync   13173 root    0u  VCHR          13,2       0t0 306289 /devices/pseudo/mm at 0:null
> rsync   13173 root    1u  VCHR          13,2       0t0 306289 /devices/pseudo/mm at 0:null
> rsync   13173 root    2u  VCHR          13,2       0t0 306289 /devices/pseudo/mm at 0:null
> rsync   13173 root    3w  VCHR          21,0       0t0 306285 /devices/pseudo/log at 0:conslog->LOG
> rsync   13173 root    4r  DOOR         224,0       0t0  22422 /etc/.name_service_door (door to nscd[213])
> rsync   13173 root    5u  IPv4 0x300023cee38 0t4370729    TCP hite->corona:40000 (ESTABLISHED)
> rsync   13173 root    6uw VREG         85,30         0 541128 /export/local/etc/rsync/rsyncd.lock





More information about the rsync mailing list