SIGUSR1 or SIGINT error
Dave Dykstra
dwd at bell-labs.com
Wed Feb 13 02:57:34 EST 2002
On Mon, Feb 11, 2002 at 08:15:46PM -0500, David Birnbaum wrote:
> Dave, et. al,
>
> Well, I decided to burn some disk space and did a truss of the whole
> thing...and caught it in the act. It looks as though the first process
> gets an EPIPE. If memory serves, EPIPE should only occur when the child
> has died on the other side! Which also gets an EPIPE, amazingly enough.
Not exactly -- an EPIPE/SIGPIPE comes from trying to write to a pipe which
has no reader. It's not necessarily a child.
> I'm attaching the relevant truss's, along with the appropriate FD mapping
> to hopefully shed some light on this. It's obvious the client hasn't
> died, since it's still around to receive the error message.
>
> The processes are as follows:
>
> CLIENT SERVER
> corona PID 26802:TCP/40000 -> hite PID 13173:TCP/rsync
> |
> PID 13023
>
> If I understand things, I see 13173 do a write to FD5, which is the TCP
> connection to corona.
I actually don't see a write to FD5 until after first the SIGPIPE is received,
but since it afterwards tries to write to FD5 and gets an EPIPE it's a pretty
good guess that it was a write to FD5 earlier in the truss which was the
culprit.
It seems odd to me that an EPIPE/SIGPIPE is coming from writing to
a TCP connection. I thought those only came from pipes. Ah, from the
Solaris write(2) man page under EPIPE it says
(or to a file descriptor created by socket(3N), using type SOCK_STREAM
that is no longer connected to a peer endpoint)
> It then sends the USR1 kill to 13203, which
> politely dies.
>
> Meanwhile, on corona, we see a write on FD4, which also gets an EPIPE.
> It then dies. It's clear that before the write failed, it was waiting for
> a while (you see the sleeping system calls).
>
> This is reproducable at will (although it takes a bit of disk space to do
> so), so let me know if there's any specific thing that needs to be focused
> in on. I'm especially curious as to which side gets the EPIPE first (and
> how), since we should really see one of the processes dying in some other
> fashion first, unless I misunderstand the man description of EPIPE.
> Solaris bug? Unlikely, but certainly possible (wouldn't be the first
> time, but it's a helloflot harder to track down).
Yes, it would be helpful to know which side gets the EPIPE first. Use the
truss -d option to add a timestamp. That makes a timestamp relative to the
start of the trace on each side so they'll be a little off from each other
but if you start them both about the same time that should tell us which
one was first.
It appears that the TCP connection is getting torn down first. Next time
you post watch the netstat output on both sides to see what's happening
with the TCP queues for that connection just before it dies, especially if
it freezes for a while before dying. Also tell us the complete command
line of the rsync command you're using. Although it may seem unlikely, a
Solaris bug is a possibility because rsync greatly stresses TCP
implementations and has exposed many TCP bugs in the past. What version of
Solaris is running on each side, and are they up to date on their patches?
If netstat shows data in a send queue that is not being sent, that points
to a TCP problem.
- Dave Dykstra
...
> 13203: write(9, " ( c o m p a t i b l e ;".., 1024) = 1024
> 13203: write(9, " ".., 1024) = 1024
> 13203: write(9, " n d o w s 9 8 ; D i".., 448) = 448
> 13203: poll(0xFFBECF20, 1, 60000) = 1
> 13203: read(5, "05F5FFFF", 4) = 4
> 13173: Received signal #13, SIGPIPE [caught]
> 13203: write(9, " z i l l a / 4 . 6 1 (".., 1024) = 1024
> 13173: sigaction(SIGUSR1, 0xFFBEC2B0, 0xFFBEC330) = 0
> 13173: sigaction(SIGUSR2, 0xFFBEC2B0, 0xFFBEC330) = 0
> 13173: getpid() = 13173 [12263]
> 13173: kill(13203, SIGUSR1) = 0
> 13173: getpid() = 13173 [12263]
> 13173: poll(0xFFBEABB0, 2, 60000) = 1
> 13173: write(5, " B\0\0\b r s y n c e r".., 70) Err#32 EPIPE
> 13173: Received signal #13, SIGPIPE [default]
> 13173: *** process killed ***
> 13203: Received signal #16, SIGUSR1 [caught]
> 13203: siginfo: SIGUSR1 pid=13173 uid=0
> 13203: sigaction(SIGUSR1, 0xFFBED118, 0xFFBED198) = 0
> 13203: sigaction(SIGUSR2, 0xFFBED118, 0xFFBED198) = 0
> 13203: unlink("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/.visitorid.idx.51ayYz") = 0
> 13203: getpid() = 13203 [1]
> 13203: write(10, " A\0\0\b r s y n c e r".., 69) Err#32 EPIPE
> 13203: Received signal #13, SIGPIPE [caught]
> 13203: sigaction(SIGUSR1, 0xFFBEC840, 0xFFBEC8C0) = 0
> 13203: sigaction(SIGUSR2, 0xFFBEC840, 0xFFBEC8C0) = 0
> 13203: write(10, " A\0\0\b r s y n c e r".., 69) Err#32 EPIPE
> 13203: Received signal #13, SIGPIPE [default]
> 13203: *** process killed ***
...
> 26802: poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 26802: poll(0xFFBE5CA0, 1, 60000) = 1
> 26802: write(4, "8DF3FFFF", 4) Err#32 EPIPE
> 26802: Received signal #13, SIGPIPE [caught]
> 26802: sigaction(SIGUSR1, 0xFFBE5838, 0xFFBE58B8) = 0
> 26802: sigaction(SIGUSR2, 0xFFBE5838, 0xFFBE58B8) = 0
> 26802: write(2, " r s y n c e r r o r :".., 66) = 66
> 26802: mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF310000
> 26802: munmap(0xFF310000, 8192) = 0
> 26802: llseek(0, 0, SEEK_CUR) = 4322
> 26802: _exit(20)
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> rsync 26802 root cwd VDIR 85,0 1024 2 /
> rsync 26802 root txt VREG 32,359 2358212 5901093 /export/opt/rsync/2.5.1/bin/rsync
> rsync 26802 root txt VREG 85,0 1136748 25111 /usr/lib/libc.so.1
> rsync 26802 root txt VREG 85,0 888532 25069 /usr/lib/libnsl.so.1
> rsync 26802 root txt VREG 85,0 17096 148997 /usr/platform/sun4u/lib/libc_psr.so.1
> rsync 26802 root txt VREG 85,0 24968 25028 /usr/lib/libmp.so.2
> rsync 26802 root txt VREG 85,0 70864 25230 /usr/lib/libsocket.so.1
> rsync 26802 root txt VREG 85,0 382600 24999 / (/dev/md/dsk/d0)
> rsync 26802 root txt VREG 85,0 4628 25116 /usr/lib/libdl.so.1
> rsync 26802 root txt VREG 85,0 196996 25279 /usr/lib/ld.so.1
> rsync 26802 root 0u VCHR 24,52 0t4256 318599 /devices/pseudo/pts at 0:52->ttcompat->ldterm->ptem->pts
> rsync 26802 root 1u FIFO 0x30006508b08 0t1 45559 (fifofs) PIPE->0x30006508a20
> rsync 26802 root 2u FIFO 0x30006508b08 0t1 45559 (fifofs) PIPE->0x30006508a20
> rsync 26802 root 3r DOOR 230,0 0t0 50304 /etc/.name_service_door (door to nscd[322])
> rsync 26802 root 4u IPv4 0x300086f8188 0t4438856 TCP corona:40000->hite:rsync (ESTABLISHED)
> rsync 26802 root 5r VDIR 85,0 1024 2 /
> rsync 26802 root 6r VDIR 32,359 512 2 /export
> rsync 26802 root 7r VDIR 32,359 1536 7180800 /export/home2
> rsync 26802 root 8r VDIR 32,359 2048 1139432 /export/home2/ckee
> rsync 26802 root 9r VDIR 32,359 512 1651495 /export/home2/ckee/plautus
> rsync 26802 root 10r VDIR 32,359 512 1657831 /export/home2/ckee/plautus/csfb
> rsync 26802 root 11r VDIR 32,359 512 1664232 /export/home2/ckee/plautus/csfb/Reporting
> rsync 26802 root 12r VDIR 32,359 3584 1683413 /export/home2/ckee/plautus/csfb/Reporting/csfb
> rsync 26802 root 13r VDIR 32,359 1536 1709119 /export/home2/ckee/plautus/csfb/Reporting/csfb/Swiss
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> rsync 13173 root cwd VDIR 32,144 512 2 /export/backup
> rsync 13173 root rtd VDIR 32,144 512 2 /export/backup
> rsync 13173 root txt VREG 85,0 2358212 135739 / (/dev/md/dsk/d0)
> rsync 13173 root txt VREG 85,0 1136748 18499 /usr/lib/libc.so.1
> rsync 13173 root txt VREG 85,0 888532 17879 /usr/lib/libnsl.so.1
> rsync 13173 root txt VREG 85,0 17096 153092 /usr/platform/sun4u/lib/libc_psr.so.1
> rsync 13173 root txt VREG 85,0 24968 17910 /usr/lib/libmp.so.2
> rsync 13173 root txt VREG 85,0 70864 17889 /usr/lib/libsocket.so.1
> rsync 13173 root txt VREG 85,0 382600 17981 / (/dev/md/dsk/d0)
> rsync 13173 root txt VREG 85,0 4628 17924 /usr/lib/libdl.so.1
> rsync 13173 root txt VREG 85,0 196996 17902 /usr/lib/ld.so.1
> rsync 13173 root 0u VCHR 13,2 0t0 306289 /devices/pseudo/mm at 0:null
> rsync 13173 root 1u VCHR 13,2 0t0 306289 /devices/pseudo/mm at 0:null
> rsync 13173 root 2u VCHR 13,2 0t0 306289 /devices/pseudo/mm at 0:null
> rsync 13173 root 3w VCHR 21,0 0t0 306285 /devices/pseudo/log at 0:conslog->LOG
> rsync 13173 root 4r DOOR 224,0 0t0 22422 /etc/.name_service_door (door to nscd[213])
> rsync 13173 root 5u IPv4 0x300023cee38 0t4370729 TCP hite->corona:40000 (ESTABLISHED)
> rsync 13173 root 6uw VREG 85,30 0 541128 /export/local/etc/rsync/rsyncd.lock
More information about the rsync
mailing list