SIGUSR1 or SIGINT error
Dave Dykstra
dwd at bell-labs.com
Sat Feb 23 06:14:17 EST 2002
On Fri, Feb 22, 2002 at 12:10:52PM -0500, David Birnbaum wrote:
> On Fri, 22 Feb 2002, Dave Dykstra wrote:
>
> > [adding the mailing list to the Cc after David sent me trusses and other
> > info about his hang. I have trimmed down the tracing info to the relevant
> > pieces.]
> >
> > On Tue, Feb 19, 2002 at 11:51:27PM -0500, David Birnbaum wrote:
> > > I wasn't able to get a netstat this time. However, I should be able to
> > > grab it again if these truss's don't help. I'm thinking I will run a
> > > snoop the next time and capture the packet traffic to see what's happening
> > > at the TCP level - at least we should be able to see who/what shut down
> > > first, because from the truss, it looks like the kernel shut down the
> > > whole thing at once. It's tough to time the netstat because it takes a
> > > while for it to die.
> >
> > The client is doing nothing complex, it's just merrilly blasting away down
> > file descriptor 4 and never reading it. It then hangs and the only reason
> > it wakes up is because the connection has been torn down.
> >
> > The server process looks very odd. It is doing a 196K read from a file, and
> > it takes 9 minutes to return from that, and when it wakes up from that it
> > finds that an EPIPE from the connection to the rsync client. Why would a
> > 196K read take 9 minutes? It looks like it is going over NFS, so maybe the
> > problem is with NFS. What's the network between the NFS server and the
> > rsync server? Is it possible to run rsync on the NFS server itself, or is
> > that a network appliance or something? It could be that the TCP
> > implementation on the client side wasn't willing to wait for a response for
> > 9 minutes and that's why it shut things down with an EPIPE.
> >
> > At the same time, the server child process is doing a 262K read from another
> > file over the same NFS connection and it takes the same 9 minutes and
> > eventually suceeds. It only dies because the other server process had died
> > and it gets a signal.
>
> What makes you think it's going over NFS? Everything being backed up is
> local, and the backup disk is local...as best as I can tell. The lsof
> stuff all points to local devices as well.
I just thought /export was likely to be a remote disk, and I couldn't imagine
it taking 9 minutes to read a local disk.
Hold on... I misread the trace. I just did some simple tests and I found
that truss timestamps are actually the time that an operation completes,
not the time they start. So the long delays are in the operations
following the reads, not the reads themselves. That changes everything.
New analysis: the client (the sender) is waiting on a poll to write to the
network. The server main process (the generator) is waiting on a poll to
write to the network in the other direction and possibly also to read from
the socketpair inter-process communication from it's child process (the
receiver). If you use truss "-vpoll" it should tell you exactly what's
being waited on. The server child process (the receiver) is waiting on
a 928 byte write to the file being built on the disk! It gets stranger and
stranger. Is the disk almost full or something?
> The network is a 100MB switched network, and while the backup server is
> busy at night doing the backups, when I was running the truss neither
> server was particularly busy.
>
> The plot thickens...it's worked two days in a row now. But, the next time
> it fails I will get a snoop and see if TCP lends any clues. Is it
> possible that there's some sort of memory clearing, cleanup, or other
> event going on that could cause the occassional lapse like this? We
> certainly haven't seen any other networking problems that would explain
> this.
It still looks like a disk problem and not a network problem, so I don't
think a snoop is going to help you.
- Dave Dykstra
> Content-Description: client lsof
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
...
> rsync 20039 root 4u IPv4 0x3000dbd3810 0t64514407 TCP corona.chelsea.net:51816->hite.chelsea.net:rsync (ESTABLISHED)
> rsync 20039 root 5r VREG 32,359 52273152 243313 /export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.dat
>
> Content-Description: client truss
...
> 3504.6000 write(4, " w s 9 8 ; V Z _ I E".., 1021) = 1021
> 3504.6003 poll(0xFFBE5CA0, 1, 60000) = 1
> 3504.6005 write(4, "03FCFFFF", 4) = 4
> 3504.6010 poll(0xFFBE5CA0, 1, 60000) = 1
> 3504.6012 write(4, "02FCFFFF", 4) = 4
> 3504.6016 poll(0xFFBE5CA0, 1, 60000) = 1
> 3504.6019 write(4, "01FCFFFF", 4) = 4
> 3504.6034 poll(0xFFBE5CA0, 1, 60000) = 1
> 3504.6036 write(4, "E306\0\0", 4) = 4
> 3504.6041 poll(0xFFBE5D18, 1, 60000) = 1
> 3504.6046 write(4, " s t 1 0 3 . t n t 4 . r".., 1763) = 1763
> poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 3564.6045 poll(0xFFBE5CA0, 1, 60000) = 0
> poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 3624.6045 poll(0xFFBE5CA0, 1, 60000) = 0
> poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 3684.6045 poll(0xFFBE5CA0, 1, 60000) = 0
> poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 3744.6046 poll(0xFFBE5CA0, 1, 60000) = 0
> poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 3804.6046 poll(0xFFBE5CA0, 1, 60000) = 0
> poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 3864.6044 poll(0xFFBE5CA0, 1, 60000) = 0
> poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 3924.6041 poll(0xFFBE5CA0, 1, 60000) = 0
> poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 3984.6041 poll(0xFFBE5CA0, 1, 60000) = 0
> poll(0xFFBE5CA0, 1, 60000) (sleeping...)
> 4037.3040 poll(0xFFBE5CA0, 1, 60000) = 1
> 4037.3046 write(4, "FEFBFFFF", 4) Err#32 EPIPE
> 4037.3055 Received signal #13, SIGPIPE [caught]
> 4037.3064 sigaction(SIGUSR1, 0xFFBE5838, 0xFFBE58B8) = 0
> 4037.3071 sigaction(SIGUSR2, 0xFFBE5838, 0xFFBE58B8) = 0
> 4037.3074 write(2, " r s y n c e r r o r :".., 66) = 66
> 4037.3084 mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF310000
> 4037.3088 munmap(0xFF310000, 8192) = 0
> 4037.3091 llseek(0, 0, SEEK_CUR) = 583129
> 4037.3094 _exit(20)
>
>
> Content-Description: server rsync lsof
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
...
> rsync 27948 root 5u IPv4 0x30000a59710 0t64477429 TCP hite.chelsea.net:rsync->corona.chelsea.net:51816 (ESTABLISHED)
> rsync 27948 root 6uw VREG 85,30 0 541128 /export (/dev/md/dsk/d30)
> rsync 27948 root 7r VREG 32,144 19062784 5347011 /export/backup -- export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.idx
>
> Content-Description: server truss
...
> 3497.9255 lstat64("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/resistance-forum-www.thebody.com/page.dat", 0xFFBEE370) = 0
> 3497.9262 open64("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/resistance-forum-www.thebody.com/page.dat", O_RDONLY) = 7
> 3497.9869 read(7, "\002FFB7\0\0\0\0\002E0EF".., 196536) = 196536
> poll(0xFFBEC6E0, 2, 60000) (sleeping...)
> 4037.2043 poll(0xFFBEC6E0, 2, 60000) = 0
> 4037.2049 poll(0xFFBEC6E0, 2, 60000) = 1
> 4037.2052 write(5, "FC0F\0078E hB40698 .84EF".., 4096) Err#32 EPIPE
> 4037.2816 Received signal #13, SIGPIPE [caught]
> 4037.2843 sigaction(SIGUSR1, 0xFFBEC2B0, 0xFFBEC330) = 0
> 4037.2848 sigaction(SIGUSR2, 0xFFBEC2B0, 0xFFBEC330) = 0
> 4037.2851 getpid() = 27948 [27130]
> 4037.3504 kill(28192, SIGUSR1) = 0
> 4037.3508 getpid() = 27948 [27130]
> 4037.3514 poll(0xFFBEABB0, 2, 60000) = 1
> 4037.3518 write(5, " B\0\0\b r s y n c e r".., 70) Err#32 EPIPE
> 4037.3523 Received signal #13, SIGPIPE [default]
> *** process killed ***
>
>
> Content-Description: server child lsof
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
...
> rsync 28192 root 5u IPv4 0x30000a59710 0t64477429 TCP hite.chelsea.net:rsync->corona.chelsea.net:51816 (ESTABLISHED)
> rsync 28192 root 6u VREG 85,30 0 541128 /export (/dev/md/dsk/d30)
> rsync 28192 root 7r VREG 32,144 51363840 5347010 /export/backup -- export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.dat
> rsync 28192 root 8u unix 105,36 0t0 306297 /devices/pseudo/tl at 0:ticots->(socketpair: 0x18d400000000) (0x300016fd448)
> rsync 28192 root 9u VREG 32,144 37569724 5347009 /export/backup -- export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/.view.dat.Q9aae3
> rsync 28192 root 10u unix 105,39 0t0 306297 /devices/pseudo/tl at 0:ticots->(socketpair: 0x18d600000000) (0x300016fc368)
>
> Content-Description: server child truss
...
> 3504.2667 poll(0xFFBECF20, 1, 60000) = 1
> 3504.2682 read(5, "D2\t\0\0", 4) = 4
> 3504.2699 poll(0xFFBECF98, 1, 60000) = 1
> 3504.2714 read(5, " I E 6 . 0 ; W i n d".., 2514) = 2514
> 3504.2738 write(9, " I E 6 . 0 ; W i n d".., 1024) = 1024
> 3504.2755 write(9, " . 0 ; W i n d o w s ".., 1024) = 1024
> 3504.2771 write(9, " w e r P C )\0 M9BCE9A 1".., 466) = 466
> 3504.2787 poll(0xFFBECF20, 1, 60000) = 1
> 3504.2800 read(5, "E4E6FFFF", 4) = 4
> 3504.2822 llseek(7, 5898240, SEEK_SET) = 5898240
> 3517.5204 read(7, "\0 Y \0\0\0\0\0\0 E16C5".., 262144) = 262144
> 4037.2227 write(9, " d o w s 9 5 )\0 T\v C".., 928) = 928
> 4037.2233 poll(0xFFBECF20, 1, 60000) = 1
> 4037.2238 read(5, "E3E6FFFF", 4) = 4
> 4037.2244 write(9, " 4 . 0 ( c o m p a t i".., 928) = 928
> 4037.2258 poll(0xFFBECF20, 1, 60000) = 1
> 4037.2262 read(5, "E2E6FFFF", 4) = 4
> 4037.2269 write(9, " c o m : M o z i l l a /".., 928) = 928
> 4037.2274 poll(0xFFBECF20, 1, 60000) = 1
> 4037.2278 read(5, "0E1A\0\0", 4) = 4
> 4037.2283 poll(0xFFBECF98, 1, 60000) = 1
> 4037.2288 read(5, " M o z i l l a / 4 . 0 ".., 6670) = 6670
> 4037.2304 write(9, " M o z i l l a / 4 . 0 ".., 1024) = 1024
> 4037.3011 write(9, " . c o m : M o z i l l a".., 1024) = 1024
> 4037.3017 write(9, " 2 4 7 - 2 9 - 0 . c l i".., 1024) = 1024
> 4037.3022 write(9, " 1 ; Q 3 1 2 4 6 1 )\0".., 1024) = 1024
> 4037.3028 write(9, " M S I E C r a w l e r )".., 1024) = 1024
> 4037.3079 llseek(9, 1, SEEK_CUR) = 585779
> 4037.3085 write(9, " WB7 lA0 1 2 - 2 5 5 - 1".., 1023) = 1023
> 4037.3125 write(9, " 0 7 2 1 2 0 0 0 h . 0 8".., 526) = 526
> 4037.3636 poll(0xFFBECF20, 1, 60000) = 1
> 4037.3642 read(5, " %DFFFFF", 4) = 4
> 4037.3650 llseek(7, 7733248, SEEK_SET) = 7733248
> 4037.6213 read(7, "\01A @\0\01A \0\0 6109E".., 262144) = 262144
> 4037.6229 Received signal #16, SIGUSR1 [caught]
> siginfo: SIGUSR1 pid=27948 uid=0
> 4037.6234 sigaction(SIGUSR1, 0xFFBED188, 0xFFBED208) = 0
> 4037.6238 sigaction(SIGUSR2, 0xFFBED188, 0xFFBED208) = 0
> 4037.6913 unlink("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/oral-forum-www.thebody.com/.visitorid.idx.Lkbae3") = 0
> 4037.6921 getpid() = 28192 [1]
> 4037.6957 write(10, " A\0\0\b r s y n c e r".., 69) Err#32 EPIPE
> 4037.6977 Received signal #13, SIGPIPE [caught]
> 4037.6982 sigaction(SIGUSR1, 0xFFBEC8B0, 0xFFBEC930) = 0
> 4037.7016 sigaction(SIGUSR2, 0xFFBEC8B0, 0xFFBEC930) = 0
> 4037.7021 write(10, " A\0\0\b r s y n c e r".., 69) Err#32 EPIPE
> 4037.7026 Received signal #13, SIGPIPE [default]
> *** process killed ***
More information about the rsync
mailing list