SIGUSR1 or SIGINT error

Dave Dykstra dwd at bell-labs.com
Sat Feb 23 06:14:17 EST 2002


On Fri, Feb 22, 2002 at 12:10:52PM -0500, David Birnbaum wrote:
> On Fri, 22 Feb 2002, Dave Dykstra wrote:
> 
> > [adding the mailing list to the Cc after David sent me trusses and other
> > info about his hang.  I have trimmed down the tracing info to the relevant
> > pieces.]
> >
> > On Tue, Feb 19, 2002 at 11:51:27PM -0500, David Birnbaum wrote:
> > > I wasn't able to get a netstat this time.  However, I should be able to
> > > grab it again if these truss's don't help.  I'm thinking I will run a
> > > snoop the next time and capture the packet traffic to see what's happening
> > > at the TCP level - at least we should be able to see who/what shut down
> > > first, because from the truss, it looks like the kernel shut down the
> > > whole thing at once.  It's tough to time the netstat because it takes a
> > > while for it to die.
> >
> > The client is doing nothing complex, it's just merrilly blasting away down
> > file descriptor 4 and never reading it.  It then hangs and the only reason
> > it wakes up is because the connection has been torn down.
> >
> > The server process looks very odd.  It is doing a 196K read from a file, and
> > it takes 9 minutes to return from that, and when it wakes up from that it
> > finds that an EPIPE from the connection to the rsync client.  Why would a
> > 196K read take 9 minutes?  It looks like it is going over NFS, so maybe the
> > problem is with NFS.  What's the network between the NFS server and the
> > rsync server?  Is it possible to run rsync on the NFS server itself, or is
> > that a network appliance or something?  It could be that the TCP
> > implementation on the client side wasn't willing to wait for a response for
> > 9 minutes and that's why it shut things down with an EPIPE.
> >
> > At the same time, the server child process is doing a 262K read from another
> > file over the same NFS connection and it takes the same 9 minutes and
> > eventually suceeds.  It only dies because the other server process had died
> > and it gets a signal.
> 
> What makes you think it's going over NFS?  Everything being backed up is
> local, and the backup disk is local...as best as I can tell.  The lsof
> stuff all points to local devices as well.

I just thought /export was likely to be a remote disk, and I couldn't imagine
it taking 9 minutes to read a local disk.

Hold on... I misread the trace.  I just did some simple tests and I found
that truss timestamps are actually the time that an operation completes,
not the time they start.  So the long delays are in the operations
following the reads, not the reads themselves.  That changes everything.

New analysis: the client (the sender) is waiting on a poll to write to the
network.  The server main process (the generator) is waiting on a poll to
write to the network in the other direction and possibly also to read from
the socketpair inter-process communication from it's child process (the
receiver).  If you use truss "-vpoll" it should tell you exactly what's
being waited on.  The server child process (the receiver) is waiting on
a 928 byte write to the file being built on the disk!  It gets stranger and
stranger.  Is the disk almost full or something?



> The network is a 100MB switched network, and while the backup server is
> busy at night doing the backups, when I was running the truss neither
> server was particularly busy.
> 
> The plot thickens...it's worked two days in a row now.  But, the next time
> it fails I will get a snoop and see if TCP lends any clues.  Is it
> possible that there's some sort of memory clearing, cleanup, or other
> event going on that could cause the occassional lapse like this?  We
> certainly haven't seen any other networking problems that would explain
> this.

It still looks like a disk problem and not a network problem, so I don't
think a snoop is going to help you.

- Dave Dykstra





> Content-Description: client lsof
> COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF    NODE NAME
...
> rsync   20039 root    4u  IPv4 0x3000dbd3810 0t64514407     TCP corona.chelsea.net:51816->hite.chelsea.net:rsync (ESTABLISHED)
> rsync   20039 root    5r  VREG        32,359   52273152  243313 /export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.dat
>
> Content-Description: client truss
...
> 3504.6000	write(4, " w s   9 8 ;   V Z _ I E".., 1021)	= 1021
> 3504.6003	poll(0xFFBE5CA0, 1, 60000)			= 1
> 3504.6005	write(4, "03FCFFFF", 4)				= 4
> 3504.6010	poll(0xFFBE5CA0, 1, 60000)			= 1
> 3504.6012	write(4, "02FCFFFF", 4)				= 4
> 3504.6016	poll(0xFFBE5CA0, 1, 60000)			= 1
> 3504.6019	write(4, "01FCFFFF", 4)				= 4
> 3504.6034	poll(0xFFBE5CA0, 1, 60000)			= 1
> 3504.6036	write(4, "E306\0\0", 4)				= 4
> 3504.6041	poll(0xFFBE5D18, 1, 60000)			= 1
> 3504.6046	write(4, " s t 1 0 3 . t n t 4 . r".., 1763)	= 1763
> poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 3564.6045	poll(0xFFBE5CA0, 1, 60000)			= 0
> poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 3624.6045	poll(0xFFBE5CA0, 1, 60000)			= 0
> poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 3684.6045	poll(0xFFBE5CA0, 1, 60000)			= 0
> poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 3744.6046	poll(0xFFBE5CA0, 1, 60000)			= 0
> poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 3804.6046	poll(0xFFBE5CA0, 1, 60000)			= 0
> poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 3864.6044	poll(0xFFBE5CA0, 1, 60000)			= 0
> poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 3924.6041	poll(0xFFBE5CA0, 1, 60000)			= 0
> poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 3984.6041	poll(0xFFBE5CA0, 1, 60000)			= 0
> poll(0xFFBE5CA0, 1, 60000)	(sleeping...)
> 4037.3040	poll(0xFFBE5CA0, 1, 60000)			= 1
> 4037.3046	write(4, "FEFBFFFF", 4)				Err#32 EPIPE
> 4037.3055	    Received signal #13, SIGPIPE [caught]
> 4037.3064	sigaction(SIGUSR1, 0xFFBE5838, 0xFFBE58B8)	= 0
> 4037.3071	sigaction(SIGUSR2, 0xFFBE5838, 0xFFBE58B8)	= 0
> 4037.3074	write(2, " r s y n c   e r r o r :".., 66)	= 66
> 4037.3084	mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF310000
> 4037.3088	munmap(0xFF310000, 8192)			= 0
> 4037.3091	llseek(0, 0, SEEK_CUR)				= 583129
> 4037.3094	_exit(20)
>
>
> Content-Description: server rsync lsof
> COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF    NODE NAME
...
> rsync   27948 root    5u  IPv4 0x30000a59710 0t64477429     TCP hite.chelsea.net:rsync->corona.chelsea.net:51816 (ESTABLISHED)
> rsync   27948 root    6uw VREG         85,30          0  541128 /export (/dev/md/dsk/d30)
> rsync   27948 root    7r  VREG        32,144   19062784 5347011 /export/backup -- export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.idx
>
> Content-Description: server truss
...
> 3497.9255	lstat64("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/resistance-forum-www.thebody.com/page.dat", 0xFFBEE370) = 0
> 3497.9262	open64("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/resistance-forum-www.thebody.com/page.dat", O_RDONLY) = 7
> 3497.9869	read(7, "\002FFB7\0\0\0\0\002E0EF".., 196536)	= 196536
> poll(0xFFBEC6E0, 2, 60000)	(sleeping...)
> 4037.2043	poll(0xFFBEC6E0, 2, 60000)			= 0
> 4037.2049	poll(0xFFBEC6E0, 2, 60000)			= 1
> 4037.2052	write(5, "FC0F\0078E hB40698 .84EF".., 4096)	Err#32 EPIPE
> 4037.2816	    Received signal #13, SIGPIPE [caught]
> 4037.2843	sigaction(SIGUSR1, 0xFFBEC2B0, 0xFFBEC330)	= 0
> 4037.2848	sigaction(SIGUSR2, 0xFFBEC2B0, 0xFFBEC330)	= 0
> 4037.2851	getpid()					= 27948 [27130]
> 4037.3504	kill(28192, SIGUSR1)				= 0
> 4037.3508	getpid()					= 27948 [27130]
> 4037.3514	poll(0xFFBEABB0, 2, 60000)			= 1
> 4037.3518	write(5, " B\0\0\b r s y n c   e r".., 70)	Err#32 EPIPE
> 4037.3523	    Received signal #13, SIGPIPE [default]
> 	*** process killed ***
>
>
> Content-Description: server child lsof
> COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF    NODE NAME
...
> rsync   28192 root    5u  IPv4 0x30000a59710 0t64477429     TCP hite.chelsea.net:rsync->corona.chelsea.net:51816 (ESTABLISHED)
> rsync   28192 root    6u  VREG         85,30          0  541128 /export (/dev/md/dsk/d30)
> rsync   28192 root    7r  VREG        32,144   51363840 5347010 /export/backup -- export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.dat
> rsync   28192 root    8u  unix        105,36        0t0  306297 /devices/pseudo/tl at 0:ticots->(socketpair: 0x18d400000000) (0x300016fd448)
> rsync   28192 root    9u  VREG        32,144   37569724 5347009 /export/backup -- export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/.view.dat.Q9aae3
> rsync   28192 root   10u  unix        105,39        0t0  306297 /devices/pseudo/tl at 0:ticots->(socketpair: 0x18d600000000) (0x300016fc368)
>
> Content-Description: server child truss
...
> 3504.2667	poll(0xFFBECF20, 1, 60000)			= 1
> 3504.2682	read(5, "D2\t\0\0", 4)				= 4
> 3504.2699	poll(0xFFBECF98, 1, 60000)			= 1
> 3504.2714	read(5, " I E   6 . 0 ;   W i n d".., 2514)	= 2514
> 3504.2738	write(9, " I E   6 . 0 ;   W i n d".., 1024)	= 1024
> 3504.2755	write(9, " . 0 ;   W i n d o w s  ".., 1024)	= 1024
> 3504.2771	write(9, " w e r P C )\0 M9BCE9A 1".., 466)	= 466
> 3504.2787	poll(0xFFBECF20, 1, 60000)			= 1
> 3504.2800	read(5, "E4E6FFFF", 4)				= 4
> 3504.2822	llseek(7, 5898240, SEEK_SET)			= 5898240
> 3517.5204	read(7, "\0 Y  \0\0\0\0\0\0 E16C5".., 262144)	= 262144
> 4037.2227	write(9, " d o w s   9 5 )\0 T\v C".., 928)	= 928
> 4037.2233	poll(0xFFBECF20, 1, 60000)			= 1
> 4037.2238	read(5, "E3E6FFFF", 4)				= 4
> 4037.2244	write(9, " 4 . 0   ( c o m p a t i".., 928)	= 928
> 4037.2258	poll(0xFFBECF20, 1, 60000)			= 1
> 4037.2262	read(5, "E2E6FFFF", 4)				= 4
> 4037.2269	write(9, " c o m : M o z i l l a /".., 928)	= 928
> 4037.2274	poll(0xFFBECF20, 1, 60000)			= 1
> 4037.2278	read(5, "0E1A\0\0", 4)				= 4
> 4037.2283	poll(0xFFBECF98, 1, 60000)			= 1
> 4037.2288	read(5, " M o z i l l a / 4 . 0  ".., 6670)	= 6670
> 4037.2304	write(9, " M o z i l l a / 4 . 0  ".., 1024)	= 1024
> 4037.3011	write(9, " . c o m : M o z i l l a".., 1024)	= 1024
> 4037.3017	write(9, " 2 4 7 - 2 9 - 0 . c l i".., 1024)	= 1024
> 4037.3022	write(9, " 1 ;   Q 3 1 2 4 6 1 )\0".., 1024)	= 1024
> 4037.3028	write(9, " M S I E C r a w l e r )".., 1024)	= 1024
> 4037.3079	llseek(9, 1, SEEK_CUR)				= 585779
> 4037.3085	write(9, " WB7 lA0 1 2 - 2 5 5 - 1".., 1023)	= 1023
> 4037.3125	write(9, " 0 7 2 1 2 0 0 0 h . 0 8".., 526)	= 526
> 4037.3636	poll(0xFFBECF20, 1, 60000)			= 1
> 4037.3642	read(5, " %DFFFFF", 4)				= 4
> 4037.3650	llseek(7, 7733248, SEEK_SET)			= 7733248
> 4037.6213	read(7, "\01A @\0\01A  \0\0 6109E".., 262144)	= 262144
> 4037.6229	    Received signal #16, SIGUSR1 [caught]
>       siginfo: SIGUSR1 pid=27948 uid=0
> 4037.6234	sigaction(SIGUSR1, 0xFFBED188, 0xFFBED208)	= 0
> 4037.6238	sigaction(SIGUSR2, 0xFFBED188, 0xFFBED208)	= 0
> 4037.6913	unlink("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/oral-forum-www.thebody.com/.visitorid.idx.Lkbae3") = 0
> 4037.6921	getpid()					= 28192 [1]
> 4037.6957	write(10, " A\0\0\b r s y n c   e r".., 69)	Err#32 EPIPE
> 4037.6977	    Received signal #13, SIGPIPE [caught]
> 4037.6982	sigaction(SIGUSR1, 0xFFBEC8B0, 0xFFBEC930)	= 0
> 4037.7016	sigaction(SIGUSR2, 0xFFBEC8B0, 0xFFBEC930)	= 0
> 4037.7021	write(10, " A\0\0\b r s y n c   e r".., 69)	Err#32 EPIPE
> 4037.7026	    Received signal #13, SIGPIPE [default]
> 	*** process killed ***




More information about the rsync mailing list