SIGUSR1 or SIGINT error

Sat Feb 23 09:07:49 EST 2002

On Fri, 22 Feb 2002, Dave Dykstra wrote:

> > > At the same time, the server child process is doing a 262K read from another
> > > file over the same NFS connection and it takes the same 9 minutes and
> > > eventually succeeds.  It only dies because the other server process had died
> > > and it gets a signal.
> >
> > What makes you think it's going over NFS?  Everything being backed up is
> > local, and the backup disk is local...as best as I can tell.  The lsof
> > stuff all points to local devices as well.
>
> I just thought /export was likely to be a remote disk, and I couldn't imagine
> it taking 9 minutes to read a local disk.

No, a 9 minute read from a local disk doesn't make any sense.

> Hold on... I misread the trace.  I just did some simple tests and I found
> that truss timestamps are actually the time that an operation completes,
> not the time they start.  So the long delays are in the operations
> following the reads, not the reads themselves.  That changes everything.
>
> New analysis: the client (the sender) is waiting on a poll to write to the
> network.  The server main process (the generator) is waiting on a poll to
> write to the network in the other direction and possibly also to read from
> the socketpair inter-process communication from its child process (the
> receiver).  If you use truss "-vpoll" it should tell you exactly what's
> being waited on.  The server child process (the receiver) is waiting on
> a 928 byte write to the file being built on the disk!  It gets stranger and
> stranger.  Is the disk almost full or something?
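
As an aside, the timestamp point above can be sketched in a few lines of
Python.  This is only an illustration: the trace values below are made up
(they are not from the actual truss output), and elapsed_per_call and
slowest_call are hypothetical helper names.  The idea is that if each
timestamp marks when a call completed, the gap before it belongs to that
call and not to the one printed on the previous line.

```python
from typing import List, Tuple

# (completion timestamp in seconds, syscall text); hypothetical sample data
Event = Tuple[float, str]


def elapsed_per_call(events: List[Event]) -> List[Tuple[str, float]]:
    """Charge each gap to the call that completed at the later timestamp."""
    result = []
    for (prev_ts, _), (ts, call) in zip(events, events[1:]):
        result.append((call, ts - prev_ts))
    return result


def slowest_call(events: List[Event]) -> Tuple[str, float]:
    """Return the (call, seconds) pair with the largest elapsed time."""
    return max(elapsed_per_call(events), key=lambda item: item[1])


# Made-up trace: two quick reads, then a small write that finishes
# about 9 minutes (540 seconds) later.
trace = [
    (0.00, "read(3, ..., 262144)"),
    (0.01, "read(3, ..., 262144)"),
    (540.01, "write(4, ..., 928)"),
    (540.02, "poll(...)"),
]

call, seconds = slowest_call(trace)
print(call, round(seconds, 2))
```

Read the other way (timestamps as start times), the same gap would be
blamed on the second read, which is the misreading described above.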

I will add a -vpoll when it starts happening again.  The disk has at least
5GB free when the backups are occurring.  I've also rebooted and run fsck
on the drive (it is clean) so there's no obvious filesystem corruption and
such.  This problem has also happened with one machine or another,
always irregularly, going back a year or two, so I doubt it's a local
disk problem.
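
One rough way to watch for this between failures is to time a small
write to the backup disk and note the free space at the same moment.
The sketch below is an assumption about how such a check might look,
not something that was actually run here: probe_write is a hypothetical
name, the 928 byte size just mirrors the stalled write in the trace, and
the fsync forces the data to disk, which a plain write would not do.

```python
import os
import shutil
import tempfile
import time


def probe_write(directory: str, size: int = 928) -> dict:
    """Time a small write plus fsync in `directory` and report free space."""
    free = shutil.disk_usage(directory).free
    fd, path = tempfile.mkstemp(dir=directory, prefix="probe-")
    try:
        start = time.monotonic()
        os.write(fd, b"\0" * size)
        os.fsync(fd)  # push the data to the device, not just the cache
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.remove(path)  # leave nothing behind on the backup disk
    return {"free_bytes": free, "write_seconds": elapsed}


if __name__ == "__main__":
    # The temp directory stands in for the backup disk's mount point.
    print(probe_write(tempfile.gettempdir()))
```

Run from cron every few minutes during a backup, a sudden jump in
write_seconds would point at the disk rather than the network.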

So far, the only consistencies across the failures are:

1.  All the backups were handling a lot of files and a lot of disk space
    (>5G, more often on 10G or 20G backups)
2.  All the machines tended to have activity going on while the rsync was
    running, possibly on files that rsync was trying to back up.

> > The plot thickens...it's worked two days in a row now.  But, the next time
> > it fails I will get a snoop and see if TCP lends any clues.  Is it
> > possible that there's some sort of memory clearing, cleanup, or other
> > event going on that could cause the occasional lapse like this?  We
> > certainly haven't seen any other networking problems that would explain
> > this.
>
> It still looks like a disk problem and not a network problem, so I don't
> think a snoop is going to help you.

Agreed.  As soon as it starts to fail again, I will look to see what other
information I can get.

David.