[clug] IDE Stalling On nForce2 Chipset?
Alex Satrapa
grail at goldweb.com.au
Tue Aug 10 10:57:24 GMT 2004
I've had problems with an nVidia nForce 2 based chipset stalling during
writes to the ATA drives. The system in particular that's causing the
most grief is our main LDAP server, but the times it appears to choke
are when rsync is running (we do snapshot backups between three
machines).
A typical symptom is that the output of "vmstat 1" will scroll by with
a new row every second (as expected), showing output like this:
 r  b swpd   free   buff  cache si so  bi    bo    in   cs us sy  id wa
 2  0    0   9408 203868 488132  0  0   0  1996   719 1339  0  0 100  0
 1  0    0   9328 203868 488212  0  0   0  1936   716 1332  0  0 100  0
 1  0    0   9356 203868 488184  0  0   0  1956   747 1346  0  0 100  0
then, seemingly out of nowhere, the output will stall for 10 to 20
seconds, after which vmstat "catches up":
 2  1    0  12452 204156 483252  0  0 508 19576 10570 4056  0  1  99  0
 2  2    0  11212 204932 483708  0  0 776  3992  3037 2979  2  3  95  0
 0  4    0  10500 205592 483756  0  0 660  3416   793 1025  0  3  97  0
10 thousand interrupts per second? I don't think so!
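One possible reading of that 10570 figure (an assumption, not something verified against the vmstat source) is that vmstat divides the counter delta by its nominal one-second interval, so a row printed after a stall carries several seconds' worth of interrupts. At the usual rate of about 720 per second, 10570 would correspond to roughly 14.7 seconds, which is in the same range as the observed stall. A minimal sketch for checking this is to sample the `intr` and `ctxt` totals in /proc/stat and divide by the real elapsed time instead. The function names and the `stall_factor` threshold here are made up for illustration.

```python
import time


def read_counters(stat_text):
    """Return (interrupts, context_switches) totals from /proc/stat text."""
    intr = ctxt = None
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "intr":
            intr = int(fields[1])  # first number is the total since boot
        elif fields[0] == "ctxt":
            ctxt = int(fields[1])
    return intr, ctxt


def rates(prev, curr, elapsed):
    """Per-second rates over the measured time between two samples."""
    return tuple((c - p) / elapsed for p, c in zip(prev, curr))


def snapshot():
    with open("/proc/stat") as f:
        return read_counters(f.read())


def watch(interval=1.0, stall_factor=3.0):
    """Print rates each interval; flag samples that arrived late.

    stall_factor is an arbitrary illustrative threshold.
    """
    last, last_time = snapshot(), time.monotonic()
    while True:
        time.sleep(interval)
        curr, now = snapshot(), time.monotonic()
        elapsed = now - last_time
        intr_rate, ctxt_rate = rates(last, curr, elapsed)
        flag = "  <-- late sample" if elapsed > stall_factor * interval else ""
        print("%5.1fs  in/s=%.0f  cs/s=%.0f%s"
              % (elapsed, intr_rate, ctxt_rate, flag))
        last, last_time = curr, now


if __name__ == "__main__":
    watch()
```

If the rates stay flat across a late sample, the spike in vmstat is an averaging artifact rather than a real interrupt storm.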
Running Linux 2.4.26. lspci:
0000:00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev a2)
0000:00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 1 (rev a2)
0000:00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev a2)
0000:00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev a2)
0000:00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev a2)
0000:00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev a2)
0000:00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a4)
0000:00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
0000:00:02.0 USB Controller: nVidia Corporation nForce2 USB Controller (rev a4)
0000:00:02.1 USB Controller: nVidia Corporation nForce2 USB Controller (rev a4)
0000:00:02.2 USB Controller: nVidia Corporation nForce2 USB Controller (rev a4)
0000:00:04.0 Ethernet controller: nVidia Corporation nForce2 Ethernet Controller (rev a1)
0000:00:05.0 Multimedia audio controller: nVidia Corporation nForce MultiMedia audio [Via VT82C686B] (rev a2)
0000:00:06.0 Multimedia audio controller: nVidia Corporation nForce2 AC97 Audio Controler (MCP) (rev a1)
0000:00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev a3)
0000:00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
0000:00:0d.0 FireWire (IEEE 1394): nVidia Corporation nForce2 FireWire (IEEE 1394) Controller (rev a3)
0000:00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev a2)
0000:01:06.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
0000:02:00.0 VGA compatible controller: nVidia Corporation NV18 [GeForce 4MX - nForce GPU] (rev a3)
There is nothing in syslog, dmesg or kern.log that would explain what's
going on. I've searched for "nForce2 stalling", with and without "Linux"
in there, and all I get is a bunch of people making statements like
"check in the device manager to see if DMA is turned on". That made me
think about hdparm, but unless hdparm is lying to me, everything's fine
as far as DMA is concerned:
(this disk is a Seagate drive, Model=ST3120026A, FwRev=3.06)
/dev/ide/host0/bus0/target0/lun0/disc:
multcount = 16 (on)
IO_support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 14593/255/63, sectors = 234441648, start = 0
Where else should I look to find hints as to what is actually going on
with this hardware? It's driving me crazy, especially because this
machine is also the primary LDAP server. When it stalls for 20 seconds,
we end up bouncing two or three email messages, applications will choke
while trying to write to the shared folders, and demons spring forth
from people's noses.
Is there perhaps some magic in /proc that I'm missing out on? I can't
find anything about limiting the number of context switches, but I can
find plenty of references to system calls that tell the application how
many voluntary and involuntary context switches have occurred (I'm
guessing voluntary due to calling system functions that require a
switch into kernel space, and involuntary due to the scheduler booting
the task out to the run queue?)
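For reference, the per-process counters described above are exposed by getrusage(2) as `ru_nvcsw` (voluntary) and `ru_nivcsw` (involuntary). The sketch below reads them for the current process through Python's `resource` module. Whether a 2.4.26 kernel actually fills in these fields is an assumption that has not been checked here, and the helper name `context_switches` is made up for illustration.

```python
import resource
import time


def context_switches():
    """Return (voluntary, involuntary) context switches for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


if __name__ == "__main__":
    before = context_switches()
    time.sleep(0.1)  # sleeping gives up the CPU, normally a voluntary switch
    after = context_switches()
    print("voluntary:   %d -> %d" % (before[0], after[0]))
    print("involuntary: %d -> %d" % (before[1], after[1]))
```

These numbers only describe one process, so they show how a task is being scheduled but not the system-wide "cs" column that vmstat reports.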
Is the vmstat output telling me that the number of context switches is
just getting too high, and some vital interrupt is getting masked for
too long?
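One way to narrow down which interrupt line is involved would be to diff two snapshots of /proc/interrupts and rank the IRQs by how much they grew. This is only a sketch: the helper names are invented, and it assumes the usual layout of one header row of CPU names followed by "label: count [count ...] controller device" rows.

```python
import time


def parse_interrupts(text):
    """Map each IRQ label to its count summed over all CPU columns."""
    counts = {}
    for line in text.splitlines()[1:]:  # first row is the CPU header
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        total = 0
        for token in rest.split():
            if not token.isdigit():
                break  # reached the controller / device name columns
            total += int(token)
        counts[label.strip()] = total
    return counts


def busiest(before, after, top=5):
    """Return the IRQs with the largest increase between two snapshots."""
    deltas = {irq: after[irq] - before.get(irq, 0) for irq in after}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]


def snapshot():
    with open("/proc/interrupts") as f:
        return parse_interrupts(f.read())


if __name__ == "__main__":
    first = snapshot()
    time.sleep(1.0)
    for irq, delta in busiest(first, snapshot()):
        print("IRQ %-4s +%d" % (irq, delta))
```

Run in a loop alongside rsync, this would show whether the IDE interrupt, the timer, or something else accounts for the jump.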
Any hints would be greatly appreciated!
"You are trapped in a maze of screens and ssh sessions all alike."
"It is dark, and you are likely to log off the wrong account."