Fixed: OpLocks caused the corruptions/slowness (Was: How Samba let us down)

Bryan J. Smith b.j.smith at ieee.org
Fri Nov 22 22:36:04 GMT 2002


Quoting Russell Senior <seniorr at aracnet.com>:
> I *still* don't understand how flaky hardware could be the problem.
> TCP connections are supposed to be reliable.  If flaky hardware is
> eating packets, then surely the sender, failing to get a timely ACK
> will resend?

Yes, you've begun to understand some of the problems.  You get lots of drops,
even more resends, and eventually timeouts.  The reliability of TCP can actually
aggravate the problem to the point of total ineffectiveness.
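
To put rough numbers on that, here is a minimal sketch of a retransmission
timer that doubles after every loss.  The 1 second starting timeout, the 60
second cap, the 5 retries and the assumption that losses are independent are
all illustrative choices, not taken from any particular TCP stack (flaky
hardware tends to drop in bursts, which is worse).

```python
def retransmit_schedule(initial_rto=1.0, max_rto=60.0, max_retries=5):
    """Cumulative time (seconds) at which each retransmission goes out.

    Assumed model: the timeout doubles after every loss, up to max_rto.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(max_retries):
        elapsed += rto          # wait out the current timeout
        times.append(elapsed)   # then resend
        rto = min(rto * 2, max_rto)
    return times


def expected_delay(loss_rate, initial_rto=1.0, max_rto=60.0, max_retries=5):
    """Expected extra delay per segment (seconds) from retransmissions.

    Assumes each transmission is lost independently with probability
    loss_rate; segments that exhaust all retries are not counted.
    """
    schedule = retransmit_schedule(initial_rto, max_rto, max_retries)
    delay = 0.0
    for k, t in enumerate(schedule, start=1):
        # Exactly k losses followed by one success.
        delay += (loss_rate ** k) * (1.0 - loss_rate) * t
    return delay
```

With these assumed defaults, retransmit_schedule() gives resends at 1, 3, 7,
15 and 31 seconds, so a segment unlucky enough to be dropped a few times in a
row stalls the connection for a long while even though nothing is "down".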

The first thing one should always do when issues arise is visit the ol' wiring
closet and look at those hub/switch ports.  If traffic or collisions are high,
then you should start investigating the hardware.
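
The same kind of check can be made from the host side.  The sketch below
assumes a Linux box and parses text in the /proc/net/dev layout to pull out
the error, drop and collision counters per interface.  The helper names and
the 1% threshold are made up for illustration; what counts as "high" depends
on the network.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev style text into per-interface counters."""
    stats = {}
    for line in text.splitlines()[2:]:   # first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        if len(fields) < 16:
            continue
        # Columns: 8 receive counters followed by 8 transmit counters.
        stats[name.strip()] = {
            "rx_packets": fields[1],
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_packets": fields[9],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
            "collisions": fields[13],
        }
    return stats


def suspicious(stats, threshold=0.01):
    """Names of interfaces whose bad-event ratio exceeds threshold.

    The threshold is an arbitrary illustrative value, not a standard.
    """
    flagged = []
    for name, s in stats.items():
        total = s["rx_packets"] + s["tx_packets"]
        bad = (s["rx_errs"] + s["rx_drop"] + s["tx_errs"]
               + s["tx_drop"] + s["collisions"])
        if total > 0 and bad / total > threshold:
            flagged.append(name)
    return sorted(flagged)


def read_net_dev(path="/proc/net/dev"):
    """Read the live counters (Linux only)."""
    with open(path) as f:
        return parse_net_dev(f.read())
```

On a Linux host, suspicious(read_net_dev()) would list the interfaces worth a
closer look.  It only complements the port counters on the hub/switch, since
it shows what this one NIC saw and not what the rest of the wire saw.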

The flip side is that the network card may simply not be receiving reliably,
while there is not enough traffic to produce the symptoms above.  That has
happened to me several times.  When I start getting drops in NFS/Samba, I
replace the NIC with the exact same model the next time I can take the system
down.

You'd be surprised how much NICs are affected by heat, and as they age, they
become more susceptible to it.  I _have_ solved such issues with more active
cooling on my case and/or DIP heat sinks.  Yes, even those little 84/100-pin
QFP/TFP chips can overheat.  The whole PCI/AGP card arrangement makes it worse,
with the chips under the PCB in a vertical tower.

> I can understand a flaky client getting the break and
> not responding with the appropriate action, but I don't understand how
> a hardware level problem can break TCP communication, except in the
> obvious and persistent way of not having a connection at all.

All it takes is intermittent drops for it to result in the same thing.

> Please someone, wump me with a clue stick.

The issue is that if you talk about hardware issues, you must then look at the
actual hardware level (the physical layer) of the OSI model.  There's not much
tolerance there, despite what the higher levels do to compensate.

> I have seen the problem on a small (two or three host) 100Mbit
> switched network.

-- 
Bryan J. Smith, E.I.             Contact Info:  http://thebs.org
A+/i-Net+/Linux+/Network+/Server+  CCNA CIWA CNA SCSA/SCWSE/SCNA
----------------------------------------------------------------
The reason why Microsoft is so successful is because they fooled
an overwhelming majority of people into thinking there is such a
thing as a "free lunch."  The DOJ case has arisen from those
same consumers who won't accept the fact that they were respon-
sible for locking themselves into only one company's products.



