Dell (Agere) orinoco gets excessive retries and other probs

Mon Feb 25 15:19:02 EST 2002

Wow!  It's good to get such a thorough and useful bug report.  Sorry
I've taken so long to do anything about it. 

On Mon, Feb 18, 2002 at 12:01:13PM -0800, Jim Carter wrote:
> Problem summary: the card gets Tx status 1 (excessive retries) on every
> packet and says it's at 2 Mb/s however you set it.  It also gets into
> a mode where you ask for one RID and it gives you a different one.
> 
> I have the Dell (Agere ORiNOCO) ``TrueMobile 1150'' internal 802.11b
> wireless NIC (mini-PCI) in my laptop.  (For complete setup details, see
> http://www.math.ucla.edu/~jimc/insp4100/index.html, and under that,
> wireless.html).  It has Agere firmware v6.16.  It behaves as if in a
> third PCMCIA slot, and uses orinoco_cs.o.
> 
> In the server machine is a Linksys WPC11 Network PC Card version 2.5,
> with Intersil firmware v1.00.  It's in their WDT11 PCI adapter, which
> has a PLX PCI9052 PCMCIA bridge chip.  orinoco_plx.o works for it.
> 
> Software on both machines is SuSE 7.3, kernel 2.4.16, ORiNOCO drivers
> 0.09b (taken from 2.4.18pre9 sources but compiled in 2.4.16 background),
> and wireless tools v22 (wireless extensions v12).
> 
> When using driver v0.07, the Linksys card would often get in a mode
> where it got "eth2: Error -110 writing packet header to BAP"
> (ETIMEDOUT)

Yes, I believe this was due to a bug in the IRQ handler (fixed in
0.09) which would wake up the transmit queue on every interrupt
regardless of whether the card was ready to accept another packet.
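
Roughly, the difference looks like this.  This is a simplified model
with made-up names (struct sim_nic and both functions are hypothetical),
not the actual driver code:

```c
#include <stdbool.h>

/* Simplified stand-in for the driver state; field names are hypothetical. */
struct sim_nic {
    bool queue_stopped;   /* the network stack may not hand us packets */
    bool tx_buffer_free;  /* the card can accept another packet */
};

/* Old behaviour: every interrupt restarts the transmit queue. */
static void irq_wake_always(struct sim_nic *nic)
{
    nic->queue_stopped = false;
}

/* Fixed behaviour: restart the queue only when the card has room. */
static void irq_wake_if_ready(struct sim_nic *nic)
{
    if (nic->tx_buffer_free)
        nic->queue_stopped = false;
}
```

With the first variant, an unrelated interrupt (an Rx event, say) can
restart the queue while the card is still busy, so the next packet
write may time out.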

> on most packets.  With driver v0.09b this error rarely appears, and the
> Linksys card gives the impression of functioning optimally.  However, in
> file transfer tests with v0.09b, several tests showed a sequence, all
> with the same timestamp, likely during uploads (transfers from the Dell
> card to the Linksys card), where it reset the Linksys card for no

The only way the card should be reset (other than operator
intervention) is a Tx timeout.  The reset is most likely associated
with the first timeout you see.

> obvious reason, then got about 5 timeouts, then "eth2: Tx error, status
> 4 (FID=010E)" or 0135 or both (only those values), then sometimes "eth2:
> Unknown Rx error (0x3). Frame dropped."

The "Tx error, status 4" messages are a consequence of the reset.  Status 4 is
a "disconnected" error.  This happens because after the reset we
immediately attempt to send queued packets, before the card has had
time to reassociate with the WLAN.
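
For reading the logs, a small decoder for the two status values that
come up in this thread may help.  The macro and function names are
illustrative only; the real definitions live in hermes.h and should be
checked there:

```c
#include <stdint.h>

/* Values as discussed in this thread; names are illustrative. */
#define TXSTAT_RETRY_ERR 0x0001  /* "excessive retries" */
#define TXSTAT_DISCON    0x0004  /* "disconnected" */

/* Map a Tx error status to a readable description. */
static const char *tx_status_name(uint16_t status)
{
    switch (status) {
    case TXSTAT_RETRY_ERR:
        return "excessive retries";
    case TXSTAT_DISCON:
        return "disconnected";
    default:
        return "unknown";
    }
}
```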

> When using driver v0.07, the Dell card gave the impression of
> functioning perfectly.  It would associate with the Linksys card and
> with various Cisco (Aironet) access points, and would transfer data with
> either one at 11 Mb/s (more like 5 Mb/s judging from timing file
> transfers, which I understand is typical for 802.11b).  It never (?)
> got

Yes, because of ACKs and other overheads in 802.11, about 5 Mb/s is the
maximum throughput you're likely to see at the TCP level.

> ETIMEDOUT.  (This error was seen twice, over many days experience, with
> driver 0.09b.)

Yes, the Agere firmwares appear to have been more resilient to the bug
which caused the ETIMEDOUT errors.  Do you mean you have seen
ETIMEDOUT twice since switching to 0.09b?  Two in several days' use is
interesting, but probably not significant.

> Using driver v0.09b, the Dell card now has three problems.  First, it
> will only associate at 2 Mb/s with the Linksys card and the AP's (per
> CURRENTTXRATE and verified by remote -> Dell file transfers).  Setting
> TXRATECONTROL on the Dell and Linksys cards to all combinations of legal
> values leads to the Dell card claiming to be at 2 Mb/s and the Linksys
> card being at whatever you set it to, except not 11 Mb/s.  Here are some
> samples; all rates are in Mb/s.  The file transferred was 1.25 MB (10^7
> bits).  "Download" means Linksys -> Dell.  SNR = 29 dB and there were no
> competing stations.
> 
> Dell rate    Linksys rate   File xfer (sec)     Download  Ping RTT
> Set   Says   Set   Says     Upload   Download   (Mb/s)    (msec)
> 1     2      1     1        51       16         0.6       0.550, 3.8 alternating
> 1     2      2     2        50       11         0.9       4.0-5.6
> 1     2      5.5   5.5      50       7          1.4       3.0-6.1
> 1     2      11    5.5      (not tested)
> 2     2      11    5.5      29       6          1.7       2.6-4.0
> 5.5   2      11    5.5      15       6          1.7       2.3-3.9
> 11    2      11    5.5      11       6          1.7       2.4-3.8
> 
> A file transfer test from a Cisco access point gave a download rate of
> 2.2 Mb/s, but the rate was likely overestimated and 1.4 to 1.7 Mb/s is
> probably closer to reality.  A file transfer from the Linksys card
> (under Linux) to the Dell card under WinXP gave a download rate of 1.2
> Mb/s, claiming the card was at 1 Mb/s.  Uploads were similarly slow as
> in the above table.
> 
> Evidently the Linksys card is sending data at the rate to which it was
> set (up to 5.5 Mb/s).  So why is the Dell card ignoring TXRATECONTROL
> entirely?

I'm aware that there are problems with controlling the Tx rate.
Unfortunately there are several parameters on the card which influence
this, and I do not have documentation, which makes tracking the problem
down tricky.  The information you've got here will certainly help.

At an initial guess, I suspect the Dell card may be transmitting at
the rates set, but the Linksys card is not accepting packets at the
higher rates, causing the Dell to fall back to lower rates (which may
also be the cause of the "status 1" errors you mention below).
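
For reference, the download-rate figures in your table follow directly
from the file size (10^7 bits) and the transfer time.  A minimal sketch
of that arithmetic (the function name is made up for illustration):

```c
/* Effective throughput in Mb/s for a transfer of `bits` bits
 * that took `seconds` seconds. */
static double throughput_mbps(double bits, double seconds)
{
    return bits / seconds / 1e6;
}
```

For example, throughput_mbps(1e7, 16) gives 0.625, which matches the
0.6 Mb/s in the first row, and the 11 second upload in the last row
works out to roughly 0.9 Mb/s.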

> The second problem, probably the cause of the first, is that every
> packet sent from the Dell card gets "Tx error, status 1" which is
> "excessive retries".  (If one or two packets escaped the error, I
> couldn't tell, but over 99% of the packets are affected.)  Every (?)
> packet arrived, only once, at the Linksys card.  I can't prove that
> there were no duplicates, but ping doesn't show any, and also shows no
> lost packets.  Occasional packets during file transfers arrived trashed
> at the Linksys card, e.g. it says "undecryptable" or "misc error" --
> less than 0.1%.  The behavior is identical with 104-bit WEP, 40-bit WEP,
> or no WEP.

Ok.  Again, very useful data, but I don't have a conclusion yet.

> So, judging from the results above, the Dell card sends data at the speed
> to which it was set, and the data arrives, but the card believes it has to
> resend, possibly falling back to lower speeds, which takes time.  Why is
> this happening, specifically on the 0.09b driver?  My solution to the
> problem is selective denial: I skip the error message for status 1.  With
> this "fix" the wireless link is useable, whereas the timeouts under v0.07
> made it unuseable, even if it would do 11 Mb/s until it broke.

Agreed.

> During the file transfer tests, the driver at the Dell end reported
> "eth1: Tx error, status 4 (FID=00BF)", 7 times out of about 11000
> packets.  From hermes.h this seems to be "disconnect".  FID was always
> the same, 00BF.  The errors happened on separate tests, and not on every
> test.

Are these occurring along with resets, or isolated?  If isolated I
suspect this is not a real problem.  I haven't read the spec closely
enough to know for sure, but I suspect occasional disconnects may be
expected behaviour in 802.11.  We should probably try to suppress the
messages somehow, though.

> The third problem is that when a lot of data comes in to the Dell card,
> it invariably gets into a mode where hermes_read_ltv reads a
> configuration or information record, but the card gives it a different
> RID than it asked for.  Possibly it gets the numerically previous RID
> (e.g. ask for OWNMACADDR but get PORTTYPE), but I have the distinct
> impression that the delivered data is the RID previously asked for.  (In
> orinoco_proc_get_hermes_recs the RIDs are read in numerical order.)

I think you're correct; I have seen this behaviour of returning the
RID previously asked for (although IIRC it was on Intersil firmware).

> If the Linksys card is doing the same thing, I might not notice, because
> I'm running a link quality monitor on the laptop but not on the server.

Ok.

> I do have a fix for this (in a separate message), based on the old
> dormitory adage, "flush twice, it's a long way to the kitchen".  If
> hermes_read_ltv gets an unasked RID, if it repeats the seek and read, it
> will almost always get the right one the second time.  Except, on the
> Linksys card, in the midst of a file download if you "grep" for a record
> in /proc/hermes/eth2/recs, the value will usually but not always be for
> a different RID, even with two seek-reads, and several "rid does not
> match type" errors were syslogged despite two seek-reads.  So I'm trying
> up to 8 retries. Fortunately critical uses of hermes_read_ltv only
> appear in determine_firmware and orinoco_init.
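
To make the retry idea concrete, here is a minimal sketch under stated
assumptions.  It is not your patch and not the driver's hermes_read_ltv:
the function-pointer interface, the fake card and the limit of 8
attempts are all hypothetical, chosen only to model "the card returns
the RID previously asked for":

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_RID_READ_TRIES 8  /* assumed bound, after the "up to 8" above */

/* One seek-and-read; reports which RID the card actually delivered. */
typedef int (*read_rid_fn)(void *card, uint16_t rid, uint16_t *got_rid,
                           void *buf, size_t len);

/* Repeat the read until the delivered RID matches the requested one. */
static int read_rid_checked(void *card, read_rid_fn read_once,
                            uint16_t rid, void *buf, size_t len)
{
    for (int attempt = 0; attempt < MAX_RID_READ_TRIES; attempt++) {
        uint16_t got = 0;
        int err = read_once(card, rid, &got, buf, len);
        if (err)
            return err;   /* a real I/O error: give up at once */
        if (got == rid)
            return 0;     /* the record we asked for */
    }
    return -1;            /* still mismatched after all attempts */
}

/* Fake card that always delivers the RID requested on the previous call. */
struct fake_card {
    uint16_t prev_rid;
    int reads;
};

static int fake_read_once(void *card, uint16_t rid, uint16_t *got_rid,
                          void *buf, size_t len)
{
    struct fake_card *c = card;
    (void)buf;
    (void)len;
    *got_rid = c->prev_rid;  /* stale answer */
    c->prev_rid = rid;
    c->reads++;
    return 0;
}
```

Against the fake card, the first read returns the stale RID and the
second one succeeds, which is the "flush twice" behaviour you describe.
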
> 
> I did a "diff" comparing orinoco.c for v0.09b vs v0.07.  Substantive
> changes are these:
> 
>     ieee802_11.h was created.  The layout and constants look unchanged.
> 
>     Alternate_encaps was added and the old encapsulation was removed in
>     0.08b.
> 
>     The bitrate is set in a separate subroutine.  I can't see any
>     substantive difference from the old code, and TXRATECONTROL is set
>     correctly.
> 
>     Only the new version has hermes_write_regn(hw, TXCOMPLFID, DUMMY_FID).
>     Commenting it out had no apparent effect, did not help.

The point of this is so that we can detect Tx completed events even if
the bits in EVSTAT are not set correctly - there is some evidence that
this happens sometimes.
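
One way to picture the idea is below.  This is a simplified model, not
the driver code: the register struct, the function names and the
sentinel value are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DUMMY_FID 0xFFFF  /* assumed sentinel that is never a real FID */

/* Stand-in for the card's register file. */
struct sim_regs {
    uint16_t txcomplfid;
};

/* Park the sentinel in the register so a later change is noticeable. */
static void arm_tx_detect(struct sim_regs *regs)
{
    regs->txcomplfid = DUMMY_FID;
}

/* If the card has overwritten the sentinel, a transmit has completed,
 * whatever the event status bits say.  Returns true and the FID. */
static bool take_completed_fid(struct sim_regs *regs, uint16_t *fid)
{
    if (regs->txcomplfid == DUMMY_FID)
        return false;
    *fid = regs->txcomplfid;
    regs->txcomplfid = DUMMY_FID;  /* re-arm for the next completion */
    return true;
}
```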

>     The new version does this:
> 	hdr.desc.tx_control = cpu_to_le16(HERMES_TXCTRL_TX_EX |
> 						HERMES_TXCTRL_TX_OK);
>     I'm sure the equivalent is also in 0.07 but I missed it in scanning
>     the diff.

No, that's new - this requests interrupts on (successful) transmit
complete and on transmit errors (excessive retries, disconnects etc.).
Before this change, 0.07 could have hit the same "Tx error" conditions
you've seen, but they would never have been reported.

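As a sketch of what that tx_control line amounts to: the two flags are
OR-ed together and stored in little-endian byte order.  The bit values
and names below are assumptions for illustration (check hermes.h for
the real ones), and put_le16 is a userspace stand-in for what
cpu_to_le16 achieves in the kernel:

```c
#include <stdint.h>

/* Assumed bit values; verify against hermes.h. */
#define TXCTRL_TX_OK 0x0002  /* interrupt on successful transmit */
#define TXCTRL_TX_EX 0x0004  /* interrupt on transmit error */

/* Store a 16-bit value little-endian, whatever the host byte order. */
static void put_le16(uint8_t out[2], uint16_t v)
{
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)(v >> 8);
}

/* Control word asking for both completion and error interrupts. */
static uint16_t tx_control_with_events(void)
{
    return TXCTRL_TX_EX | TXCTRL_TX_OK;
}
```
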
>     Only the new version does atomic_inc(&priv->queue_length)

This is some (incomplete) groundwork for a better way of regulating
the waking/stopping of the Tx queue and is currently meaningless.

>     In hermes.c, the new version actually waits for the command register
>     to be not busy, whereas the old version checked once and gave an
>     error message (never seen by me).
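
The "wait rather than check once" pattern is roughly the following.
This is a hedged sketch, not the code in hermes.c: the busy bit value,
the poll limit, the function-pointer interface and the fake register
are all assumptions.

```c
#include <stdint.h>

#define CMD_BUSY           0x8000  /* assumed busy bit in the command register */
#define CMD_BUSY_MAX_POLLS 100     /* arbitrary bound for this sketch */

typedef uint16_t (*read_cmd_fn)(void *card);

/* Poll the command register until the busy bit clears, or give up. */
static int wait_cmd_not_busy(void *card, read_cmd_fn read_cmd)
{
    for (int i = 0; i < CMD_BUSY_MAX_POLLS; i++) {
        if (!(read_cmd(card) & CMD_BUSY))
            return 0;
        /* a real driver would pause briefly between polls */
    }
    return -1;  /* still busy after the last poll */
}

/* Fake register that reports busy for a fixed number of polls. */
struct fake_cmd_reg {
    int busy_polls_left;
};

static uint16_t fake_read_cmd(void *card)
{
    struct fake_cmd_reg *c = card;
    if (c->busy_polls_left > 0) {
        c->busy_polls_left--;
        return CMD_BUSY;
    }
    return 0;
}
```

A card that is busy for a few polls now succeeds instead of producing
an immediate error, while a card that never becomes ready still fails
after a bounded wait.
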
> 
>     In orinoco_cs.c, resetting the card was reconditionalized, and there
>     was a comment in 0.07 about messing up old Lucent firmware.  I
>     commented out that reset (in 0.09b) but it had no apparent effect,
>     did not help.

That shouldn't affect you, AFAIK it only affects *really* old Lucent
firmware (earlier than 4.00).

-- 
David Gibson			| For every complex problem there is a
david at gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.  -- H.L. Mencken
http://www.ozlabs.org/people/dgibson