[clug] Feedback...

steve jenkin sjenkin at canb.auug.org.au
Wed Jan 20 17:57:37 MST 2010

Angus Gratton wrote on 21/01/10 10:27 AM:
> It's an interesting story (and cautionary tale), thanks for sharing it.
> On Wed, Jan 20, 2010 at 2:19 PM, Martin Schwenke <martin at meltin.net> wrote:
>> max_wait is how long approx will wait before it tries to retrieve
>> another copy of the file.  The default value for max wait is 10.
>> That's 10 *seconds*.
> I'd bet the developer responsible for the max_wait default lives
> somewhere like Western Europe or Japan, has some kind of 100Mbps
> internet connection, and is wondering what all the fuss is about. ;).
> I think this kind of thing happens a lot, though. We develop a
> software service at my work, and we've certainly had naive client
> implementations that will send a complicated request that requires
> heavy processing, time out quickly (because most requests are handled
> quickly), and then reconnect and send it again, over and over.
> Defensive design required to prevent a complete DoS.
> - Angus

On the topic of (bad) "network assumptions".
['twas ever thus. And 'twill ever be :-(]

My erratic memory says I was once told of a rogue PC (at Uni of NSW?)
that caused a significant change in the TCP/IP standard - the number of
unacknowledged packets outstanding, or perhaps "slow-start"...

This PC was locally connected at 10/100Mbps and was making a TCP
connection off-campus (FTP?) over a somewhat slower link. It sent
packets as fast as it could - and nailed the link, plus crowded router
buffers etc.

Caused quite a problem... Not unlike Martin's 3Gb waste-of-space.

This must have been on AARNET mk.1, post 1989.

Any of our network boffins know the real story :-) ??


PS: There is a whole class of network problem which I've never heard a
good name for ("cascade fault"?) like Martin's - some 'error recovery'
mechanism kicks in and compounds the fault, often causing accelerating
damage and turning a recoverable/borderline situation into a guaranteed
(massive) failure.

I first saw this at TNT - they were running TCP over X.25. (1995/6 !)
When the links saturated, queues in switch-buffers grew & grew.
TCP packets were forced to queue in both directions - the round-trip
time exceeded the ACK timeout and forced a retransmit (does anyone know
the time values?). Just when the network was on its knees, the load
multiplied (doubled, then more).

The surprisingly small $1M central switch would run out of buffers, do a
'reset' and dump all connections. Problem solved :-)
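The TNT failure mode can be sketched as a toy model: a fixed-capacity link with a growing queue, and a naive sender that retransmits everything whenever queueing delay pushes the round-trip time past a fixed ACK timeout. All the numbers here are made up for illustration, not taken from the actual X.25 network:

```python
# Toy model of retransmit amplification. A burst of traffic builds a
# queue; once queueing delay exceeds the ACK timeout, the naive sender
# sends every packet twice, so the queue keeps growing even though the
# genuine offered load is below link capacity. Numbers are hypothetical.

LINK_RATE = 100   # packets the link can drain per tick
TIMEOUT = 5       # ticks the sender waits before retransmitting
OFFERED = 90      # genuine new packets injected per tick (< LINK_RATE)

def simulate(ticks, burst_at, burst_size):
    queue = 0
    history = []
    for t in range(ticks):
        load = OFFERED
        if t == burst_at:
            load += burst_size          # one transient overload event
        # crude round-trip time: twice the one-way queueing delay
        rtt = 2 * (queue / LINK_RATE)
        if rtt > TIMEOUT:
            load *= 2                   # every packet gets retransmitted
        queue = max(0, queue + load - LINK_RATE)
        history.append(queue)
    return history

h = simulate(ticks=60, burst_at=10, burst_size=600)
# Before the burst the queue is empty; after it, retransmits hold the
# queue above the timeout threshold, so it grows without bound.
print(h[9], h[11], h[-1])
```

Without the retransmit doubling, the same burst would drain at 10 packets per tick and the network would recover on its own - the 'error recovery' is exactly what turns a borderline overload into a guaranteed failure.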

Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin
