help dl all of olympics.com

Simon Burton simonb at webone.com.au
Thu Mar 7 00:50:30 EST 2002


I'm trying a websuck (a full recursive download) of olympics.com;
it's a huge mess of frames, JavaScript, ASP, etc.

$ wget -r -H -nc -R jpg http://olympics.com

is a start: -r makes it recursive, -nc skips files already downloaded,
and we need to span hosts (-H) because akamai.net has some of the
content (?). But there are some hosts we don't want, like apple.com.
I'm not sure how to reject/accept hosts with wget,
but I did work out that -R jpg rejects all the pretty pictures.
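
If wget's own host options (the manual lists -D/--domains and
--exclude-domains, untested on this site) turn out not to be enough,
the same accept/reject logic is easy to sketch in python. This is only
a rough sketch: the host lists and the should_fetch / host_matches names
are made up for illustration, and which hosts really carry content is
still a guess.

```python
from urllib.parse import urlparse

# Hypothetical lists, for illustration only: the real set of hosts
# serving olympics.com content has not been worked out yet.
ACCEPT_HOST_SUFFIXES = ("olympics.com", "akamai.net")
REJECT_HOST_SUFFIXES = ("apple.com",)
REJECT_EXTENSIONS = (".jpg",)


def host_matches(host, suffixes):
    """True if host equals one of the suffixes or is a subdomain of it."""
    return any(host == s or host.endswith("." + s) for s in suffixes)


def should_fetch(url):
    """Decide whether a crawler should follow this URL."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    # An explicit reject wins over an accept.
    if host_matches(host, REJECT_HOST_SUFFIXES):
        return False
    # Anything not on the accept list is skipped.
    if not host_matches(host, ACCEPT_HOST_SUFFIXES):
        return False
    # Look at the path only, so a query string can't hide the extension.
    return not parts.path.lower().endswith(REJECT_EXTENSIONS)
```

A crawler loop would call should_fetch(url) on every link before
downloading it, which covers both the host spanning and the -R jpg part.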

I'm trying to extract results and athlete profiles
for a data-mining project (yes, I'm nuts).
Is anyone interested?
I can keep hacking on this on my own (e.g. with Python),
but if anyone is curious or has ideas, that would be a help.
It's a big project (again). Will I end up manually saving
pages with Mozilla...? Stay tuned...

-simon





