help dl all of olympics.com

Richard Cottrill richard_c at tpg.com.au
Thu Mar 7 02:07:47 EST 2002


Given the nature of the beastie, I'd be tempted to use some evil java(script)
to script a browser-based thingy to belt through all of the pages. I
considered the same approach when I tried to rig a web-based poll, but did
some lazy Perl stuff instead. I thought it might be easier to leverage all
of the hard work that's gone into the browser rather than write a rough
javascript interpreter.

I ignored all the complex bits in the end, though, and I didn't manage to rig
the poll (it was the most crass attempt possible). Now Nitocris have
split up and will never top the Hottest 100 :(
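
On the accept/reject hosts question in the quoted message: wget has -D
(--domains) and --exclude-domains for use alongside -H, so a comma-separated
list such as olympics.com,akamai.net may be all that's needed. If the crawl
ends up in Python instead, the same idea is only a few lines. This is a rough
sketch, not tested against the real site; the domain lists and the names
should_fetch and host_matches are made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical lists for illustration; the real set of hosts would come
# from looking at what the site actually links to.
ALLOWED_DOMAINS = ("olympics.com", "akamai.net")
BLOCKED_DOMAINS = ("apple.com",)


def host_matches(host, domain):
    """True if host is the domain itself or a subdomain of it."""
    return host == domain or host.endswith("." + domain)


def should_fetch(url):
    """Decide from the URL's host whether a crawler should follow it."""
    # hostname is None for things like mailto: links, so fall back to "".
    host = (urlsplit(url).hostname or "").lower()
    # The block list wins over the allow list.
    if any(host_matches(host, d) for d in BLOCKED_DOMAINS):
        return False
    return any(host_matches(host, d) for d in ALLOWED_DOMAINS)
```

Matching on a leading dot means notolympics.com is not mistaken for
olympics.com, and anything not explicitly allowed is skipped by default.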

Richard

> -----Original Message-----
> From: linux-admin at lists.samba.org [mailto:linux-admin at lists.samba.org] On
> Behalf Of Simon Burton
> Sent: Wednesday, March 06, 2002 1:51 PM
> To: linux at samba.org
> Subject: help dl all of olympics.com
>
>
> Am trying a websuck on olympics.com,
> it's a huge mess of frames, javascript, asp, etc.
>
> $ wget -r -H -nc -R jpg http://olympics.com
>
> is a start: -r to be recursive,
> and we need to span hosts (-H) because akamai.net has some of the
> content (?), but there are some hosts we don't want, like apple.com.
> Not sure how to reject/accept hosts with wget,
> but I did work out that -R jpg rejects all the pretty pictures.
>
> Am trying to extract results and athlete profiles
> for a data-mining project (yes, I'm nuts).
> Is anyone interested?
> I can keep on hacking this on my own (with e.g. Python),
> but if anyone is curious/has ideas that would be a help.
> It's a big project (again). Will I end up manually saving
> pages with Mozilla...? Stay tuned...
>
> -simon
>
>
>




