update: help dl all of olympics.com

Simon Burton simonb at webone.com.au
Sat Mar 9 15:42:49 EST 2002


hi people

i tried using accept/reject rules with (a recursive)
wget (thanks to jeremy for the help),
but basically wget can't dig the meat
out of all the JavaScript etc. etc.
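
for the record, the host accept/reject i couldn't find last time does
exist: wget's -D (--domains) and --exclude-domains. something like this
is roughly what i mean (the -D list is just a guess):

$ wget -r -H -nc -D olympics.com,akamai.net -R jpg http://olympics.com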

so am now investigating ways of pumping mozilla,
either manually or automatically.

More detail:
1) all moz requests go through (a local) squid
2) after examining the squid logs, it seems all the meat (data) is in
.htm files (good)
3) squid has a URL redirector option (this is how the local ad zapper
works), so
4) use a redirect program to log all the .htm URLs (first sketch below),
possibly
5) parse all the hrefs out of said .htm files (second sketch below) and
6) pipe these back to mozilla somehow, perhaps by
7) sending X key events (and double clicks?) to the address field in moz
(i use python-xlib for sending X events), or
8) hacking moz; i keep hearing about moz scripting (third sketch below)
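
here's roughly what i have in mind for 3) and 4). a sketch only: it
assumes squid's classic redirector protocol (each request arrives on
stdin as "URL client_ip/fqdn ident method", and a blank reply line means
"leave the URL alone"), and the log path is made up. hook it in with a
redirect_program line in squid.conf:

#!/usr/bin/env python
# log_htm.py -- pass-through squid redirector that logs .htm URLs
import sys

log = open('/tmp/htm-urls.log', 'a')        # log location is made up

while 1:
    line = sys.stdin.readline()
    if not line:                            # squid closed the pipe
        break
    fields = line.split()
    if fields and fields[0].endswith('.htm'):
        log.write(fields[0] + '\n')
        log.flush()
    sys.stdout.write('\n')                  # blank reply = don't rewrite
    sys.stdout.flush()                      # squid blocks waiting for this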
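
and for 5), pulling the links back out of the fetched pages. this just
uses the stdlib HTMLParser; i grab frame/iframe src attributes too,
since the site is a mess of frames:

#!/usr/bin/env python
# hrefs.py -- print every link in the .htm files named on the command line
import sys
from html.parser import HTMLParser

class HrefGrabber(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # anchor hrefs, plus frame/iframe srcs (the site is frame-heavy)
        for name, value in attrs:
            if value and ((tag == 'a' and name == 'href') or
                          (tag in ('frame', 'iframe') and name == 'src')):
                print(value)

for fname in sys.argv[1:]:
    parser = HrefGrabber()
    parser.feed(open(fname, errors='replace').read())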
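
and for 6)-8): rather than faking key events into the address field,
mozilla's X remote control might do the job; i suspect this is the "moz
scripting" people mention. assumes your build answers to
mozilla -remote 'openURL(...)'; crude quoting, no error handling:

#!/usr/bin/env python
# feed_moz.py -- push URLs (one per line on stdin) at a running mozilla
import os, sys, time

for line in sys.stdin:
    url = line.strip()
    if not url or "'" in url:       # skip anything that breaks the quoting
        continue
    os.system("mozilla -remote 'openURL(%s)'" % url)
    time.sleep(5)                   # give each page time to pull through squid

then something like  hrefs.py *.htm | feed_moz.py  closes the loop, with
the redirector logging whichever new pages come through (relative hrefs
would need resolving against the page URL first, which i've skipped).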

comments?

-simon


Simon Burton wrote:

> Am trying a websuck on olympics.com;
> it's a huge mess of frames, javascript, asp, etc.
> 
> $ wget -r -H -nc -R jpg http://olympics.com
> 
> is a start, -r to be recursive,
> and we need to span hosts (-H) because akamai.net has some of the
> content (?), but some hosts we don't want like apple.com.
> Not sure how to reject/accept hosts with wget,
> but i did work out -R jpg rejects all the pretty pictures.
> 
> Am trying to extract results and athlete profiles
> for a data-mining project (yes i'm nuts).
> Is anyone interested?
> I can keep on hacking this on my own (with e.g. python),
> but if anyone is curious/has ideas that would be a help.
> It's a big project (again). Will i end up manually saving
> pages with mozilla...? stay tuned...
> 
> -simon