[clug] Perl (or python) page scraping tools

Tony and Robyn Lewis beakysnugger at yahoo.co.uk
Tue Aug 8 13:35:48 GMT 2006


Michael James wrote:
> Anyone got any recommendations of modules for scraping HTML?

For Perl, the module of choice appears to be WWW::Mechanize: 
http://search.cpan.org/dist/WWW-Mechanize/.  It can handle forms and 
cookies.  I haven't used it myself, as I went with Python (see below).

http://www.perl.com/pub/a/2003/01/22/mechanize.html: a useful 
walkthrough of scraping TV times from the BBC's website
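
To make "forms and cookies" a bit more concrete, here is a rough Python 
sketch using only the standard library (urllib and http.cookiejar).  It 
is not Mechanize's API, just the same idea; the URL and field names are 
made up for illustration, and nothing is sent over the network.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that remembers cookies between requests, and its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_form_post(url, fields):
    """Encode form fields as a POST body, much as a browser does on submit."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=body)


opener, jar = make_session()

# Hypothetical URL and field names, for illustration only.
request = build_form_post(
    "http://example.com/login",
    {"user": "tony", "password": "secret"},
)

# Calling opener.open(request) would submit the form; any cookies the
# server sets would be stored in `jar` and sent back on later requests
# made through the same opener.
```

The point of going through one opener is that the login cookie is 
carried along automatically on every later page fetch.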

For Python, there is BeautifulSoup: 
http://www.crummy.com/software/BeautifulSoup/.  It does not handle forms 
or cookies; it turns a page into a nested structure of all its HTML 
elements.

http://www.crummy.com/software/BeautifulSoup/examples.html: some examples.
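
As a rough picture of what a "nested structure of all the HTML elements" 
looks like, here is a small sketch using only Python's standard 
html.parser.  It is not BeautifulSoup's API or implementation; 
TreeBuilder and find_all are hypothetical names, and the sample HTML is 
made up.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay open.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Build a nested tree of dicts from HTML tags and text."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "[document]", "attrs": {}, "children": []}
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1]["children"].append(text)


def find_all(node, tag):
    """Yield every descendant element with the given tag name."""
    for child in node["children"]:
        if isinstance(child, dict):
            if child["tag"] == tag:
                yield child
            yield from find_all(child, tag)


html = (
    "<html><body><p>Listings: "
    '<a href="/tv">TV</a> <a href="/radio">Radio</a>'
    "</p></body></html>"
)

builder = TreeBuilder()
builder.feed(html)
builder.close()

# Walk the tree and collect the target of every link on the page.
links = [a["attrs"]["href"] for a in find_all(builder.root, "a")]
```

Once the page is a tree, scraping becomes a matter of walking it and 
picking out the tags and attributes you care about.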

HTH,

Tony Lewis


