[clug] Perl (or python) page scraping tools
Tony and Robyn Lewis
beakysnugger at yahoo.co.uk
Tue Aug 8 13:35:48 GMT 2006
Michael James wrote:
> Anyone got any recommendations of modules for scraping HTML?
>
Apparently the Perl module of choice is WWW::Mechanize:
http://search.cpan.org/dist/WWW-Mechanize/. It can handle forms and
cookies. I haven't used it myself (I went with Python, see below).
There is a useful walkthrough of scraping TV times from the BBC's
website at http://www.perl.com/pub/a/2003/01/22/mechanize.html.
For Python, there is BeautifulSoup:
http://www.crummy.com/software/BeautifulSoup/. It does not handle forms
or cookies; it only parses a page into a nested structure of all the
HTML elements, which you can then walk or search.
Some examples are at http://www.crummy.com/software/BeautifulSoup/examples.html.
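To give a rough idea of what that nested structure looks like in use, here is a minimal sketch. It assumes the current "beautifulsoup4" package (imported as bs4) rather than the release that was current when this was written, and the HTML snippet, the "listings" id and the scrape_listings() helper are all made up for illustration.

```python
# Assumption: the modern "beautifulsoup4" package is installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

# Hypothetical listings page, inlined so the example needs no network access.
HTML = """
<html><body>
  <table id="listings">
    <tr><td class="time">18:00</td><td class="title">News</td></tr>
    <tr><td class="time">18:30</td><td class="title">Weather</td></tr>
  </table>
</body></html>
"""


def scrape_listings(html):
    """Return a list of (time, title) pairs, one per table row."""
    # "html.parser" is the parser from the standard library, so no extra
    # parser package is needed.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="listings")
    rows = []
    for tr in table.find_all("tr"):
        time_cell = tr.find("td", class_="time")
        title_cell = tr.find("td", class_="title")
        rows.append((time_cell.get_text(strip=True),
                     title_cell.get_text(strip=True)))
    return rows


listings = scrape_listings(HTML)
for time, title in listings:
    print(time, title)
```

Fetching the page itself (and any forms or cookies) has to be done separately; this only covers pulling data out of HTML you already have.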
HTH,
Tony Lewis