[clug] command line Web scraping ASP.net, Webforms & Gridview + "Man in the MIddle" SSL sniffer

Tony Lewis tony at lewistribe.com
Sat Feb 22 04:29:01 UTC 2020


On 21/2/20 6:01 pm, steve jenkin via linux wrote:
> The problem:
>
> 	A friend wants to analyse some data by Electoral Division, but only has addresses, which includes Postcodes.
> 	He’s done a few thousand records by hand and wanted a better way for the rest.
>
> 	There’s an AEC "Locality search” that returns postcodes by Electorate (Division),
> 	but while they offer *downloads* for many other SA1, polling place and some SA2’s,
> 	I couldn’t find a download for Postcodes within Divisions.
> 	
> ==========
>
> ...
>
> [Q: if anyone knows the X-11 / KDE / GNOME tool(s) for GUI automation, would they post a short answer?]

The tools I'd reach for here are:

  * Selenium, to allow you to programmatically control a full,
    javascript-enabled browser
  * BeautifulSoup, to manipulate the HTML
  * Pandas, to extract the tabular HTML data into a DataFrame and
    transform it as needed
  * Python, to write the code that ties all of these together

You use Selenium to drive the browser and navigate to a URL, then use 
BeautifulSoup to parse the page's HTML into a tree you can search.  As 
a one-off, you explore that tree looking for where the 'Next page' 
button is.  You might also need to do the same to find the right HTML 
table that contains the data you want.  
Parsing the HTML in Pandas is sometimes a doddle, sometimes a challenge, 
but either way it's a wondrous experience.  I love Pandas.
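
A rough sketch of that one-off exploration, assuming BeautifulSoup 4 is
installed.  The HTML below is invented for illustration: the table ids,
the column names and the exact shape of the pager links are guesses, and
the real page will differ.  The `__doPostBack` links are the usual way a
WebForms grid renders its pager, which is why a plain HTTP fetch of the
"next" URL doesn't work and a real browser helps.

```python
from bs4 import BeautifulSoup

# Invented sample in the general shape of a paged grid (not the real page).
SAMPLE = """
<table id="layout"><tr><td>menu</td></tr></table>
<table id="results">
  <tr><th>Postcode</th><th>Division</th></tr>
  <tr><td>9001</td><td>Div A</td></tr>
  <tr><td colspan="2">
    <a href="javascript:__doPostBack('grid','Page$2')">2</a>
    <a href="javascript:__doPostBack('grid','Page$Next')">Next</a>
  </td></tr>
</table>
"""


def describe_tables(html):
    """Return (id, header cells) for every table, to spot the data table."""
    soup = BeautifulSoup(html, "html.parser")
    summary = []
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        summary.append((table.get("id"), headers))
    return summary


def find_links(html, text):
    """Return the href of every link whose visible text contains `text`."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        a.get("href")
        for a in soup.find_all("a")
        if text.lower() in a.get_text().lower()
    ]


tables = describe_tables(SAMPLE)
next_links = find_links(SAMPLE, "next")
print(tables)
print(next_links)
```

Once you know which table and which link you're after, you can hard-code
those selectors in the scraping loop.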

Each part of that toolset has some learning curve, but if you're 
familiar with Python and HTML, half the battle is won.

Your pseudocode might look something like:

    navigate to the first page of your results

    extract the HTML table and create a dataframe

    while there is a 'Next page' button:

        simulate clicking it

        extract the HTML table and append to your growing dataframe

    transform, deduplicate your dataframe

    write it out to CSV or Excel
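
In Python, that pseudocode might come out roughly as below.  Treat it as
a sketch under assumptions, not tested against the AEC site: the function
names are mine, the 'Next' link text is a guess you'd confirm by looking
at the page, and a real run will probably need an explicit wait after
each click so the new page has loaded before you read it.  The paging
loop is kept separate from Selenium so it can be tried out with fake
pages first.

```python
import io

import pandas as pd


def first_table(html):
    """Parse the first <table> in an HTML string into a DataFrame."""
    # pd.read_html needs lxml (or bs4 plus html5lib) to be installed.
    return pd.read_html(io.StringIO(html))[0]


def scrape_all_pages(get_html, click_next, parse=first_table):
    """Collect one table per page until click_next() returns False."""
    frames = [parse(get_html())]
    while click_next():
        frames.append(parse(get_html()))
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


def scrape_with_selenium(url, next_link_text="Next"):
    """Drive a real browser through every page of results at `url`."""
    # Imported here so the loop above can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(url)

        def click_next():
            # Assumption: the pager is a link whose text is exactly "Next".
            links = driver.find_elements(By.LINK_TEXT, next_link_text)
            if not links:
                return False
            links[0].click()
            # A real run likely needs a wait here for the postback to finish.
            return True

        return scrape_all_pages(lambda: driver.page_source, click_next)
    finally:
        driver.quit()
```

You'd then call scrape_with_selenium() on the first results page and
finish with something like df.to_csv("postcodes.csv", index=False).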


> No idea how to extract the private keys from the browser to provide to ssldump.
>
> [Q: anyone know how to do this?]

That's a hard way to crack this nut.


> I found a Python object, “scrapy", that might understand ‘Webforms’ and be able to solve this problem quickly & generically.
>
> If there are any Python users who’ve read this far.
>
> Q: perhaps you could spend 15 minutes and see if there’s a simple solution to this particular screen-scraping problem?
>
> Look forward to seeing that answer - if my friend wants more AEC data, I want a better solution.

A bettererer answer would be to see if the AEC publishes an API for 
this.  There's the dataset linked below, and if you click through, it 
has results for the 2019 election, but I didn't dig enough to see if it 
goes down to postcode level: 
https://data.gov.au/dataset/ds-dga-868ab235-f740-44d5-8c43-01e28fc1c017/details?q=

> 2019 federal election downloads and statistics
> <https://www.aec.gov.au/Elections/Federal_Elections/2019/downloads.htm>
>
> 	Votes by SA1 [CSV 45.03MB]
>
> SA1 are the second smallest ABS ’Statistical Area’ (smallest now they’ve done to ASGS is ‘mesh objects’).
> There are SA1 to Postcode conversion files for download on the ABS site, but this was getting too hard.

I'd pursue this, personally.  Is this the GeoPackage data that got too 
hard?  I'd lean towards using that conversion data and seeing if a 
function could be written that converts an SA1 to a postcode.

If there's a direct conversion of SA1 to postcode, then it might be as 
simple as a Pandas 'join' to map them together.
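
Something like the following, as a minimal sketch.  The two small
DataFrames stand in for the real downloads, and every column name, SA1
code, postcode and division name in them is made up; you'd load the real
files with pd.read_csv() and rename columns to suit.  One thing to watch
for: a postcode can straddle more than one division, so the result is a
list of (postcode, division) pairs rather than a strict one-to-one
lookup.

```python
import pandas as pd

# Invented stand-ins for "SA1 -> division" and "SA1 -> postcode" files.
sa1_to_division = pd.DataFrame({
    "sa1": ["10101", "10102", "10203"],
    "division": ["Div A", "Div A", "Div B"],
})
sa1_to_postcode = pd.DataFrame({
    "sa1": ["10101", "10102", "10203", "10203"],
    "postcode": ["9001", "9001", "9002", "9001"],
})


def postcode_divisions(sa1_to_division, sa1_to_postcode):
    """Join on SA1 and return the distinct (postcode, division) pairs."""
    joined = sa1_to_postcode.merge(sa1_to_division, on="sa1", how="inner")
    pairs = joined[["postcode", "division"]].drop_duplicates()
    return pairs.sort_values(["postcode", "division"]).reset_index(drop=True)


lookup = postcode_divisions(sa1_to_division, sa1_to_postcode)
print(lookup)
```

Postcodes are kept as strings here so that any with a leading zero
survive the round trip.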

Tony



