[clug] command line Web scraping ASP.net, Webforms & Gridview + "Man in the MIddle" SSL sniffer

Tony Lewis tony at lewistribe.com
Sat Feb 22 04:29:01 UTC 2020

On 21/2/20 6:01 pm, steve jenkin via linux wrote:
> The problem:
> 	A friend wants to analyse some data by Electoral Division, but only has addresses, which includes Postcodes.
> 	He’s done a few thousand records by hand and wanted a better way for the rest.
> 	There’s an AEC "Locality search” that returns postcodes by Electorate (Division),
> 	but while they offer *downloads* for many other SA1, polling place and some SA2’s,
> 	I couldn’t find a download for Postcodes within Divisions.
> ==========
> ...
> [Q: if anyone knows the X-11 / KDE / GNOME tool(s) for GUI automation, would they post a short answer?]

The tools I'd reach for here are:

  * Selenium, to allow you to programmatically control a full,
    javascript-enabled browser
  * BeautifulSoup, to manipulate the HTML
  * Pandas, to extract the tabular HTML data into a DataFrame and
    transform it as needed
  * Python, to write the code that ties all of these together

You use Selenium to drive the browser and navigate to a URL, then use BS 
to parse that HTML into a DOM.  As a one-off, you explore the DOM 
looking for where the 'Next page' button is.  You might also need to do 
the same to find the right HTML table that contains the data you want.  
Parsing the HTML in Pandas is sometimes a doddle, sometimes a challenge, 
but either way it's a wondrous experience.  I love Pandas.
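That DOM exploration might look something like the sketch below. The HTML here is a toy stand-in for Selenium's driver.page_source, and the id/selector values are made up -- the real AEC page will have its own markup you'd need to inspect:

```python
from bs4 import BeautifulSoup

# Toy page source standing in for driver.page_source; the real
# element ids and structure depend on the AEC page's actual markup.
html = (
    '<div>'
    '<table id="results"><tr><th>Locality</th></tr></table>'
    '<a id="nextPage" href="#">Next page</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Locate the pager control by its visible text, and the data table by id
next_link = soup.find("a", string="Next page")
table = soup.find("table", id="results")
```

Once you know the right selectors, you'd use the same lookups against each live page source to decide whether another page exists.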

Each part of that toolset has some learning curve, but if you're 
familiar with Python and HTML, half the battle is won.

Your pseudocode might look something like:

    navigate to the first page of your results

    extract the HTML table and create a dataframe

    while there is a 'Next page' button:

        simulate clicking it

        extract the HTML table and append to your growing dataframe

    transform and deduplicate your dataframe

    write it out to CSV or Excel
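In Python, the accumulate-and-dedupe part of that loop could be sketched like this. PAGES stands in for the successive driver.page_source values you'd get from Selenium after each simulated click, and the column names are invented for illustration:

```python
from io import StringIO

import pandas as pd

# Stand-ins for driver.page_source on each page; in real use you'd
# click the 'Next page' button between grabbing these.
PAGES = [
    "<table><tr><th>Locality</th><th>Postcode</th></tr>"
    "<tr><td>ACTON</td><td>2601</td></tr></table>",
    "<table><tr><th>Locality</th><th>Postcode</th></tr>"
    "<tr><td>AINSLIE</td><td>2602</td></tr>"
    "<tr><td>ACTON</td><td>2601</td></tr></table>",
]

frames = []
for html in PAGES:
    # read_html returns a list of DataFrames, one per <table> found
    frames.append(pd.read_html(StringIO(html))[0])

# Paginated results often overlap, hence the drop_duplicates
result = pd.concat(frames, ignore_index=True).drop_duplicates()
result.to_csv("postcodes.csv", index=False)
```

The one-off work is finding which table index read_html's list gives you on the real page; after that the loop is mechanical.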

> No idea how to extract the private keys from the browser to provide to ssldump.
> [Q: anyone know how to do this?]

That's a hard way to crack this nut.

> I found a Python object, “scrapy", that might understand ‘Webforms’ and be able to solve this problem quickly & generically.
> If there are any Python users who’ve read this far.
> Q: perhaps you could spend 15 minutes and see if there’s a simple solution to this particular screen-scraping problem?
> Look forward to seeing that answer - if my friend wants more AEC data, I want a better solution.

A bettererer answer would be to see if the AEC publishes an API or bulk 
download for this.  There's this downloads page, and if you click 
through, it has results for the 2019 election, but I didn't dig enough 
to see if it goes down to postcode level: 

> 2019 federal election downloads and statistics
> <https://www.aec.gov.au/Elections/Federal_Elections/2019/downloads.htm>
> 	Votes by SA1 [CSV 45.03MB]
> SA1 are the second smallest ABS ’Statistical Area’ (smallest now they’ve done to ASGS is ‘mesh objects’).
> There are SA1 to Postcode conversion files for download on the ABS site, but this was getting too hard.

I'd pursue this, personally.  Is this the GeoPackage data that got too 
hard?  I'd lean towards using those conversion files and seeing whether 
a function could be written that converts SA1s to postcodes.

If there's a direct conversion of SA1 to postcode, then it might be as 
simple as a Pandas 'join' to map them together.
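If it really is a direct mapping, the join is a one-liner. A minimal sketch, with made-up column names and values standing in for the AEC votes file and the ABS SA1-to-postcode conversion file:

```python
import pandas as pd

# Hypothetical data; the real AEC/ABS files use their own headers.
votes = pd.DataFrame({"SA1": [101, 102, 103],
                      "votes": [50, 60, 70]})
sa1_to_postcode = pd.DataFrame({"SA1": [101, 102, 103],
                                "postcode": [2600, 2600, 2601]})

# Left join keeps every SA1 row, even ones with no postcode match
merged = votes.merge(sa1_to_postcode, on="SA1", how="left")

# Roll the SA1-level counts up to postcode level
by_postcode = merged.groupby("postcode", as_index=False)["votes"].sum()
```

If the real mapping is many-to-many (an SA1 straddling postcodes), you'd need an apportioning step rather than a plain join, which may be where "too hard" came in.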

