[clug] command line Web scraping ASP.net, Webforms & Gridview + "Man in the Middle" SSL sniffer

steve jenkin sjenkin at canb.auug.org.au
Sat Feb 22 04:59:37 UTC 2020


Tony,

very helpful - thanks so much.

cheers
steve

Out-of-order response first - your last point, moved to the top.

>> 2019 federal election downloads and statistics
>> <https://www.aec.gov.au/Elections/Federal_Elections/2019/downloads.htm>
>> 
>> 	Votes by SA1 [CSV 45.03MB]
>> 
>> SA1s are the second-smallest ABS ‘Statistical Area’ (the smallest, now they’ve gone to the ASGS, is ‘Mesh Blocks’).
>> There are SA1 to Postcode conversion files for download on the ABS site, but this was getting too hard.
> 
> I'd pursue this, personally.  Is this the GeoPackage data that got too hard?  I'd lean towards using those conversion files and seeing if a function could be written that converts SA1 to postcode.
> 
> If there's a direct conversion of SA1 to postcode, then it might be as simple as a Pandas 'join' to map them together.
> 
> Tony

I think I’m going to have to pursue this line for a very simple reason:

	redistributions - postcodes are fixed, Electorate boundaries aren’t.

Guess I’ll roll up my sleeves and dig into the world of GIS. But taking a gander at the SA1-4 data, these are unique database keys, so maybe I can escape ‘boundaries’ for a while yet :)
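To get it straight in my own head, here’s a rough, untested sketch of that Pandas join. Every file name and column name (SA1_CODE, POSTCODE, PartyNm, VoteCount) is a guess until I check the real headers in the AEC ‘Votes by SA1’ CSV and the ABS correspondence file:

    import pandas as pd

    # File and column names are guesses - check the real headers first.
    votes = pd.read_csv("aec_votes_by_sa1_2019.csv", dtype={"SA1_CODE": str})
    sa1_to_poa = pd.read_csv("abs_sa1_to_postcode.csv",
                             dtype={"SA1_CODE": str, "POSTCODE": str})

    # Left join: every vote row keeps its SA1 and gains a postcode where one maps
    merged = votes.merge(sa1_to_poa[["SA1_CODE", "POSTCODE"]],
                         on="SA1_CODE", how="left")

    # Roll the SA1-level counts up to postcode level
    by_postcode = (merged
                   .groupby(["POSTCODE", "PartyNm"], as_index=False)["VoteCount"]
                   .sum())

    by_postcode.to_csv("votes_by_postcode.csv", index=False)

One wrinkle I expect: SA1s that straddle a postcode boundary will need the ratio/weight column from the ABS correspondence file, not a straight join.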


> On 22 Feb 2020, at 15:29, Tony Lewis via linux <linux at lists.samba.org> wrote:
> 
> 
>> 
>> [Q: if anyone knows the X-11 / KDE / GNOME tool(s) for GUI automation, would they post a short answer?]
> 
> The tools I'd reach for here are:
> 
> * Selenium, to allow you to programmatically control a full,
>   javascript-enabled browser
> * BeautifulSoup, to manipulate the HTML
> * Pandas, to extract the tabular HTML data into a DataFrame and
>   transform it enough
> * Python, to write the code that ties all of these together
> 
> You use Selenium to drive the browser and navigate to a URL, then use BS to parse that HTML into a DOM.  As a one-off, you explore the DOM looking for where the 'Next page' button is.  You might also need to do the same to find the right HTML table that contains the data you want.  Parsing the HTML in Pandas is sometimes a doddle, sometimes a challenge, but either way it's a wondrous experience.  I love Pandas.
> 
> Each part of that toolset has some learning curve, but if you're familiar with Python and HTML, half the battle is won.
> 
> Your pseudocode might look something like:
> 
>   navigate to the first page of your results
> 
>   extract the HTML table and create a dataframe
> 
>   while there is a 'Next page' button:
> 
>       simulate clicking it
> 
>       extract the HTML table and append to your growing dataframe
> 
>   transform, deduplicate your dataframe
> 
>   write it out to CSV or Excel


Very helpful - thanks for spelling out the tools and steps.
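To check I’ve followed the loop, here’s my rough, untested sketch of it in Python. The URL, the ‘Next’ link locator and the table index are placeholders until I inspect the real GridView pager:

    import time
    from io import StringIO

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    URL = "https://example.invalid/results.aspx"   # placeholder for the real results page

    driver = webdriver.Firefox()                   # or webdriver.Chrome()
    driver.get(URL)

    frames = []
    while True:
        # pandas lifts every <table> out of the page source (needs lxml or
        # html5lib installed); [0] assumes the data sits in the first table
        frames.append(pd.read_html(StringIO(driver.page_source))[0])

        # The pager locator is a guess - find the real link text or id
        # with the browser's inspector
        next_links = driver.find_elements(By.LINK_TEXT, "Next")
        if not next_links:
            break
        next_links[0].click()                      # fires the Webforms __doPostBack
        time.sleep(2)                              # crude wait; WebDriverWait is more robust

    driver.quit()

    result = pd.concat(frames, ignore_index=True).drop_duplicates()
    result.to_csv("aec_results.csv", index=False)

I’ve left BeautifulSoup out because pandas.read_html does the table extraction here; I expect to reach for it if the right table has to be found by id or class first.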


> 
>> No idea how to extract the private keys from the browser to provide to ssldump.
>> 
>> [Q: anyone know how to do this?]
> 
> That's a hard way to crack this nut.

Best advice ever :)


>> I found a Python package, ‘scrapy’, that might understand ‘Webforms’ and be able to solve this problem quickly & generically.
>> 
>> If there are any Python users who’ve read this far:
>> 
>> Q: perhaps you could spend 15 minutes and see if there’s a simple solution to this particular screen-scraping problem?
>> 
>> Look forward to seeing that answer - if my friend wants more AEC data, I want a better solution.
> 
> A bettererer answer would be to see if the AEC publishes an API for this.  There's this, and if you click through, it has results for the 2019 election, but I didn't dig enough to see if it goes down to postcode level: https://data.gov.au/dataset/ds-dga-868ab235-f740-44d5-8c43-01e28fc1c017/details?q=

I couldn’t find an API for postcode data.

I think your final piece of advice [moved to top] aces it… No shortcut to this.
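That said, for anyone who does take up the 15-minute scrapy challenge above, my guess at a starting point is below. The spider’s URL, table id, pager selector and __EVENTTARGET control name are all placeholders:

    import scrapy


    class AecGridviewSpider(scrapy.Spider):
        """Walks an ASP.NET GridView pager by re-posting the page's ViewState."""
        name = "aec_gridview"
        start_urls = ["https://example.invalid/results.aspx"]   # placeholder

        def parse(self, response):
            # Pull the cell text out of the current GridView page;
            # the table id is a guess - check the real markup
            for row in response.css("table#GridView1 tr"):
                cells = row.css("td::text").getall()
                if cells:
                    yield {"cells": cells}

            # Webforms paging is a __doPostBack: from_response re-submits the
            # hidden __VIEWSTATE fields, and we only name the pager event.
            # The control name and 'Page$Next' argument are assumptions.
            if response.css("a#NextPageLink"):                  # placeholder selector
                yield scrapy.FormRequest.from_response(
                    response,
                    formdata={
                        "__EVENTTARGET": "GridView1",
                        "__EVENTARGUMENT": "Page$Next",
                    },
                    callback=self.parse,
                    dont_filter=True,   # same URL every page, so skip the dupe filter
                )

Something like ‘scrapy runspider aec_gridview.py -o results.csv’ should run it, if the guesses above hold.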

cheers
steve


--
Steve Jenkin, IT Systems and Design 
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA

mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin



