[clug] command line Web scraping ASP.net, Webforms & Gridview + "Man in the Middle" SSL sniffer

steve jenkin sjenkin at canb.auug.org.au
Fri Feb 21 07:01:25 UTC 2020

The problem:

	A friend wants to analyse some data by Electoral Division, but only has addresses, which include Postcodes.
	He'd done a few thousand records by hand and wanted a better way for the rest.

	There's an AEC "Locality search" that returns postcodes by Electorate (Division),
	but while the AEC offers *downloads* for many other breakdowns (SA1, polling place, and some SA2 data),
	I couldn't find a download of Postcodes within Divisions.

It took a few minutes to find a list of divisions (Names and 'division id') and throw together a shell script that generated URLs and saved the HTML.
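A minimal sketch of that first pass, using a made-up two-line AEC.divs and a placeholder base URL (the real Locality-search URL and its query parameters differ):

```shell
# hypothetical AEC.divs: "divid<TAB>Name<TAB>State" per line
cat > /tmp/AEC.divs <<'EOF'
101	Canberra	ACT
102	Fenner	ACT
EOF
BASE='https://example.invalid/LocalitySearchResults.aspx'	# placeholder URL
while read -r divid name state
do
	echo "${BASE}?filter=${name}&filterby=Electorate"
done < /tmp/AEC.divs > /tmp/urls.txt
# then fetch each first page once with: wget -nc -i /tmp/urls.txt
```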

'hxpipe' is my tool of preference for looking inside HTML and extracting the table;
'awk' is my first stop for simple text processing and prototyping.
For complex work, I move to other tools.
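For anyone who hasn't used it: hxpipe (from the html-xml-utils package) flattens HTML into one token per line - "(tag" opens an element, ")tag" closes it, "A..." lines carry attributes and "-..." lines carry text. The here-doc below is hand-written sample output for a one-row table, and the awk turns it into one TSV line per row:

```shell
tsv=$(awk '
	/^\(/   { next }			# ignore open tags
	/^\)td/ { printf("\t"); next }		# end of cell -> TAB
	/^\)tr/ { printf("\n"); next }		# end of row  -> newline
	/^-/    { printf("%s", substr($0, 2)) }	# text, minus its "-" marker
' <<'EOF'
(table
(tr
(td
-2600
)td
(td
-CANBERRA
)td
)tr
)table
EOF
)
printf '%s\n' "$tsv"
```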

Done & dusted in under 30 minutes?
Not quite…
I'd only fetched the *first* page (20 postcodes) of each division, and there was no obvious way to step through the rest…

They don’t put the page number in the URL, but use FORM + submit, driven by Javascript objects.

First problem was the AEC presents a fixed 20 postcodes per page, with up to 5 clickable page numbers across the bottom. [far left & far right are ‘previous’ and ’next’ set of 5 pages, indicated with ‘…’]

Some sites download all the data into a javascript array, then show a "page" worth as the user clicks through.
These make it easy to extract all the data:
	 find the array in a saved page.
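When the data does arrive as one big javascript array, extraction really is a one-liner. A made-up example (the saved page and the variable name "postcodes" are both hypothetical):

```shell
cat > /tmp/saved-page.html <<'EOF'
<script>var postcodes = ["2600","2601","2602"];</script>
EOF
# pull the array literal out of the saved page
sed -n 's/.*var postcodes = \(\[[^]]*\]\);.*/\1/p' /tmp/saved-page.html
```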

I turned to the Browser Debugger and javascript console for the webpage and was able to set breakpoints, but failed to capture what the javascript function called by the FORM 'submit' did (it sends an HTTP 'POST' request to the server).

I could never 'step into' the function that was called :(
A javascript object was created that did the submit, but my debugger / javascript skills failed at this point.

It took me a while to discover from the javascript [ __doPostBack() ] that this was a Microsoft 'Webforms' application.
Hosted on Cloudflare, if anyone cares - these database lookups always got 'cache misses'.

I could see a few FORM variables in the saved HTML (including VIEWSTATE & EVENTVALIDATION) that changed sometimes [I hadn't yet understood that each set of credentials covers 5 pages].

From reading further, Microsoft use these two FORM variables as 'credentials' to address cross-site-scripting security issues. There are other variables as well, but they aren't validated.

Interestingly, the application uses the variable in 'VIEWSTATE' to display the Division Name - which suggested I might have had to do this manually after all.
(I don't know Mac Automator well enough - that was another possible solution.)

[Q: if anyone knows the X-11 / KDE / GNOME tool(s) for GUI automation, would they post a short answer?]

The database lookup takes the division / electorate name from the URL, not the saved javascript variable,
so I was able to use one set of credentials to fetch HTML pages for many divisions just by changing the URL.
A lucky break.

Back to the story.

(Stupidly) I tried tcpdump to look at the HTTP 'POST' data - which was SSL-encrypted inside the HTTPS connection :(
ssldump didn't help either - the Diffie-Hellman style key exchange is proof against passive snooping :(

No idea how to extract the private keys from the browser to provide to ssldump.

[Q: anyone know how to do this?]
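One answer I'm fairly sure of: modern Firefox and Chrome will append the per-session TLS secrets to the file named in the SSLKEYLOGFILE environment variable, and Wireshark can use that file to decrypt a capture - which meets the same need without the browser's private key (which wouldn't help against Diffie-Hellman anyway):

```shell
export SSLKEYLOGFILE="$HOME/tls-keys.log"
# firefox &	# the browser must be launched from this shell so it inherits the variable
# ...browse, capture packets with tcpdump, then decrypt the capture with:
# wireshark -o tls.keylog_file:"$HOME/tls-keys.log" capture.pcap
```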

I saw a hint about a Google Chrome extension that can show the plain text, pre-encryption, of HTTP requests, but I didn't follow that path.

[Q: anyone know Browser extensions that display unencrypted HTTPS ‘POST’ requests sent to a server?]

I tried a few terminal-based browsers (links, elinks, lynx); they failed as well.
Either they didn't provide the javascript-driven links, or they refused to fetch other pages, with an error like "invalid URN scheme".

After more wasted hours I changed tack and found what turned out to be a no-cost Proxy for SSL that does "man in the middle" extraction of the plain text:

	'Fiddler' from Telerik, now part of "Progress" Corp. [link below; it now has Mac & Linux clients]
	Not 'Open Source'. Their business model is $1000 for 'support'.

Seems like an act of desperation? Yep, it was.

Although I could see the variables VIEWSTATE & EVENTVALIDATION in the saved HTML,
I never figured out how to extract them & reuse them - which would've obviated the need for the manual SSL capture and manual copy/paste.

Using this approach I always got a 'Temporarily Unavailable' page returned - the 'Webforms' way of saying "Your credentials failed".
I'm not sure if I failed to do the urlencoding, added newlines, or just wasn't seeing the changed credentials.

I tried saving and loading ‘cookies’ for the pages in each division, but that didn’t help.
The scraping worked without ‘cookies’.

Maybe someone else could spend the time and make extracting FORM variables from HTML work…
I’ve downloaded 986 HTML pages and created my list, so don’t currently need to investigate further.
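A sketch of what that extraction might look like against a saved page. The sed pattern assumes the standard Webforms hidden-input markup; the sample page and its VIEWSTATE value are made up:

```shell
cat > /tmp/page.html <<'EOF'
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTM4NzY=" />
EOF
# pull the value attribute out of the hidden input
VIEWSTATE=$(sed -n 's/.*name="__VIEWSTATE"[^>]*value="\([^"]*\)".*/\1/p' /tmp/page.html)

# Base64 uses '+', '/' and '=', so the value must be urlencoded before POSTing.
# Percent-encoding every byte is crude but always legal:
urlencode() {
	printf '%s' "$1" | od -An -tx1 | tr ' ' '\n' |
		awk 'NF { printf("%%%s", toupper($0)) }'
}
```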

I found a Python package, "scrapy", that might understand 'Webforms' and be able to solve this problem quickly & generically.

If there are any Python users who've read this far:

Q: perhaps you could spend 15 minutes and see if there's a simple solution to this particular screen-scraping problem?

I look forward to seeing that answer - if my friend wants more AEC data, I'll want a better solution.

Included are lots of links, copies of scripts and the shell fragments I used to do this ‘web scraping’.

steve jenkin


Example (unanswered) questions on the Interwebs.
Most I read assume javascript is being used for scraping / interacting with the website.

How do I scrape information off ASP.NET websites when paging and JavaScript links are being used?

An unhelpful page… on screenscraping with gridview


The Python package that may already neatly solve this problem




Understanding ASP.NET View State


2019 federal election downloads and statistics

	Votes by SA1 [CSV 45.03MB]

SA1s are the second-smallest ABS 'Statistical Area' (the smallest, now they've gone to the ASGS, is the 'Mesh Block').
There are SA1 to Postcode conversion files for download on the ABS site, but this was getting too hard.

List of Divisions

Locality search page used


A (free) tool that does ‘Man in the Middle’ for SSL / HTTPS on Windows: ‘Fiddler’

the  Mac and Linux version:

It captures & decodes all packets for a network 'stream' using a GUI,
and allowed me to see what was being 'POSTed' to get the next page from AEC Localities.
I tried the "VIEWSTATE" variable (many kB) from that capture in my command-line 'wget' - and it worked!


Shell script to extract the single-table data; example command-line use for the 5 Divisions with a space in their name.

grep -e ' ' AEC.divs |
  while read divid Name N2 State n
  do
	for (( i=1; i <= n; i++ ))
	do
		f="${divid}_${Name} ${N2}_${i}.html"
		echo "$f"
		cat "$f" | sh pcode-extr.sh	# pipe each saved page through the extraction filter below
	done > "pcode-div_${divid}.tsv"
  done


# pcode-extr.sh	 - extract postcodes from HTML, input from STDIN
# Steve Jenkin Fri 21 Feb 2020 13:37:03 AEDT

hxpipe - |
	awk '/^\(table/,/^\)table/ { print }' |
	grep -v -e '^[()]a' -e '^Aclass CDATA headingLink' -e '^Ascope CDATA col' -e '^Ahref ' -e '^Aonclick ' |
	awk '1,/^Aclass CDATA pagingLink/ {print} /^Aclass CDATA pagingLink/ { exit }' |
	awk '
		/^\(table/ || /^-\\r/ || /^\(t[rhd]/ { next; }
		/^\)t[dh]/ { printf("\t"); next; }
		/^\)tr/    { printf("\n"); next; }
		/^-/ {
			sub(/^-/, "");		# strip the leading "-" text marker
			sub(/ /, " ");		# replace a non-breaking space with a plain space
			printf("%s", $0);
			next;
		}'

# leave heading lines in, can remove later:
# grep -v -e '^State'


# for Mac, uses 'pbpaste' to paste contents of 'Paste Buffer', POST data sent in HTTP(S) request
# POST data has no control characters (newlines, tabs), doesn't end with a newline (by default 'echo' adds a newline, must delete or use 'echo -n')
# variables are separated by '&'
# the VIEWSTATE and other variables are doubly encoded:
#	- data is first Base64 encoded
#	- FORM variables are 'urlencoded', '$' becomes %24
#		eg: EVENTARGUMENT="Page%24${i}
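The "next page" links call __doPostBack(target, "Page$<n>"), and in the POST body the '$' travels urlencoded. So for page 7:

```shell
i=7
EVENTARGUMENT="Page%24${i}"	# the server decodes this back to "Page$7"
```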

An example of storing the credentials captured for webpage '25', with links for:
	... 25 26 27 28 29 30 …

# store the POST data as a blob, then split into one line per VARIABLE,
# extract the two variables needed as credentials with grep and load some shell variables,
# later used by wget

n=25				# the page whose POST data was just captured
pbpaste > ${n}
tr '&' '\n' < ${n} > ${n}.txt
cat ${n}.txt

VIEWSTATE=$(grep VIEWSTATE= ${n}.txt | sed -e 's/^__VIEWSTATE=//' | tr -d '\n')

EVENTVALIDATION=$(grep EVENTVALIDATION= ${n}.txt | sed -e 's/^__EVENTVALIDATION=//' | tr -d '\n')



join -t$'\t' -o 2.1,2.2,2.3,2.4 t_26-30 AEC.divs | while read divid Name State n; do
    i=26				# first page covered by this set of captured credentials
    while [[ $i -le $n ]] && [[ $i -le 31 ]]; do
	f="${divid}_${Name}_${i}.html"	# output filename (assumed pattern)
	EVENTARGUMENT="Page%24${i}"	# "Page$<i>", urlencoded
	echo "$f"
	wget -nc -O "${f}" --post-data="__EVENTTARGET=${EVENTTARGET}&__EVENTARGUMENT=${EVENTARGUMENT}&__VIEWSTATE=${VIEWSTATE}&__VIEWSTATEGENERATOR=${VIEWSTATEGENERATOR}&__EVENTVALIDATION=${EVENTVALIDATION}&ctl00%24ContentPlaceHolderBody%24searchControl%24searchFilter=&ctl00%24ContentPlaceHolderBody%24searchControl%24dropDownListSearchOptions=0" "${URL}"
	i=$(( i + 1 ))
    done
done

# check if the credentials copied were valid.
# it proved to be tricky to select the right SSL request in 'Fiddler';
# I would avoid this approach if I were doing it again

grep -l 'Temporarily Unavailable' ???_*_?.html ???_*_??.html  | tr '\n' '\0' | xargs -0 echo


 /Users/steve/bin/urldecode - simple demonstration of ‘url encoding'

# https://github.com/koenrh/shell-scripts/blob/master/urldecode
# Fri 27 Jul 2018 08:32:11 AEST

if (( $# == 0 )) ; then
  input="$(cat /dev/stdin)"
else
  input="$1"
fi

encoded="${input//+/ }"
printf "%b" "${encoded//%/\\x}"
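A usage sketch, wrapping the same two lines in a function (this relies on bash, whose printf '%b' interprets the \xHH escapes):

```shell
urldecode() {
	local input="$(cat)"			# read the encoded string from stdin
	local encoded="${input//+/ }"		# '+' means space in form encoding
	printf '%b' "${encoded//%/\\x}"		# turn %24 into \x24, then expand it
}
echo 'Page%247' | urldecode			# prints: Page$7
```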



Steve Jenkin, IT Systems and Design 
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA

mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin
