[clug] [SOLUTION] command line Web scraping ASP.net, Webforms & Gridview

steve jenkin sjenkin at canb.auug.org.au
Sat Feb 22 07:24:52 UTC 2020


I've developed a bash / wget solution, posted below for anyone discovering this thread who wants to web-scrape similar Webforms and is having trouble.

Checked against one division; it produces identical output.

Thanks to everyone who offered advice and help - it will make a big difference in my next adventure down these roads.

The biggest hurdle was seeing the FORM / POST data being sent over SSL to the webserver - the tricks & tools suggested for that are generic and way, way better than the GUI tool I found.
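
For anyone wanting the same visibility: one generic approach (my sketch, not necessarily what was suggested in the thread) is the browser dev tools' "Copy as cURL" on the Network tab, which reveals the full POST payload, or an intercepting proxy such as mitmproxy:

	# assumes mitmproxy is installed, the browser's HTTPS proxy is set to
	# localhost:8080 and the mitmproxy CA cert is trusted; FORM/POST bodies
	# then appear decrypted in the console
	mitmproxy --listen-port 8080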

> On 21 Feb 2020, at 18:01, steve jenkin via linux <linux at lists.samba.org> wrote:
> 
> The problem:
> 
> 	A friend wants to analyse some data by Electoral Division, but only has addresses, which includes Postcodes.
> 	He’s done a few thousand records by hand and wanted a better way for the rest.
> 
> 	There’s an AEC “Locality search” that returns postcodes by Electorate (Division),
> 	but while they offer *downloads* for many other SA1, polling place and some SA2’s,
> 	I couldn’t find a download for Postcodes within Divisions.	

==========

The master file used to drive this script has 4 fields per line.
The last field is the number of web pages per Division, each page holding up to 20 rows - counted manually.

steve$  head AEC-divs.tsv

101	Canberra	ACT	3
102	Fenner	ACT	2
103	Banks	NSW	2
104	Barton	NSW	2
105	Bennelong	NSW	2
106	Berowra	NSW	3
107	Blaxland	NSW	2
108	Bradfield	NSW	2
109	Calare	NSW	20
111	Chifley	NSW	2

This file layout creates a shell problem that could be avoided if the Name (2nd field) were the last field - see the sketch below.
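
A quick illustration of why (mine - the re-ordered file is hypothetical): read assigns all leftover words to its last variable, so a trailing Name survives the default IFS unharmed:

	# hypothetical layout with fields re-ordered to: divid, State, n, Name
	while read divid State n Name ; do echo "$Name" ; done < AEC-divs-reordered.tsv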

There are 5 Names that contain a space, which the shell splits into extra fields because the default IFS is "space, tab or newline". There are two more Names with non-alpha chars [shown in the grep below].

Naive and dangerous: a 2nd word in the Division Name breaks the assumption that read fills exactly 4 variables:

	cat AEC-divs.tsv | while read divid Name State n ; do echo $n; done	 # Name, State and 'n' are wrong

This fragment works correctly. It uses a sub-shell '(...)' to limit the scope of the IFS change.

	(IFS=$'\t'; cat AEC-divs.tsv | while read divid Name State n ;do echo $n; done)
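
Another option (my sketch) avoids both the sub-shell and the cat: prefix the IFS assignment to the read builtin itself, so it applies only to that one command:

	while IFS=$'\t' read divid Name State n ; do echo "$n" ; done < AEC-divs.tsv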

steve$ cut -f2 AEC-divs.tsv | grep -Ev -e '^[A-Za-z]+$'

Eden-Monaro
Kingsford Smith
New England
North Sydney
Wide Bay
La Trobe
O'Connor


==========


#!/bin/bash
# get-div	- fetch all HTML 'Locality' pages for a single AEC Division.
#		  HTML files are dropped in the local directory, but won't be overwritten.
#		  Checks for failed page loads ('Temporarily Unavailable', from stale credentials) but keeps going.
#		  Stopping on error and explaining it would be the right thing to do.
# Steve Jenkin Sat 22 Feb 2020 17:33:47 AEDT

# rudimentary & unchecked argument passing.

divid=${1:-109}
Name=${2:-Calare}
n=${3:-20}

# Ideally, should fetch these (unchanging) values from HTML of 1st page
EVENTTARGET='ctl00%24ContentPlaceHolderBody%24gridViewLocalities'
VIEWSTATEGENERATOR='DC0A9FCB'
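# A sketch of fetching VIEWSTATEGENERATOR from the saved first page instead
# (my addition, untested - it reuses the grep/hxpipe pipeline from further
#  down, so it could only run after "$f" has been written):
#   VIEWSTATEGENERATOR=$( grep 'VIEWSTATEGENERATOR"' "$f" | hxpipe - | grep -e 'Avalue CDATA ' | sed -e 's/^Avalue CDATA //' )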

### get first page. Defaults to 'Page 1'; loads VIEWSTATE and EVENTVALIDATION
i=1
f="${divid}_${Name}_${i}.html"
URL="https://electorate.aec.gov.au/LocalitySearchResults.aspx?filter=${Name}&filterby=Electorate&divid=${divid}"
echo $f  >/dev/tty
wget -nc -O "${f}" "${URL}"

for ((strt=2; strt <= n; strt += 5))
do
    end=$(( strt + 4 ))

# must renew credentials after every 5 pages 
# cheating, use last value in "$f"

    # hxpipe is from the W3C html-xml-utils package; 'urlencode' is assumed to be
    # the gridsite-clients binary (or an equivalent shell function).
    VIEWSTATE=$( urlencode "$(grep 'VIEWSTATE"' "$f" | hxpipe - | grep -e 'Avalue CDATA ' | sed -e 's/^Avalue CDATA //' | tr -d '\n')" )
    EVENTVALIDATION=$( urlencode "$(grep 'EVENTVALIDATION"' "$f" | hxpipe - | grep -e 'Avalue CDATA ' | sed -e 's/^Avalue CDATA //' | tr -d '\n')" )

    for ((i = strt; i <= n && i <= end; i++))
    do
	f="${divid}_${Name}_${i}.html"
	URL="https://electorate.aec.gov.au/LocalitySearchResults.aspx?filter=${Name}&filterby=Electorate&divid=${divid}"
	echo $f >/dev/tty
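	# %24 is a URL-encoded '$': the GridView pager expects __EVENTARGUMENT=Page$N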
	EVENTARGUMENT="Page%24${i}"
	wget -nc -O "${f}" --post-data="__EVENTTARGET=${EVENTTARGET}&__EVENTARGUMENT=${EVENTARGUMENT}&__VIEWSTATE=${VIEWSTATE}&__VIEWSTATEGENERATOR=${VIEWSTATEGENERATOR}&__EVENTVALIDATION=${EVENTVALIDATION}&ctl00%24ContentPlaceHolderBody%24searchControl%24searchFilter=&ctl00%24ContentPlaceHolderBody%24searchControl%24dropDownListSearchOptions=0" "${URL}"
    done

    # if any pages match, it's an error; this just names them and keeps going (see sketch below)
    grep -l 'Temporarily Unavailable' ???_*_?.html ???_*_??.html 2>/dev/null | tr '\n' '\0' | xargs -0 echo
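    # A sketch of stopping instead of just reporting (my addition, untested):
    #   if grep -q 'Temporarily Unavailable' "$f" ; then
    #       echo "credentials went stale at ${f}" >&2 ; exit 1
    #   fi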

done

==========

--
Steve Jenkin, IT Systems and Design 
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA

mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin



